openkinome / kinodata

Collection of scripts / notebooks to reliably select datasets
MIT License

What is a kinase? #9

Closed jaimergp closed 3 years ago

jaimergp commented 4 years ago

This PR adds more details to the human-kinases notebook. It relies on human-supervised curation of the existing records using automated methods validated by extensive literature references.

More sources

jaimergp commented 3 years ago

Possible explanation for discrepancies between UniProt queries and other kinase-specific (published) sources:

# Our initial query (wrong): 637 results
https://www.uniprot.org/uniprot/?query=keyword:%22Kinase%20[KW-0418]%22&fil=organism%3A%22Homo+sapiens+%28Human%29+%5B9606%5D%22+AND+reviewed%3Ayes

# Helge's query: 493 results (too conservative)
family:"protein kinase superfamily" AND reviewed:yes AND organism:"Homo sapiens (Human) [9606]"

# Too restrictive (only Ser/Thr and Tyr kinases): 473 results
(keyword:"Tyrosine-protein kinase [KW-0829]" OR keyword:"Serine/threonine-protein kinase [KW-0723]") AND reviewed:yes AND organism:"Homo sapiens (Human) [9606]"

# PIK and friends: 12 extra kinases
https://www.uniprot.org/uniprot/?query=ec:2.7.1.-%20family:%22pi3%20pi4-kinase%20family%22&fil=organism%3A%22Homo+sapiens+%28Human%29+%5B9606%5D%22+AND+reviewed%3Ayes

# How to query for caution messages
https://www.uniprot.org/uniprot/?query=annotation%3A%28type%3Acaution+kinase%29+reviewed%3Ayes+organism%3A%22Homo+sapiens+%28Human%29+%5B9606%5D%22+NOT+name%3Akinase&sort=score

UniProt also lists all entries by family here: https://www.uniprot.org/docs/similar.txt
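To keep the query variants comparable, the shared filters can be factored out and appended to each core clause. This is only a minimal sketch: the dictionary labels and function names are made up for illustration, the base URL is the one used in the links above (it may have changed since), and folding the `fil=` filter into the `query` parameter is an assumption rather than what the notebook does. Nothing is fetched; the code only builds strings.

```python
from urllib.parse import urlencode

# Filters shared by every query listed above.
REVIEWED = "reviewed:yes"
HUMAN = 'organism:"Homo sapiens (Human) [9606]"'

# Core clauses copied from the thread; the keys are hypothetical labels.
CORE_QUERIES = {
    "all_kinases": 'keyword:"Kinase [KW-0418]"',
    "protein_kinase_superfamily": 'family:"protein kinase superfamily"',
    "ser_thr_and_tyr": (
        '(keyword:"Tyrosine-protein kinase [KW-0829]" '
        'OR keyword:"Serine/threonine-protein kinase [KW-0723]")'
    ),
}


def build_query(core: str) -> str:
    """Append the reviewed and human-organism filters to a core clause."""
    return f"{core} AND {REVIEWED} AND {HUMAN}"


def build_url(core: str, base: str = "https://www.uniprot.org/uniprot/") -> str:
    """URL-encode the full query; `base` is the endpoint seen in this thread."""
    params = urlencode({"query": build_query(core)})
    return f"{base}?{params}"


for label, core in CORE_QUERIES.items():
    print(label, build_url(core), sep="\n  ")
```

Keeping the filters in one place makes it harder for two variants to differ by accident in anything other than the core clause, which is the part under discussion here.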

jaimergp commented 3 years ago

@AndreaVolkamer: The last commit fixes the issue of our UniProt queries being too broad. It turns out we were querying for all kinases, not just protein kinases. @hlgvth correctly pointed out that nuance, so the number went down from 700ish to 493, which matches what we observe in other datasets.

That said, some datasets add extra kinases to their lists. KLIFS for example adds phosphatidylinositol kinases, saccharide kinases, and other proteins that bind ATP (a total of 38 more).
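One way to see which entries a dataset adds on top of ours is a plain set comparison of accession IDs. The sketch below uses placeholder IDs and hypothetical variable names; it is not the notebook's code and does not contain real KLIFS or UniProt data.

```python
def compare_kinase_sets(ours, theirs):
    """Split two collections of accession IDs into shared and exclusive parts."""
    ours, theirs = set(ours), set(theirs)
    return {
        "shared": sorted(ours & theirs),
        "only_ours": sorted(ours - theirs),
        "only_theirs": sorted(theirs - ours),
    }


# Placeholder accession IDs, for illustration only.
uniprot_ids = ["ACC_A", "ACC_B", "ACC_C"]
klifs_ids = ["ACC_B", "ACC_C", "ACC_D", "ACC_E"]

report = compare_kinase_sets(uniprot_ids, klifs_ids)
for key, ids in report.items():
    print(f"{key}: {len(ids)} -> {ids}")
```

With real data, the `only_theirs` bucket would be where the phosphatidylinositol kinases, saccharide kinases and other ATP-binding proteins show up.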

jaimergp commented 3 years ago

Cleaned activities for kinases in ChEMBL 27 (preview before new data release).

activities-chembl27_2.7z.zip

jaimergp commented 3 years ago

Data for the new ChEMBL 28. activities-chembl28_2.zip

The dataset grew from 218101 / 174649 to 237830 / 187387 (raw / curated), so yay, about 12.7K extra curated data points (19.7K raw)!
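The growth figures can be checked with a few lines. The counts are the ones quoted above; the variable names are only illustrative.

```python
# Raw / curated activity counts quoted in this comment.
counts = {
    "ChEMBL 27": {"raw": 218101, "curated": 174649},
    "ChEMBL 28": {"raw": 237830, "curated": 187387},
}

old, new = counts["ChEMBL 27"], counts["ChEMBL 28"]

# Absolute growth between the two releases.
growth = {kind: new[kind] - old[kind] for kind in ("raw", "curated")}

# Fraction of raw records that survive curation in each release.
retained = {release: c["curated"] / c["raw"] for release, c in counts.items()}

print(growth)
for release, fraction in retained.items():
    print(f"{release}: {fraction:.1%} of raw records kept after curation")
```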

review-notebook-app[bot] commented 3 years ago

View / edit / reply to this conversation on ReviewNB

schallerdavid commented on 2021-02-22T11:51:11Z

This website might be down. I can't access it or retrieve any data.


review-notebook-app[bot] commented 3 years ago

schallerdavid commented on 2021-02-22T11:51:12Z

our_uniprot only contains around 500 entries too


jaimergp commented 3 years ago

Thanks @schallerdavid! Merging now and cutting a new release.