google / patents-public-data

Patent analysis using the Google Patents Public Datasets on BigQuery
https://bigquery.cloud.google.com/dataset/patents-public-data:patents
Apache License 2.0
539 stars 163 forks source link

Linking proteins and humangenes annotation preferred name to identifier #54

Open ibarshai opened 2 years ago

ibarshai commented 2 years ago

I'm looking at annotations in the proteins and humangenes domain in the google_patent_research.annotations table. It appears that the annotations themselves are normalized to their preferred name, but I was wondering if there is any way to link the preferred name of the gene annotation to some kind of unique identifier, such as HGNC, that can be used to ground these annotations?

I see that there is huge amount of information in the ebi_chembl section, with seemingly promising table names, but haven't spotted a useful connection by looking through the schemas.

wetherbeei commented 2 years ago

You can link via the OCID to tables in sciwalker-open-data.data:ontologies_2020r02 tables - @c-ruttkies may be able to help join from there to another identifier.

dhimmel commented 2 years ago

Thanks @wetherbeei for the assistance!

sciwalker-open-data:ontologies_2020r02:human_genes_preflabel looks promising in that it maps preferred labels to OCID identifiers:

Row ocid name date
1 102100007495 AR 2021-02-13  
2 102100018113 C2 2021-02-13  
3 102100012590 C3 2021-02-13  
4 102100007517 C5 2021-02-13  
5 102100007690 C6 2021-02-13

However, I am not sure what is an OCID and who mints them. But I think they are "Ontology Concept Identifiers" minted by SciWalker since https://bioregistry.io/ocid:102100007495 redirects to https://www.sciwalker.com/sciwalker/faces/ociddata.xhtml?ocid=102100007495.

Any advice @c-ruttkies on how to map OCID to other gene nomenclatures, such as ncbigene, ensembl, HGNC, etcetera would be much appreciated!

wetherbeei commented 2 years ago

The OntoChem team has added this table to map between OCIDs and other databases: sciwalker-open-data:ontologies_2020r02.ocid_xref

SELECT DISTINCT database FROM sciwalker-open-data.ontologies_2020r02.ocid_xref

1 | RefSeq |   2 | PDBeChem |   3 | MACROCYCL |   4 | OMIM |   5 | ORDO |   6 | ICD-O |   7 | ICD9CM_2006 |   8 | EPPO |   9 | physiology |   10 | GO |   11 | FEMA |   12 | NCI2004_11_17 |   13 | event |   14 | OC_REGIONS |   15 | InterPro |   16 | TC |   17 | HMDB |   18 | SUBMITTER |   19 | sn |   20 | ICD10CM |   21 | species |   22 | Beilstein |   23 | NCBITaxon |   24 | OrganicChemistryPortal |   25 | PMID |   26 | ChEMBL |   27 | UM-BBD_enzymeID |   28 | GARD |   29 | MESH |   30 | stedman |   31 | GC_ID |   32 | chemistry |   33 | Uniprot |   34 | GeneTree |   35 | DRUGBANK |   36 | Reactome |   37 | ICD-10 |   38 | KEGG |   39 | GeneID |   40 | GOC |   41 | MetaCyc |   42 | PubChem |   43 | SABIO-RK |   44 | Wikipedia |   45 | CiteXplore |   46 | OCID |   47 | EFO |   48 | tui |   49 | UMLS_CUI |   50 | EC |   51 | UM-BBD_reactionID |   52 | PermId |   53 | ls |   54 | CHEBI |   55 | RESID |   56 | NCI |   57 | SNOMECT |   58 | ChemIDplus |   59 | RHEA |   60 | ENMBRS |   61 | WIKICHEM |   62 | MTH |   63 | Reaxys |   64 | SNOMEDCT_US_2018_03_01 |   65 | SNOMEDCT_US_2019_03_01 |   66 | MeSH |   67 | UniPathway |   68 | DO |   69 | DERMO |   70 | ICD9CM |   71 | CSP2005 |   72 | SNOMEDCT_US_2018_09_01 |   73 | PDB |   74 | HGNC |   75 | Ensembl |   76 | WIKIDRUGS |   77 | JA |   78 | MTHICD9_2006 |   79 | Alan Wood's Pesticides |   80 | SNOMED_CT_US_2018_03_01 |   81 | anatomy |   82 | relations |

dhimmel commented 2 years ago

Thanks @wetherbeei and @c-ruttkies!

Here's a query showing which external databases human_genes_preflabel maps to via the ocids:

SELECT
  database, COUNT(*) AS n_mappings
FROM
  `sciwalker-open-data.ontologies_2020r02.human_genes_preflabel` AS human_genes_preflabel
INNER JOIN
  `sciwalker-open-data.ontologies_2020r02.ocid_xref` AS ocid_xref
ON
  human_genes_preflabel.ocid = ocid_xref.ocid
GROUP BY database
ORDER BY n_mappings DESC

Great to see so many external vocabularies, particularly Ensembl, GeneID, HGNC.

Row database n_mappings
1 Uniprot 225032  
2 InterPro 87289  
3 Ensembl 68954  
4 PDB 60276  
5 RefSeq 55689  
6 GO 28481  
7 HGNC 20319  
8 KEGG 19431  
9 GeneID 19174  
10 GeneTree 18969  
11 Reactome 18491  
12 EC 6039  
13 MetaCyc 1187  
14 RHEA 628  
15 UM-BBD_reactionID 33  
16 Wikipedia 32  
17 UM-BBD_enzymeID 16  
18 RESID 9  
19 TC 5  
20 GOC 1

@ibarshai did you want to go ahead and close this issue unless we have any other questions?

dhimmel commented 2 years ago

And here is an example query to map gene preferred labels to ensembl gene IDs that returns 21362 results:

SELECT
  DISTINCT human_genes_preflabel.name AS gene_preflabel,
  ocid_xref.database_id AS ensembl_id
FROM
  `sciwalker-open-data.ontologies_2020r02.human_genes_preflabel` AS human_genes_preflabel
INNER JOIN
  `sciwalker-open-data.ontologies_2020r02.ocid_xref` AS ocid_xref
USING
  (ocid)
WHERE
  ocid_xref.database = "Ensembl"
  # exclude ensembl protein mappings
  AND ocid_xref.database_id LIKE "ENSG%"
ORDER BY
  gene_preflabel,
  ensembl_id
Row gene_preflabel ensembl_id
1 A1BG ENSG00000121410  
2 A1CF ENSG00000148584  
3 A2M ENSG00000175899  
4 A2ML1 ENSG00000166535  
5 A3GALT2 ENSG00000184389  
6 A4GALT ENSG00000128274
dhimmel commented 2 years ago

Note that to map to patent annotations, it's not necessary to go through sciwalker-open-data.ontologies_2020r02.human_genes_preflabel, since ocid is available in patents-public-data.google_patents_research.annotations:

SELECT
  DISTINCT ocid_xref.database_id AS ensembl_gene_id,
  ocid,
  patent_annotations.preferred_name AS patent_annotation_preferred_name,
  human_genes_preflabel.name AS gene_preflabel,
  patent_annotations.publication_number,
  patent_annotations.domain,
  --   patent_annotations.source,
  --   patent_annotations.confidence,
  --   patent_annotations.character_offset_start,
  --   patent_annotations.character_offset_end
FROM
  `patents-public-data.google_patents_research.annotations` AS patent_annotations
INNER JOIN
  `sciwalker-open-data.ontologies_2020r02.ocid_xref` AS ocid_xref
USING
  (ocid)
LEFT JOIN
  `sciwalker-open-data.ontologies_2020r02.human_genes_preflabel` AS human_genes_preflabel
USING
  (ocid)
WHERE
  ocid_xref.database = "Ensembl"
  # exclude ensembl protein mappings
  AND ocid_xref.database_id LIKE "ENSG%"
  # restrict to single patent publication for development
  AND patent_annotations.publication_number = "JP-3820105-B2"
LIMIT
  1000
Row ensembl_gene_id ocid patent_annotation_preferred_name gene_preflabel publication_number domain  
1 ENSG00000134853 102100004940 Platelet-derived growth factor receptor alpha PDGFRA JP-3820105-B2 humangenes  
2 ENSG00000111537 102100016020 Interferon gamma IFNG JP-3820105-B2 humangenes  
3 ENSG00000115008 102100018955 Interleukin-1 alpha IL1A JP-3820105-B2 humangenes  
4 ENSG00000130427 101000292394 human Erythropoietin null JP-3820105-B2 proteins  
5 ENSG00000174059 102100016492 Hematopoietic progenitor cell antigen CD34 CD34 JP-3820105-B2 humangenes  
6 ENSG00000025708 102100008604 Thymidine phosphorylase TYMP JP-3820105-B2 humangenes  
7 ENSG00000138798 102100010813 Pro-epidermal growth factor EGF JP-3820105-B2 humangenes  
8 ENSG00000102265 102100002607 Erythroid-potentiating activity TIMP1 JP-3820105-B2 humangenes  
9 ENSG00000214274 102100018024 Angiogenin ANG JP-3820105-B2 humangenes  
10 ENSG00000134531 102100010285 Epithelial membrane protein 1 EMP1 JP-3820105-B2 humangenes  
11 ENSG00000109270 101000411504 human Ragulator complex protein LAMTOR3 null JP-3820105-B2 proteins
ravwojdyla commented 2 years ago

👋 @wetherbeei noticed the 2022 release of the "ontologies" at sciwalker-open-data:ontologies_2022r01, would it be possible to publish ocid_xref for 2022 as well please?

ravwojdyla commented 2 years ago

Also ping @c-ruttkies, would appreciate your help as well please ^

ravwojdyla commented 2 years ago

FYI ocid_xref for 2022 was published, thanks OntoChem/Sciwalker. This issue can probably be closed btw.