monarch-initiative / monarch-app

Monarch Initiative website and API
https://monarchinitiative.org/
BSD 3-Clause "New" or "Revised" License

Dangling edges found for STRING. #722

Open RichardBruskiewich opened 1 year ago

RichardBruskiewich commented 1 year ago

Dangling edges found for STRING. See https://monarch-initiative.github.io/monarch-qc/ for the report; https://data.monarchinitiative.org/monarch-kg-dev/ for the data.

Find out why and suggest a repair.

RichardBruskiewich commented 1 year ago

OK... preliminary investigation of this particular anomaly turns up something truly odd (but no clear idea why...).

If the original_subject and original_object NCBIGene identifiers from the STRING ingest's "dangling edges QC file" are searched for in ncbi_gene_nodes.tsv, the output file of the NCBI ingest, one comes up empty-handed: no matches!

What is a little bit odd about this is that those same identifiers seem to be found directly by searching the NCBI "Gene" web site itself.

For example, as of this week (dangling edges file downloaded circa May 10, 2023), the first line of the file asserts

NCBIGene:161003 biolink:interacts_with  NCBIGene:41

with

NCBIGene:161003 and NCBIGene:41 found in the NCBI 'Gene' database.

Other edges show a similar anomaly.

The NCBI gene ingest is very simple in character but it does have a taxonomic filter which could be removing entries.

I guess a closer review of the NCBI Gene input file and how it is being filtered out by the NCBI ingest script, relative to what identifiers are encountered in the STRING ingest, should shed some light on the situation.
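The membership check described above can be sketched in a few lines of Python. This is only an illustration, not the QC pipeline's code: it assumes the nodes file has a KGX-style `id` column, the function names are made up for this example, and the node row `NCBIGene:100` is a placeholder rather than real ingest output.

```python
import csv
import io


def load_node_ids(nodes_tsv):
    """Collect the 'id' column from a nodes TSV (assumed KGX-style header)."""
    reader = csv.DictReader(nodes_tsv, delimiter="\t")
    return {row["id"] for row in reader}


def unmatched_edge_ids(edges_tsv, node_ids):
    """Return identifiers used by edges that are absent from node_ids."""
    reader = csv.DictReader(edges_tsv, delimiter="\t")
    missing = set()
    for row in reader:
        for column in ("original_subject", "original_object"):
            identifier = row.get(column)
            if identifier and identifier not in node_ids:
                missing.add(identifier)
    return missing


# Tiny in-memory stand-ins for ncbi_gene_nodes.tsv and the dangling edges file.
nodes = io.StringIO("id\tcategory\nNCBIGene:100\tbiolink:Gene\n")  # placeholder row
edges = io.StringIO(
    "original_subject\tpredicate\toriginal_object\n"
    "NCBIGene:161003\tbiolink:interacts_with\tNCBIGene:41\n"
    "NCBIGene:100\tbiolink:interacts_with\tNCBIGene:41\n"
)

missing = unmatched_edge_ids(edges, load_node_ids(nodes))
print(sorted(missing))
```

Run against the real files, an empty result would mean every edge identifier has a node; a large result points at a gap in the node ingest rather than in STRING itself.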

kevinschaper commented 1 year ago

All of this detective work would be so much easier with taxon info on subjects and objects in the edge files.
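As a rough sketch of what that could look like: the column names `subject_taxon` and `object_taxon` and the helper `add_taxa` are hypothetical, and the gene-to-taxon map below is a hand-written illustration, not data from any ingest.

```python
def add_taxa(edges, gene_to_taxon):
    """Return copies of the edge dicts with taxon columns attached.

    Identifiers that have no entry in gene_to_taxon get None, which makes
    the unmapped ones easy to spot.
    """
    annotated = []
    for edge in edges:
        row = dict(edge)
        row["subject_taxon"] = gene_to_taxon.get(edge["subject"])  # hypothetical column
        row["object_taxon"] = gene_to_taxon.get(edge["object"])    # hypothetical column
        annotated.append(row)
    return annotated


# Illustrative mapping only: one identifier mapped to the human taxon, one left out.
gene_to_taxon = {"NCBIGene:41": "NCBITaxon:9606"}
edges = [
    {
        "subject": "NCBIGene:161003",
        "predicate": "biolink:interacts_with",
        "object": "NCBIGene:41",
    }
]

annotated = add_taxa(edges, gene_to_taxon)
print(annotated[0]["subject_taxon"], annotated[0]["object_taxon"])
```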

RichardBruskiewich commented 1 year ago

The NCBI Gene hits (in the dangling edges file) all seem to have HGNC identifiers suggesting that they are of human origin.

The analysis doesn't explain all of the dangling edges: there is a block of edges at the end of the file which are missing their original_subject and original_object NCBIGene identifiers. I'll have to check why those are getting through (that's probably an ingest bug).

RichardBruskiewich commented 1 year ago

I guess the NCBIGene ingest only imports gene identifiers for a select subset of species: dog, cattle, pig and chicken... no wonder the STRING output identifiers don't all match to that ingest data!
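To make the effect of such a filter concrete, here is a minimal sketch. It is not the ingest's actual code: the function name is invented, the taxon IDs are the NCBI Taxonomy IDs I believe correspond to dog, cattle, pig and chicken, and the sample rows are made up for the example.

```python
# Assumed NCBI Taxonomy IDs: dog, cattle, pig, chicken.
ALLOWED_TAXA = {"9615", "9913", "9823", "9031"}


def filter_gene_info(lines, allowed_taxa=ALLOWED_TAXA):
    """Return NCBIGene CURIEs for gene_info-style rows whose tax_id is allowed.

    Only the first two tab-separated columns (tax_id, GeneID) are used.
    """
    kept = set()
    for line in lines:
        if line.startswith("#"):  # header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue
        tax_id, gene_id = fields[0], fields[1]
        if tax_id in allowed_taxa:
            kept.add(f"NCBIGene:{gene_id}")
    return kept


# Made-up sample rows: one human gene, one hypothetical dog gene.
sample = [
    "#tax_id\tGeneID\tSymbol\n",
    "9606\t41\tHUMAN_PLACEHOLDER\n",
    "9615\t999999\tDOG_PLACEHOLDER\n",
]

kept = filter_gene_info(sample)
print(kept)
```

Any human (or fungal, or other) gene that STRING refers to is dropped by a filter of this shape, which would leave the corresponding edges dangling.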

Rather, we need to look further afield to the general gene mapping ingests we make elsewhere. Thus, we could pose the question: Is the SSSOM data from the Monarch Gene Mappings complete?

RichardBruskiewich commented 1 year ago

@kevinschaper made good progress in removing many of the previously observed dangling edges; however, the job is not done (as I write, 12 June 2023, roughly 20% are still missing).

RichardBruskiewich commented 11 months ago

@kevinschaper, another quick exploration attempted this morning (19 July 2023)...

Using a newly written script, analyse_dangling_edges.py, in the monarch-gene-mapping repository, I extracted the NCBIGene identifiers that relate to the STRING ingest and are recorded in the monarch-kg-dangling-edges.tsv.gz file.

I then searched (by simple grep) for one (the first) of these identifiers, NCBIGene:2872795, within the NCBI gene_info.gz file. The entry that matched was:

$ gunzip -c gene_info.gz |grep -P "\s2872795\s"
227321  2872795 ANIA_04997      ANIA_04997      AN4997.2|AN4997.4       -       III     -       hypothetical protein    protein-coding  -       -       -       hypothetical protein    20230413        -

which is obviously a gene prediction from the original genomic sequence assembly of Aspergillus nidulans (now Emericella nidulans (strain FGSC A4 / ATCC 38163 / CBS 112.46 / NRRL 194 / M139)). Looking up this specific identifier in UniProtKB gives the entry Q5B383.
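The grep above matches the number wherever it appears between whitespace. A column-aware lookup that compares only the GeneID column is a little stricter. The sketch below assumes the standard gene_info layout (tax_id first, GeneID second, `#` header line); the function name and the three-column sample file are made up for the example.

```python
import gzip
import os
import tempfile


def find_gene(path, gene_id):
    """Return the tab-split row whose GeneID column equals gene_id, or None."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.startswith("#"):  # header line
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) > 1 and fields[1] == str(gene_id):
                return fields
    return None


# Demonstration on a tiny gzip file instead of the full gene_info.gz.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "gene_info_sample.gz")
    with gzip.open(sample, "wt", encoding="utf-8") as out:
        out.write("#tax_id\tGeneID\tSymbol\n")
        out.write("227321\t2872795\tANIA_04997\n")
    hit = find_gene(sample, 2872795)
    miss = find_gene(sample, 795)  # a substring of the ID must not match

print(hit, miss)
```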

The rich UniProtKB entry is very suggestive of a real protein - Phosphatidylinositol transporter (Eurofung) - with functional evidence inferred from various directions (albeit, not totally clear how much of this is experimental data).

Once again, I'm wondering if a way forward with respect to resolving unmapped STRING identifiers might be to take those identifiers and search directly against the UniProtKB to pull out what annotation we need to properly ingest the corresponding Monarch nodes?

For completeness here, though, I need to go back to the dangling edges file entries that appeared to have the above NCBIGene identifier:

$ gunzip -c monarch-kg-dangling-edges.tsv.gz |grep "NCBIGene:2872795"
id      original_subject        predicate       original_object category        aggregator_knowledge_source     has_evidence    primary_knowledge_source        provided_by     publications    frequency_qualifier     negated onset_qualifier sex_qualifier   qualifiers      evidence        relation        stage_qualifier subject object
uuid:dbe48109-2136-11ee-873a-cd90a19c4085               biolink:interacts_with          biolink:PairwiseGeneToGeneInteraction   infores:monarchinitiative               infores:string  string_protein_links_edges                                                                              NCBIGene:2873697       NCBIGene:2872795
uuid:dcbb11c6-2136-11ee-873a-cd90a19c4085               biolink:interacts_with          biolink:PairwiseGeneToGeneInteraction   infores:monarchinitiative               infores:string  string_protein_links_edges                                                                              NCBIGene:2873254       NCBIGene:2872795
uuid:dd96f2c7-2136-11ee-873a-cd90a19c4085               biolink:interacts_with          biolink:PairwiseGeneToGeneInteraction   infores:monarchinitiative               infores:string  string_protein_links_edges                                                                              NCBIGene:2872795       NCBIGene:2873911
uuid:dd96f2c8-2136-11ee-873a-cd90a19c4085               biolink:interacts_with          biolink:PairwiseGeneToGeneInteraction   infores:monarchinitiative               infores:string  string_protein_links_edges                                                                              NCBIGene:2872795       NCBIGene:2873254
uuid:dd96f2c9-2136-11ee-873a-cd90a19c4085               biolink:interacts_with          biolink:PairwiseGeneToGeneInteraction   infores:monarchinitiative               infores:string  string_protein_links_edges                                                                              NCBIGene:2872795       NCBIGene:2873697
uuid:dd96f2ca-2136-11ee-873a-cd90a19c4085               biolink:interacts_with          biolink:PairwiseGeneToGeneInteraction   infores:monarchinitiative               infores:string  string_protein_links_edges                                                                              NCBIGene:2872795       NCBIGene:2870366
uuid:e2ec57f6-2136-11ee-873a-cd90a19c4085               biolink:interacts_with          biolink:PairwiseGeneToGeneInteraction   infores:monarchinitiative               infores:string  string_protein_links_edges                                                                              NCBIGene:2870366       NCBIGene:2872795
uuid:e7a5813b-2136-11ee-873a-cd90a19c4085               biolink:interacts_with          biolink:PairwiseGeneToGeneInteraction   infores:monarchinitiative               infores:string  string_protein_links_edges                                                                              NCBIGene:2873911       NCBIGene:2872795

since I didn't (yet) look at the companion identifiers in the given dangling edges. There is some duplication in these entries. The unique set of companion identifiers is the following:

NCBIGene:2873697
NCBIGene:2873254
NCBIGene:2873911
NCBIGene:2870366
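For the record, the de-duplication can be reproduced with a short sketch like the one below. The helper name is invented; the (subject, object) pairs are copied from the grep output above.

```python
def companions(edges, target):
    """List the identifiers paired with target, in first-seen order, without repeats."""
    seen = []
    for subject, obj in edges:
        if target not in (subject, obj):
            continue
        other = obj if subject == target else subject
        if other != target and other not in seen:
            seen.append(other)
    return seen


# (subject, object) pairs taken from the dangling edges listed above.
edges = [
    ("NCBIGene:2873697", "NCBIGene:2872795"),
    ("NCBIGene:2873254", "NCBIGene:2872795"),
    ("NCBIGene:2872795", "NCBIGene:2873911"),
    ("NCBIGene:2872795", "NCBIGene:2873254"),
    ("NCBIGene:2872795", "NCBIGene:2873697"),
    ("NCBIGene:2872795", "NCBIGene:2870366"),
    ("NCBIGene:2870366", "NCBIGene:2872795"),
    ("NCBIGene:2873911", "NCBIGene:2872795"),
]

unique_companions = companions(edges, "NCBIGene:2872795")
print("\n".join(unique_companions))
```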

Searching the NCBI Gene Info archive file indicates, though, that those counterpart NCBI identifiers are themselves predicted genes from the original genomic sequence assembly of Aspergillus nidulans, albeit with some interesting variability in annotation:

227321  2873697 ANIA_04278      ANIA_04278      AN4278.2|AN4278.4       -       II      -       1-phosphatidylinositol 4-kinase STT4    protein-coding  -       -       -       1-phosphatidylinositol 4-kinase STT4    20230716        -
227321  2873254 ANIA_03841      ANIA_03841      AN3841.2|AN3841.4       -       II      -       hypothetical protein    protein-coding  -       -       -       hypothetical protein    20230413        -
227321  2873911 ANIA_02877      ANIA_02877      AN2877.2|AN2877.4       -       VI      -       hypothetical protein    protein-coding  -       -       -       hypothetical protein    20230413        -
227321  2870366 ANIA_06709      ANIA_06709      AN6709.2|AN6709.4       -       I       -       Arf family guanine nucleotide exchange factor SEC7      protein-coding  -       

RichardBruskiewich commented 5 months ago

The STRING QC stats are looking better as of this date (23 Jan 2024). You could continue to iterate on the above ideas on a case-by-case basis, to see if the other dangling edges can be accounted for.

monicacecilia commented 1 month ago

We're at about 80% fix. ~200K we are not getting. Is there something we need in that set? ... E.g., Bgee was using older ENSEMBL IDs instead of current ones, and this was causing trouble. We need an investigator to take this on. BUT it is not urgent and should not consume anyone's entire time.

Dear @madanucd -- do you happen to have any bandwidth?

Else, this goes on the icebox.

sagehrke commented 1 month ago

Related to #726

madanucd commented 1 month ago

Sure, I can look into it.