Open RichardBruskiewich opened 1 year ago
OK... preliminary investigation of this particular anomaly turns up something truly odd (but no clear idea why...).
If the original_subject and original_object NCBIGene identifiers, from the "dangling edges QC file" from the STRING ingest, are searched for in the ncbi_gene_nodes.tsv output file from the NCBI ingest, one comes up empty-handed: no matches!
What is a little bit odd about this is that those same identifiers seem to be found directly by searching the NCBI "Gene" web site itself.
For example, as of this week (dangling edges file downloaded circa May 10, 2023), the first line of the file asserts
NCBIGene:161003 biolink:interacts_with NCBIGene:41
with both NCBIGene:161003 and NCBIGene:41 found in the NCBI 'Gene' database.
Other edges show a similar anomaly.
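The check described above can be sketched in Python. This is only a minimal sketch: it assumes both files are tab-separated with a header row, that the nodes file has an `id` column, and that the dangling edges file carries the identifiers in `original_subject` and `original_object`. The function names are hypothetical, and the single node row in the demo is a toy value rather than real ingest output.

```python
import csv
import io


def load_node_ids(handle):
    """Collect the `id` column of a tab-separated nodes file (assumed layout)."""
    return {row["id"] for row in csv.DictReader(handle, delimiter="\t")}


def missing_gene_ids(edges_handle, node_ids,
                     columns=("original_subject", "original_object")):
    """Return NCBIGene identifiers used in the edges but absent from node_ids."""
    missing = set()
    for row in csv.DictReader(edges_handle, delimiter="\t"):
        for col in columns:
            value = (row.get(col) or "").strip()
            if value.startswith("NCBIGene:") and value not in node_ids:
                missing.add(value)
    return missing


# Toy demo: the edge is the one quoted above; the node row is a placeholder.
nodes_tsv = io.StringIO("id\tcategory\nNCBIGene:1\tbiolink:Gene\n")
edges_tsv = io.StringIO(
    "original_subject\tpredicate\toriginal_object\n"
    "NCBIGene:161003\tbiolink:interacts_with\tNCBIGene:41\n"
)
print(sorted(missing_gene_ids(edges_tsv, load_node_ids(nodes_tsv))))
```

For the real files, the same functions could be handed file objects opened on ncbi_gene_nodes.tsv and on the (decompressed) dangling edges file.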
The NCBI gene ingest is very simple in character but it does have a taxonomic filter which could be removing entries.
I guess a closer review of the NCBI Gene input file and how it is being filtered out by the NCBI ingest script, relative to what identifiers are encountered in the STRING ingest, should shed some light on the situation.
All of this detective work would be so much easier with taxon info on subjects and objects in the edge files.
The NCBI Gene hits (in the dangling edges file) all seem to have HGNC identifiers suggesting that they are of human origin.
The analysis doesn't explain all of the dangling edges: there is a block of edges at the end of the file which are missing their original_subject and original_object NCBIGene identifiers. I'll have to check why those are getting through (that's probably an ingest bug).
I guess the NCBIGene ingest only imports gene identifiers for a select subset of species: dog, cattle, pig and chicken... no wonder the STRING output identifiers don't all match to that ingest data!
Rather, we need to look further afield to the general gene mapping ingests we make elsewhere. Thus, we could pose the question: Is the SSSOM data from the Monarch Gene Mappings complete?
@kevinschaper made good progress in removing many of the previously observed dangling edges; however, the job is not done (as I write, 12 June 2023, ~20% are still missing).
@kevinschaper, another quick exploration attempted this morning (19 July 2023)...
Using a newly written script, analyse_dangling_edges.py, in the monarch-gene-mapping repository, I extracted the NCBIGene identifiers relating to the STRING ingest that are recorded in the monarch-kg-dangling-edges.tsv.gz file.
I then searched (by simple grep) for one (the first) of these identifiers, NCBIGene:2872795, within the NCBI gene_info.gz file. The entry that matched was:
$ gunzip -c gene_info.gz |grep -P "\s2872795\s"
227321 2872795 ANIA_04997 ANIA_04997 AN4997.2|AN4997.4 - III - hypothetical protein protein-coding - - - hypothetical protein 20230413 -
which is obviously a gene prediction from the original genomic sequence assembly of Aspergillus nidulans (now Emericella nidulans (strain FGSC A4 / ATCC 38163 / CBS 112.46 / NRRL 194 / M139)). Looking up this specific identifier in UniProtKB gives the entry Q5B383.
The rich UniProtKB entry is very suggestive of a real protein - Phosphatidylinositol transporter (Eurofung) - with functional evidence inferred from various directions (albeit, not totally clear how much of this is experimental data).
Once again, I'm wondering if a way forward with respect to resolving unmapped STRING identifiers might be to take those identifiers and search directly against the UniProtKB to pull out what annotation we need to properly ingest the corresponding Monarch nodes?
For completeness here, though, I need to go back to the dangling edges file entries that appeared to have the above NCBIGene identifier:
$ gunzip -c monarch-kg-dangling-edges.tsv.gz |grep "NCBIGene:2872795"
id original_subject predicate original_object category aggregator_knowledge_source has_evidence primary_knowledge_source provided_by publications frequency_qualifier negated onset_qualifier sex_qualifier qualifiers evidence relation stage_qualifier subject object
uuid:dbe48109-2136-11ee-873a-cd90a19c4085 biolink:interacts_with biolink:PairwiseGeneToGeneInteraction infores:monarchinitiative infores:string string_protein_links_edges NCBIGene:2873697 NCBIGene:2872795
uuid:dcbb11c6-2136-11ee-873a-cd90a19c4085 biolink:interacts_with biolink:PairwiseGeneToGeneInteraction infores:monarchinitiative infores:string string_protein_links_edges NCBIGene:2873254 NCBIGene:2872795
uuid:dd96f2c7-2136-11ee-873a-cd90a19c4085 biolink:interacts_with biolink:PairwiseGeneToGeneInteraction infores:monarchinitiative infores:string string_protein_links_edges NCBIGene:2872795 NCBIGene:2873911
uuid:dd96f2c8-2136-11ee-873a-cd90a19c4085 biolink:interacts_with biolink:PairwiseGeneToGeneInteraction infores:monarchinitiative infores:string string_protein_links_edges NCBIGene:2872795 NCBIGene:2873254
uuid:dd96f2c9-2136-11ee-873a-cd90a19c4085 biolink:interacts_with biolink:PairwiseGeneToGeneInteraction infores:monarchinitiative infores:string string_protein_links_edges NCBIGene:2872795 NCBIGene:2873697
uuid:dd96f2ca-2136-11ee-873a-cd90a19c4085 biolink:interacts_with biolink:PairwiseGeneToGeneInteraction infores:monarchinitiative infores:string string_protein_links_edges NCBIGene:2872795 NCBIGene:2870366
uuid:e2ec57f6-2136-11ee-873a-cd90a19c4085 biolink:interacts_with biolink:PairwiseGeneToGeneInteraction infores:monarchinitiative infores:string string_protein_links_edges NCBIGene:2870366 NCBIGene:2872795
uuid:e7a5813b-2136-11ee-873a-cd90a19c4085 biolink:interacts_with biolink:PairwiseGeneToGeneInteraction infores:monarchinitiative infores:string string_protein_links_edges NCBIGene:2873911 NCBIGene:2872795
since I didn't (yet) look at the companion identifiers in the given dangling edges. There is some duplication in these entries. The unique set of companion identifiers is the following:
NCBIGene:2873697
NCBIGene:2873254
NCBIGene:2873911
NCBIGene:2870366
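For the record, the de-duplication above could be reproduced with something like the following. It is a minimal sketch, assuming the identifiers sit in the `subject` and `object` columns of a tab-separated file with a header row (the real file has many more columns, omitted here); `companion_ids` is a hypothetical helper name.

```python
import csv
import io


def companion_ids(edges_handle, focus_id):
    """List the distinct partners of focus_id, in order of first appearance."""
    companions = []
    for row in csv.DictReader(edges_handle, delimiter="\t"):
        pair = (row.get("subject") or "", row.get("object") or "")
        if focus_id not in pair:
            continue
        for value in pair:
            if value and value != focus_id and value not in companions:
                companions.append(value)
    return companions


# The eight subject/object pairs listed above, reduced to two columns.
pairs = [
    ("NCBIGene:2873697", "NCBIGene:2872795"),
    ("NCBIGene:2873254", "NCBIGene:2872795"),
    ("NCBIGene:2872795", "NCBIGene:2873911"),
    ("NCBIGene:2872795", "NCBIGene:2873254"),
    ("NCBIGene:2872795", "NCBIGene:2873697"),
    ("NCBIGene:2872795", "NCBIGene:2870366"),
    ("NCBIGene:2870366", "NCBIGene:2872795"),
    ("NCBIGene:2873911", "NCBIGene:2872795"),
]
tsv = "subject\tobject\n" + "".join(f"{s}\t{o}\n" for s, o in pairs)
for gene_id in companion_ids(io.StringIO(tsv), "NCBIGene:2872795"):
    print(gene_id)
```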
Searching the NCBI Gene Info archive file indicates, though, that those counterpart NCBI identifiers are themselves predicted genes from the original genomic sequence assembly of Aspergillus nidulans, albeit with some interesting variability in annotation:
227321 2873697 ANIA_04278 ANIA_04278 AN4278.2|AN4278.4 - II - 1-phosphatidylinositol 4-kinase STT4 protein-coding - - - 1-phosphatidylinositol 4-kinase STT4 20230716 -
227321 2873254 ANIA_03841 ANIA_03841 AN3841.2|AN3841.4 - II - hypothetical protein protein-coding - - - hypothetical protein 20230413 -
227321 2873911 ANIA_02877 ANIA_02877 AN2877.2|AN2877.4 - VI - hypothetical protein protein-coding - - - hypothetical protein 20230413 -
227321 2870366 ANIA_06709 ANIA_06709 AN6709.2|AN6709.4 - I - Arf family guanine nucleotide exchange factor SEC7 protein-coding -
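Rather than grepping one identifier at a time, the taxon lookup could be batched. The sketch below assumes the gene_info layout seen in the rows above, i.e. tab-separated with tax_id in the first column and GeneID in the second, and comment lines starting with `#`. The function names are hypothetical, and the demo rows are cut down to their first three fields.

```python
import gzip
from collections import Counter


def taxa_for_gene_ids(lines, curies):
    """Map each requested NCBIGene CURIE found in gene_info lines to its tax_id."""
    wanted = {curie.split(":", 1)[-1] for curie in curies}
    found = {}
    for line in lines:
        if line.startswith("#"):
            continue  # skip the header/comment line
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[1] in wanted:
            found["NCBIGene:" + fields[1]] = fields[0]
    return found


def taxa_from_file(path, curies):
    """Same lookup, streaming a gzip-compressed gene_info file from disk."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return taxa_for_gene_ids(handle, curies)


# Demo on the five rows quoted in this thread (first three fields only).
rows = [
    "227321\t2872795\tANIA_04997\n",
    "227321\t2873697\tANIA_04278\n",
    "227321\t2873254\tANIA_03841\n",
    "227321\t2873911\tANIA_02877\n",
    "227321\t2870366\tANIA_06709\n",
]
ids = ["NCBIGene:2872795", "NCBIGene:2873697", "NCBIGene:2873254",
       "NCBIGene:2873911", "NCBIGene:2870366"]
print(Counter(taxa_for_gene_ids(rows, ids).values()))
```

Tallying the tax_id values this way across all unmatched STRING identifiers would show at a glance which species account for the remaining dangling edges.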
The STRING QC stats are looking better as of this date (23 Jan 2024). You could continue to iterate on the above ideas on a case-by-case basis, to see if the other dangling edges can be accounted for.
We're at about 80% fix. ~200K we are not getting. Is there something we need in that set? ... E.g., Bgee was using older ENSEMBL IDs instead of current ones, and this was causing trouble. We need an investigator to take this on. BUT it is not urgent and should not consume anyone's entire time.
Dear @madanucd -- do you happen to have any bandwidth?
Else, this goes on the icebox.
Related to #726
Sure, I can look into it.
Dangling edges found for STRING. See https://monarch-initiative.github.io/monarch-qc/ for the report; https://data.monarchinitiative.org/monarch-kg-dev/ for the data.
Find out why and suggest a repair.