Closed Luke-Johnson-5 closed 6 months ago
There are 54264 rows in protein_homolog_clusters.csv with a missing uniprotkb_canonical_ac
. We suspect this is due to a mapping issue.
In the json source files, each organism entry contains an omaidand
and a canonicalid
:
"omaid": "DROME17888",
"canonicalid": "HSP70_DROME",
The canonicalid
can contain any of the following: UniProt canonical ID (Q9VEJ1), UniProt non-canonical ID (A0A8V0ZP44), UniProt ID (SF01_MOUSE), RefSeq gene id (XP_040526881), Ensembl ID (ENSSSCG00000023819.2) - it can also contain UniProt IDs that are not present in the *_protein_info_uniprot.csv files (LAP2B_MOUSE)
👉 Use the file downloads/oma/current/oma-uniprot.txt to map the omaid
to the various UniProt ids (the mapping is to both TrEML and Swiss-Prot and also to any of the protein names and primary and secondary ACs). Then map from the UniProt ids to our canonical UniProt accession using the *_protein_masterlist.csv* and _protein_info_uniprot.csv** files.
It's possible that an omaid can map to multiple canonical UniProt ids:
$ grep DROME17888 /data/projects/glygen/downloads/oma/current/oma-uniprot.txt
DROME17888 HSP70_DROME
DROME17888 P02825
DROME17888 P82910
Please create multiple rows for such cases in the protein_homolog_clusters.csv dataset:
"homolog_cluster_id","uniprotkb_canonical_ac","tax_id","xref_key","xref_id"
"964271","P02825-1","7227","protein_xref_oma","964271"
"964271","P82910-1","7227","protein_xref_oma","964271"
check now
I believe this is fixed now @kmartinez834
The example rows above are still missing from the output file:
"homolog_cluster_id","uniprotkb_canonical_ac","tax_id","xref_key","xref_id"
"964271","P02825-1","7227","protein_xref_oma","964271"
"964271","P82910-1","7227","protein_xref_oma","964271"
@kmartinez834 -- explain why "P02825-1" and "P82910-1" should be in homolog_cluster_id= 964271? According to my processing, they should be in homolog_cluster_id= 966188?
"homolog_cluster_id","uniprotkb_canonical_ac","tax_id","xref_key","xref_id"
"966188","Q54BE0-1","44689","protein_xref_oma","966188"
"966188","P22202-1","559292","protein_xref_oma","966188"
"966188","P82910-1","7227","protein_xref_oma","966188"
"966188","P02825-1","7227","protein_xref_oma","966188"
"homolog_cluster_id","uniprotkb_canonical_ac","tax_id","xref_key","xref_id"
"964271","Q61696-1","10090","protein_xref_oma","964271"
"964271","P0DMV8-1","9606","protein_xref_oma","964271"
"964271","P10592-1","559292","protein_xref_oma","964271"
"964271","P29843-1","7227","protein_xref_oma","964271"
"964271","P0DMW0-1","10116","protein_xref_oma","964271"
@rykahsay homolog_cluster_id=964271 was incorrect, however "P02825-1" and "P82910-1" should both be in homolog_cluster_id=966188 and homolog_cluster_id=979559:
$ grep P82910 /data/projects/glygen/downloads/oma/current/oma-uniprot.txt
DROME17888 P82910
DROME23613 P82910
$ grep P02825 /data/projects/glygen/downloads/oma/current/oma-uniprot.txt | grep -v [A-Z]P02825
DROME17888 P02825
DROME23613 P02825
Searching oma/current/fruitfly_yeast_oma_orthologs.json for DROME17888 and DROME23613:
{
"entry_1": {
"entry_nr": 16447747,
"entry_url": "https://omabrowser.org/api/protein/16447747/",
"omaid": "DROME17888",
"canonicalid": "HSP70_DROME",
"sequence_md5": "954ea190c6a47965bc01c7003664a099",
"sequence_length": 642,
"species": {
"code": "DROME",
"taxon_id": 7227,
"species": "Drosophila melanogaster",
"genome_url": "https://omabrowser.org/api/genome/DROME/"
},
"oma_group": 979559,
"oma_hog_id": "HOG:D0679423.10bm.107b.82a.95c.69b.51b.30a.15d",
"chromosome": "3R",
"locus": {
"start": 11958789,
"end": 11960717,
"strand": 1
},
"is_main_isoform": true
}
}
{
"entry_1": {
"entry_nr": 16453472,
"entry_url": "https://omabrowser.org/api/protein/16453472/",
"omaid": "DROME23613",
"canonicalid": "HSP70_DROME",
"sequence_md5": "954ea190c6a47965bc01c7003664a099",
"sequence_length": 642,
"species": {
"code": "DROME",
"taxon_id": 7227,
"species": "Drosophila melanogaster",
"genome_url": "https://omabrowser.org/api/genome/DROME/"
},
"oma_group": 966188,
"oma_hog_id": "HOG:D0679423.10bm.107b.82a.95c.69b.51b.30a.15c",
"chromosome": "3R",
"locus": {
"start": 11954361,
"end": 11956289,
"strand": -1
},
"is_main_isoform": true
}
}
check now
I see 9483 pairs of uniprot_ac & homolog cluster still missing. Here are some examples:
uniprotkb_canonical_ac,homolog_cluster_id
Q9VZW0-1,844259
Q9VZY9-1,1245919
Q9VZZ4-1,988692
Q9VZZ5-1,869300
Q9VZZ8-1,1220253
Q9W011-1,930694
Q9W024-1,1164480
Here's where the first example above comes from:
$ grep Q9VZW0 /data/projects/glygen/downloads/oma/current/oma-uniprot.txt
DROME15838 Q9VZW0
$ grep DROME15838 /data/projects/glygen/downloads/oma/current/*.json | awk -F: '{print $1}'
/data/projects/glygen/downloads/oma/current/chicken_fruitfly_oma_orthologs.json
/data/projects/glygen/downloads/oma/current/dicty_fruitfly_oma_orthologs.json
/data/projects/glygen/downloads/oma/current/fruitfly_chicken_oma_orthologs.json
/data/projects/glygen/downloads/oma/current/fruitfly_dicty_oma_orthologs.json
fruitfly_dicty_oma_orthologs.json
{
"entry_1": {
"entry_nr": 16445697,
"entry_url": "https://omabrowser.org/api/protein/16445697/",
"omaid": "DROME15838",
"canonicalid": "Q9VZW0",
"sequence_md5": "28e38284382e0928a36253fa884181dc",
"sequence_length": 661,
"species": {
"code": "DROME",
"taxon_id": 7227,
"species": "Drosophila melanogaster",
"genome_url": "https://omabrowser.org/api/genome/DROME/"
},
"oma_group": 844259,
"oma_hog_id": "HOG:D0682320.2a.1a",
"chromosome": "3L",
"locus": {
"start": 3047190,
"end": 3049654,
"strand": 1
},
"is_main_isoform": true
}
}
Looks like these are singlton clusters (clusters containing one member at the uniprotkb_canonical_ac) and I exclude clusters with single members (since these will not result in alignments).
I have done cluster size analysis and the file /tmp/oma_singletons.txt contains 9474 oma_groups containing single members.
$ wc /tmp/oma_singletons.txt
9474 9474 69816 /tmp/oma_singletons.txt
Given below is the first three oma_group IDs you have listed and I am showing all their members at oma_id, uniprotkb_ac and uniprotkb_canonical_ac levels.
oma_group,oma_id,uniprotkb_ac,uniprotkb_canonical_ac
1245919,DROME14594,Q9VZY9,Q9VZY9-1
844259,DROME15838,Q9VZW0,Q9VZW0-1
988692,DROME13937,Q9VZZ4,Q9VZZ4-1
Got it, confirmed that all "missing" cluster id's are singletons
In /data/projects/glygen/generated/datasets/unreviewed/protein_homolog_clusters.csv there appears to be issues with adding these canonical accessions. One such example of this is with Q9VEJ1-1;
In this unreviewed file we see this line:
"homolog_cluster_id","uniprotkb_canonical_ac","tax_id","xref_key","xref_id" - (header) "961554","","7227","protein_xref_oma","961554"
while in the version from 2.4.1 the accession is present:
"961554","Q9VEJ1-1","7227","protein_xref_oma","961554"
In the download json this accession is still present:
downloads/oma/current/fruitfly_mouse_oma_orthologs.json
This appears to be an issue the way the mapping file /data/projects/glygen/downloads/oma/current/oma-uniprot.txt is being used to map the omaid with the canonical accession. This is just one example and many uniprotkb canonical accessions are missing. Karina will provide more information on what we believe is occurring: