Missing Uniprotkb_cannonical_ac from protein_homolog_clusters.csv

Luke-Johnson-5 commented 6 months ago

In /data/projects/glygen/generated/datasets/unreviewed/protein_homolog_clusters.csv there appears to be issues with adding these canonical accessions. One such example of this is with Q9VEJ1-1;

In this unreviewed file we see this line:

"homolog_cluster_id","uniprotkb_canonical_ac","tax_id","xref_key","xref_id" - (header) "961554","","7227","protein_xref_oma","961554"

while in the version from 2.4.1 the accession is present:

"961554","Q9VEJ1-1","7227","protein_xref_oma","961554"

In the download json this accession is still present:

downloads/oma/current/fruitfly_mouse_oma_orthologs.json

{
    "entry_1": {
        "entry_nr": 16449465,
        "entry_url": "https://omabrowser.org/api/protein/16449465/",
        "omaid": "DROME19606",
        "canonicalid": "Q9VEJ1",
        "sequence_md5": "bb93816e4791608ba555c2492b744728",
        "sequence_length": 787,
        "species": {
            "code": "DROME",
            "taxon_id": 7227,
            "species": "Drosophila melanogaster",
            "genome_url": "https://omabrowser.org/api/genome/DROME/"
        },
        "oma_group": 961554,
        "oma_hog_id": "HOG:D0687673.3c.5d.4b.1a.1a",
        "chromosome": "3R",
        "locus": {
            "start": 17540625,
            "end": 17544932,
            "strand": 1
        },
        "is_main_isoform": true
    },
    "entry_2": {
        "entry_nr": 12346396,
        "entry_url": "https://omabrowser.org/api/protein/12346396/",
        "omaid": "MOUSE25511",
        "canonicalid": "SF01_MOUSE",
        "sequence_md5": "9531e3f815852bfb6903436d8d437f6d",
        "sequence_length": 639,
        "species": {
            "code": "MOUSE",
            "taxon_id": 10090,
            "species": "Mus musculus",
            "genome_url": "https://omabrowser.org/api/genome/MOUSE/"
        },
        "oma_group": 961554,
        "oma_hog_id": "HOG:D0687673.3c.5d.4b.2b.7c",
        "chromosome": "19",
        "locus": {
            "start": 6414118,
            "end": 6426025,
            "strand": 1
        },
        "is_main_isoform": true
    },
    "rel_type": "1:1",
    "distance": 55,
    "score": 1992.5400390625
}

This appears to be an issue the way the mapping file /data/projects/glygen/downloads/oma/current/oma-uniprot.txt is being used to map the omaid with the canonical accession. This is just one example and many uniprotkb canonical accessions are missing. Karina will provide more information on what we believe is occurring:

kmartinez834 commented 6 months ago

There are 54264 rows in protein_homolog_clusters.csv with a missing uniprotkb_canonical_ac. We suspect this is due to a mapping issue.

In the json source files, each organism entry contains an omaidand and a canonicalid:

"omaid": "DROME17888",
"canonicalid": "HSP70_DROME",

The canonicalid can contain any of the following: UniProt canonical ID (Q9VEJ1), UniProt non-canonical ID (A0A8V0ZP44), UniProt ID (SF01_MOUSE), RefSeq gene id (XP_040526881), Ensembl ID (ENSSSCG00000023819.2) - it can also contain UniProt IDs that are not present in the *_protein_info_uniprot.csv files (LAP2B_MOUSE)

👉 Use the file downloads/oma/current/oma-uniprot.txt to map the omaid to the various UniProt ids (the mapping is to both TrEML and Swiss-Prot and also to any of the protein names and primary and secondary ACs). Then map from the UniProt ids to our canonical UniProt accession using the *_protein_masterlist.csv* and _protein_info_uniprot.csv** files.

It's possible that an omaid can map to multiple canonical UniProt ids:

$ grep DROME17888 /data/projects/glygen/downloads/oma/current/oma-uniprot.txt
DROME17888      HSP70_DROME
DROME17888      P02825
DROME17888      P82910

Please create multiple rows for such cases in the protein_homolog_clusters.csv dataset:

"homolog_cluster_id","uniprotkb_canonical_ac","tax_id","xref_key","xref_id"
"964271","P02825-1","7227","protein_xref_oma","964271"
"964271","P82910-1","7227","protein_xref_oma","964271"

rykahsay commented 6 months ago

check now

Luke-Johnson-5 commented 6 months ago

I believe this is fixed now @kmartinez834

kmartinez834 commented 6 months ago

The example rows above are still missing from the output file:

"homolog_cluster_id","uniprotkb_canonical_ac","tax_id","xref_key","xref_id"
"964271","P02825-1","7227","protein_xref_oma","964271"
"964271","P82910-1","7227","protein_xref_oma","964271"

rykahsay commented 6 months ago

@kmartinez834 -- explain why "P02825-1" and "P82910-1" should be in homolog_cluster_id= 964271? According to my processing, they should be in homolog_cluster_id= 966188?

"homolog_cluster_id","uniprotkb_canonical_ac","tax_id","xref_key","xref_id"
"966188","Q54BE0-1","44689","protein_xref_oma","966188"
"966188","P22202-1","559292","protein_xref_oma","966188"
"966188","P82910-1","7227","protein_xref_oma","966188"
"966188","P02825-1","7227","protein_xref_oma","966188"

"homolog_cluster_id","uniprotkb_canonical_ac","tax_id","xref_key","xref_id"
"964271","Q61696-1","10090","protein_xref_oma","964271"
"964271","P0DMV8-1","9606","protein_xref_oma","964271"
"964271","P10592-1","559292","protein_xref_oma","964271"
"964271","P29843-1","7227","protein_xref_oma","964271"
"964271","P0DMW0-1","10116","protein_xref_oma","964271"

kmartinez834 commented 6 months ago

@rykahsay homolog_cluster_id=964271 was incorrect, however "P02825-1" and "P82910-1" should both be in homolog_cluster_id=966188 and homolog_cluster_id=979559:

$ grep P82910 /data/projects/glygen/downloads/oma/current/oma-uniprot.txt
DROME17888      P82910
DROME23613      P82910
$ grep P02825 /data/projects/glygen/downloads/oma/current/oma-uniprot.txt | grep -v [A-Z]P02825
DROME17888      P02825
DROME23613      P02825

Searching oma/current/fruitfly_yeast_oma_orthologs.json for DROME17888 and DROME23613:

{
    "entry_1": {
        "entry_nr": 16447747,
        "entry_url": "https://omabrowser.org/api/protein/16447747/",
        "omaid": "DROME17888",
        "canonicalid": "HSP70_DROME",
        "sequence_md5": "954ea190c6a47965bc01c7003664a099",
        "sequence_length": 642,
        "species": {
            "code": "DROME",
            "taxon_id": 7227,
            "species": "Drosophila melanogaster",
            "genome_url": "https://omabrowser.org/api/genome/DROME/"
        },
        "oma_group": 979559,
        "oma_hog_id": "HOG:D0679423.10bm.107b.82a.95c.69b.51b.30a.15d",
        "chromosome": "3R",
        "locus": {
            "start": 11958789,
            "end": 11960717,
            "strand": 1
        },
        "is_main_isoform": true
    }
}

{
    "entry_1": {
        "entry_nr": 16453472,
        "entry_url": "https://omabrowser.org/api/protein/16453472/",
        "omaid": "DROME23613",
        "canonicalid": "HSP70_DROME",
        "sequence_md5": "954ea190c6a47965bc01c7003664a099",
        "sequence_length": 642,
        "species": {
            "code": "DROME",
            "taxon_id": 7227,
            "species": "Drosophila melanogaster",
            "genome_url": "https://omabrowser.org/api/genome/DROME/"
        },
        "oma_group": 966188,
        "oma_hog_id": "HOG:D0679423.10bm.107b.82a.95c.69b.51b.30a.15c",
        "chromosome": "3R",
        "locus": {
            "start": 11954361,
            "end": 11956289,
            "strand": -1
        },
        "is_main_isoform": true
    }
}

rykahsay commented 6 months ago

check now

kmartinez834 commented 6 months ago

I see 9483 pairs of uniprot_ac & homolog cluster still missing. Here are some examples:

uniprotkb_canonical_ac,homolog_cluster_id
Q9VZW0-1,844259
Q9VZY9-1,1245919
Q9VZZ4-1,988692
Q9VZZ5-1,869300
Q9VZZ8-1,1220253
Q9W011-1,930694
Q9W024-1,1164480

Here's where the first example above comes from:

$ grep Q9VZW0 /data/projects/glygen/downloads/oma/current/oma-uniprot.txt 
DROME15838      Q9VZW0

$ grep DROME15838 /data/projects/glygen/downloads/oma/current/*.json | awk -F: '{print $1}'
/data/projects/glygen/downloads/oma/current/chicken_fruitfly_oma_orthologs.json
/data/projects/glygen/downloads/oma/current/dicty_fruitfly_oma_orthologs.json
/data/projects/glygen/downloads/oma/current/fruitfly_chicken_oma_orthologs.json
/data/projects/glygen/downloads/oma/current/fruitfly_dicty_oma_orthologs.json

fruitfly_dicty_oma_orthologs.json

{
    "entry_1": {
        "entry_nr": 16445697,
        "entry_url": "https://omabrowser.org/api/protein/16445697/",
        "omaid": "DROME15838",
        "canonicalid": "Q9VZW0",
        "sequence_md5": "28e38284382e0928a36253fa884181dc",
        "sequence_length": 661,
        "species": {
            "code": "DROME",
            "taxon_id": 7227,
            "species": "Drosophila melanogaster",
            "genome_url": "https://omabrowser.org/api/genome/DROME/"
        },
        "oma_group": 844259,
        "oma_hog_id": "HOG:D0682320.2a.1a",
        "chromosome": "3L",
        "locus": {
            "start": 3047190,
            "end": 3049654,
            "strand": 1
        },
        "is_main_isoform": true
    }
}

rykahsay commented 6 months ago

Looks like these are singlton clusters (clusters containing one member at the uniprotkb_canonical_ac) and I exclude clusters with single members (since these will not result in alignments).

I have done cluster size analysis and the file /tmp/oma_singletons.txt contains 9474 oma_groups containing single members.

$ wc /tmp/oma_singletons.txt 
 9474  9474 69816 /tmp/oma_singletons.txt

Given below is the first three oma_group IDs you have listed and I am showing all their members at oma_id, uniprotkb_ac and uniprotkb_canonical_ac levels.

oma_group,oma_id,uniprotkb_ac,uniprotkb_canonical_ac
1245919,DROME14594,Q9VZY9,Q9VZY9-1
844259,DROME15838,Q9VZW0,Q9VZW0-1
988692,DROME13937,Q9VZZ4,Q9VZZ4-1

kmartinez834 commented 6 months ago

Got it, confirmed that all "missing" cluster id's are singletons

glygener / glygen-issues

Missing Uniprotkb_cannonical_ac from protein_homolog_clusters.csv #1271