glygener / glygen-issues

Repository for public GlyGen tickets
GNU General Public License v3.0

Lost citations #1141

Open kmartinez834 opened 4 months ago

kmartinez834 commented 4 months ago

A significant number of citations were lost from UniProt and RefSeq protein files. Do Medline files need to be downloaded, or is this a processing error?

"id_count_diff","id_count_new","id_count_old","row_count_diff","row_count_new","row_count_old","field_count_diff","field_count_new","field_count_old","dataset_file_name","status_flags"
"-203","11936","12139","-6225","33820","40045","0","9","9","mouse_protein_citations_refseq.csv","old_dataset; rowcount_change; idcount_change"
"-424","4980","5404","-1592","13679","15271","0","9","9","rat_protein_citations_refseq.csv","old_dataset; rowcount_change; idcount_change"
"-1997","4063","6060","-40662","12923","53585","0","9","9","yeast_protein_citations_uniprotkb.csv","old_dataset; rowcount_change; idcount_change"
"-2889","11444","14333","-26614","35156","61770","0","9","9","mouse_protein_citations_uniprotkb.csv","old_dataset; rowcount_change; idcount_change"
"-3016","14602","17618","-66703","85223","151926","0","9","9","human_protein_citations_uniprotkb.csv","old_dataset; rowcount_change; idcount_change"
"-3647","5215","8862","-11242","13422","24664","0","9","9","rat_protein_citations_uniprotkb.csv","old_dataset; rowcount_change; idcount_change"
"-10989","2835","13824","-189059","6501","195560","0","9","9","fruitfly_protein_citations_uniprotkb.csv","old_dataset; rowcount_change; idcount_change"
"-12079","647","12726","-17141","1119","18260","0","9","9","dicty_protein_citations_uniprotkb.csv","old_dataset; rowcount_change; idcount_change"
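The comparison script itself is not shown in this thread, so the following is only a minimal sketch of how counts like those above could be derived. It assumes the first column of each dataset holds the ID and that each diff is new minus old (which matches the rows above, e.g. 11936 - 12139 = -203). The function names and toy data are hypothetical, and the `old_dataset` flag is not reproduced because its meaning is not stated here.

```python
import csv
import io

# Toy stand-ins for an old and a new release of one dataset file (hypothetical data).
OLD = '"ac","pmid"\n"A-1","111"\n"A-1","222"\n"B-1","333"\n'
NEW = '"ac","pmid"\n"A-1","111"\n'


def summarize(csv_text):
    """Count distinct IDs (first column), data rows, and fields in one CSV."""
    rows = [r for r in csv.reader(io.StringIO(csv_text)) if r]
    header, body = rows[0], rows[1:]
    return {
        "id_count": len({r[0] for r in body}),
        "row_count": len(body),
        "field_count": len(header),
    }


def compare(old_text, new_text):
    """Build a report with *_old, *_new, and *_diff (new minus old) values."""
    old, new = summarize(old_text), summarize(new_text)
    report = {}
    for key in ("id_count", "row_count", "field_count"):
        report[key + "_old"] = old[key]
        report[key + "_new"] = new[key]
        report[key + "_diff"] = new[key] - old[key]
    # Flag only the kinds of change that appear in the table above.
    flags = []
    if report["row_count_diff"]:
        flags.append("rowcount_change")
    if report["id_count_diff"]:
        flags.append("idcount_change")
    report["status_flags"] = "; ".join(flags)
    return report


if __name__ == "__main__":
    print(compare(OLD, NEW))
```

A negative `id_count_diff` alongside an unchanged `field_count` is the pattern reported for every file listed above.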

Example: all of the RefSeq PMIDs are missing from https://tst.api.glygen.org/protein/detail/P55067 (they were present in the last release):

{
    "title": "Upregulation of CSPG3 accompanies neuronal progenitor proliferation and migration in EAE.",
    "journal": "Journal of molecular neuroscience : MN",
    "date": "2011",
    "authors": "Sajad M, Zargan J, Chawla R, Umar S, Khan HA",
    "evidence": [
        {
            "database": "RefSeq",
            "id": "NP_113841",
            "url": "https://www.ncbi.nlm.nih.gov/protein/NP_113841"
        }
    ],
    "reference": [
        {
            "type": "PubMed",
            "id": "21107918",
            "url": "https://glygen.org/publication/PubMed/21107918"
        }
    ]
},
rykahsay commented 4 months ago

Luke had reported this issue, and I have already fixed it. It looks like your check was done on Feb 26; I ran the same check on Mar 6, after the fix. Just to make sure, please run the script again.

kmartinez834 commented 4 months ago

The following issues remain after rerunning the script:

mouse_protein_function_refseq.csv and rat_protein_function_refseq.csv are missing accessions, resulting in missing citations.

See A0A096MJ01-1, P55067, A0A087WRT4-1, A0A0A6YYP6-1

"-244","13147","13391","-12801","69978","82779","0","5","5","mouse_protein_function_refseq.csv","old_dataset; rowcount_change; idcount_change"
"-202","11937","12139","-6224","33821","40045","0","9","9","mouse_protein_citations_refseq.csv","old_dataset; rowcount_change; idcount_change"
"-467","5843","6310","-3297","28339","31636","0","5","5","rat_protein_function_refseq.csv","old_dataset; rowcount_change; idcount_change"
"-423","4981","5404","-1591","13680","15271","0","9","9","rat_protein_citations_refseq.csv","old_dataset; rowcount_change; idcount_change"
rykahsay commented 3 months ago

In downloads/ebi/current/uniprot-proteome-rattus-norvegicus.nt

<http://purl.uniprot.org/uniprot/G3V8R2> <http://www.w3.org/2000/01/rdf-schema#seeAlso> <http://purl.uniprot.org/refseq/NP_113841.2> .

In rat_protein_masterlist.csv, filtered with grep G3V8R2:

"uniprotkb_canonical_ac","status","gene_name","reviewed_isoforms","unreviewed_isoforms"
"P55067-1","reviewed","Ncan","P55067-1","G3V8R2-1"

Since NP_113841.2 maps to G3V8R2, an unreviewed isoform, rather than to the canonical isoform P55067-1, it is no longer included in rat_protein_xref.csv. As a result, no NP_113841 entry should exist in rat_protein_function_refseq.csv, and therefore none in rat_protein_citations_refseq.csv.
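To make the chain above easier to check for other accessions, here is a rough sketch that follows a RefSeq accession from the N-Triples line to its UniProtKB accession and then looks that accession up in the masterlist. It is not the pipeline code. The helper names are hypothetical, and it assumes each isoform field holds a single accession, as in the row shown above.

```python
import csv
import io
import re

# The two records quoted in this comment, inlined so the sketch is self-contained.
NT_TEXT = (
    "<http://purl.uniprot.org/uniprot/G3V8R2> "
    "<http://www.w3.org/2000/01/rdf-schema#seeAlso> "
    "<http://purl.uniprot.org/refseq/NP_113841.2> .\n"
)
MASTERLIST = (
    '"uniprotkb_canonical_ac","status","gene_name","reviewed_isoforms","unreviewed_isoforms"\n'
    '"P55067-1","reviewed","Ncan","P55067-1","G3V8R2-1"\n'
)

TRIPLE = re.compile(
    r"<http://purl\.uniprot\.org/uniprot/([^>]+)> "
    r"<http://www\.w3\.org/2000/01/rdf-schema#seeAlso> "
    r"<http://purl\.uniprot\.org/refseq/([^>]+)> \."
)


def refseq_links(nt_text):
    """Return (uniprot_ac, refseq_ac) pairs found in seeAlso triples."""
    pairs = []
    for line in nt_text.splitlines():
        match = TRIPLE.match(line.strip())
        if match:
            pairs.append((match.group(1), match.group(2)))
    return pairs


def classify(uniprot_ac, masterlist_text):
    """Return (canonical_ac, role); role is 'canonical', 'isoform', or None."""
    for row in csv.DictReader(io.StringIO(masterlist_text)):
        canonical = row["uniprotkb_canonical_ac"]
        if canonical.split("-")[0] == uniprot_ac:
            return canonical, "canonical"
        # Assumption: one accession per isoform field, as in the row above.
        for field in ("reviewed_isoforms", "unreviewed_isoforms"):
            if row[field].split("-")[0] == uniprot_ac:
                return canonical, "isoform"
    return None, None


if __name__ == "__main__":
    for uniprot_ac, refseq_ac in refseq_links(NT_TEXT):
        print(refseq_ac, uniprot_ac, classify(uniprot_ac, MASTERLIST))
```

For the records quoted here, NP_113841.2 resolves to G3V8R2, which the masterlist lists only as an isoform under P55067-1, consistent with the explanation above.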

I want @jeet to look deeper into this, and I am passing this issue to 2.5.