Open kmartinez834 opened 8 months ago
Luke had reported this issue and I have already fixed it. It looks like your check was done on Feb 26, and I ran the same check after I fixed it on Mar 6. Just to make sure, please run the script again.
The following issues remain after rerunning the script:
mouse_protein_function_refseq.csv and rat_protein_function_refseq.csv are missing accessions, resulting in missing citations.
See A0A096MJ01-1, P55067, A0A087WRT4-1, A0A0A6YYP6-1
"-244","13147","13391","-12801","69978","82779","0","5","5","mouse_protein_function_refseq.csv","old_dataset; rowcount_change; idcount_change"
"-202","11937","12139","-6224","33821","40045","0","9","9","mouse_protein_citations_refseq.csv","old_dataset; rowcount_change; idcount_change"
"-467","5843","6310","-3297","28339","31636","0","5","5","rat_protein_function_refseq.csv","old_dataset; rowcount_change; idcount_change"
"-423","4981","5404","-1591","13680","15271","0","9","9","rat_protein_citations_refseq.csv","old_dataset; rowcount_change; idcount_change"
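For anyone reading the report rows above, here is a rough Python sketch that sanity-checks them. It assumes (this is not stated in the report itself) that the first nine columns are three (change, new, old) triplets, and it deliberately does not guess which triplet is the ID count and which is the row count. The function name `summarize_report` is made up for illustration.

```python
import csv
import io

# Report rows copied verbatim from the comment above (no header was given).
REPORT_ROWS = '''"-244","13147","13391","-12801","69978","82779","0","5","5","mouse_protein_function_refseq.csv","old_dataset; rowcount_change; idcount_change"
"-202","11937","12139","-6224","33821","40045","0","9","9","mouse_protein_citations_refseq.csv","old_dataset; rowcount_change; idcount_change"
"-467","5843","6310","-3297","28339","31636","0","5","5","rat_protein_function_refseq.csv","old_dataset; rowcount_change; idcount_change"
"-423","4981","5404","-1591","13680","15271","0","9","9","rat_protein_citations_refseq.csv","old_dataset; rowcount_change; idcount_change"
'''


def summarize_report(report_text):
    """Check each (change, new, old) triplet and compute relative change.

    Assumption: columns 0-8 are three triplets laid out as
    (change, new, old); column 9 is the file name.
    """
    summary = {}
    for row in csv.reader(io.StringIO(report_text)):
        numbers = [int(value) for value in row[:9]]
        filename = row[9]
        triplets = [numbers[i:i + 3] for i in range(0, 9, 3)]
        # A triplet is consistent when change == new - old.
        consistent = all(change == new - old for change, new, old in triplets)
        # Relative change in percent for the first two triplets
        # (the third triplet is unchanged in every row shown).
        pct = [100.0 * change / old for change, _new, old in triplets[:2]]
        summary[filename] = {
            "consistent": consistent,
            "first_pct": pct[0],
            "second_pct": pct[1],
        }
    return summary


if __name__ == "__main__":
    for name, info in summarize_report(REPORT_ROWS).items():
        print(name, info["consistent"],
              f"{info['first_pct']:.1f}%", f"{info['second_pct']:.1f}%")
```

This only confirms that the reported changes are internally consistent and shows the size of the drop per file; it says nothing about why the entries are missing.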
In downloads/ebi/current/uniprot-proteome-rattus-norvegicus.nt
<http://purl.uniprot.org/uniprot/G3V8R2> <http://www.w3.org/2000/01/rdf-schema#seeAlso> <http://purl.uniprot.org/refseq/NP_113841.2> .
In rat_protein_masterlist.csv | grep G3V8R2
"uniprotkb_canonical_ac","status","gene_name","reviewed_isoforms","unreviewed_isoforms"
"P55067-1","reviewed","Ncan","P55067-1","G3V8R2-1"
NP_113841.2 is attached to G3V8R2, which the masterlist lists as an unreviewed isoform of P55067-1 rather than as the canonical accession. Since NP_113841.2 does not map to the canonical isoform, it is no longer included in rat_protein_xref.csv. As a result, no entry for NP_113841 should exist in rat_protein_function_refseq, and therefore none in rat_protein_citations_refseq.
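To make the check above repeatable, here is a minimal Python sketch that takes the masterlist row and the seeAlso triple quoted above and reports whether each RefSeq link sits on the canonical accession or only on an isoform. The function names are hypothetical, and the parsing assumes one accession per isoform cell, as in the excerpt; the real files may use a delimiter for multiple isoforms. It does not reproduce the actual pipeline logic.

```python
import csv
import io
import re

# Excerpt of rat_protein_masterlist.csv as quoted above.
MASTERLIST_CSV = '''"uniprotkb_canonical_ac","status","gene_name","reviewed_isoforms","unreviewed_isoforms"
"P55067-1","reviewed","Ncan","P55067-1","G3V8R2-1"
'''

# Triple from uniprot-proteome-rattus-norvegicus.nt as quoted above.
NT_LINES = [
    "<http://purl.uniprot.org/uniprot/G3V8R2> "
    "<http://www.w3.org/2000/01/rdf-schema#seeAlso> "
    "<http://purl.uniprot.org/refseq/NP_113841.2> .",
]

SEEALSO_RE = re.compile(
    r"<http://purl\.uniprot\.org/uniprot/([^>]+)> "
    r"<http://www\.w3\.org/2000/01/rdf-schema#seeAlso> "
    r"<http://purl\.uniprot\.org/refseq/([^>]+)>"
)


def strip_isoform(accession):
    """Drop the isoform suffix, e.g. 'G3V8R2-1' -> 'G3V8R2'."""
    return accession.split("-")[0]


def classify_refseq_links(masterlist_text, nt_lines):
    """Return one record per UniProt -> RefSeq seeAlso triple."""
    canonical = set()
    owner = {}  # bare accession -> canonical accession of its masterlist row
    for row in csv.DictReader(io.StringIO(masterlist_text)):
        canon = row["uniprotkb_canonical_ac"]
        canonical.add(strip_isoform(canon))
        # Assumption: a single accession per cell, as in the excerpt.
        for column in ("reviewed_isoforms", "unreviewed_isoforms"):
            if row[column]:
                owner[strip_isoform(row[column])] = canon

    records = []
    for line in nt_lines:
        match = SEEALSO_RE.match(line)
        if not match:
            continue
        uniprot_ac, refseq_ac = match.groups()
        records.append({
            "uniprot_ac": uniprot_ac,
            "refseq_ac": refseq_ac,
            "canonical_ac": owner.get(uniprot_ac),
            "on_canonical": uniprot_ac in canonical,
        })
    return records


if __name__ == "__main__":
    for record in classify_refseq_links(MASTERLIST_CSV, NT_LINES):
        print(record)
```

Any record with `on_canonical` set to False is a RefSeq accession that would drop out of the xref file under the canonical-only mapping described above, which is a starting list for deciding whether those citations should be carried over.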
I want @jeet to look deeper into this, so I am passing this issue to 2.5.
@jeet-vora I've assigned this to you in case you haven't seen it. Please let me know if you'd like me to do anything to help.
A significant number of citations were lost from UniProt and RefSeq protein files. Do Medline files need to be downloaded, or is this a processing error?
Example: all of the RefSeq PMIDs are missing from https://tst.api.glygen.org/protein/detail/P55067 (they were present in the last release).