Closed Luke-Johnson-5 closed 1 month ago
The PMIDs come from four sources and, as shown below, 142448 seems to be the right size.
$ cat unreviewed/human_protein_site_annotation_uniprotkb.csv | awk -F"," '{print $10,$11}' | grep protein_xref_pubmed |sort -u |wc
30423 60846 997755
$ cat unreviewed/human_protein_ptm_annotation_uniprotkb.csv | awk -F"," '{print $2,$3}' | grep protein_xref_pubmed |sort -u |wc
7544 15088 248046
$ cat unreviewed/human_protein_function_uniprotkb.csv | awk -F"," '{print $2,$3}' | grep protein_xref_pubmed |sort -u |wc
29467 58934 969231
$ cat downloads/ebi/current/uniprot-proteome-homo-sapiens.nt | grep "<http://purl.uniprot.org/core/citation>" | awk '{print $3}' |sort -u |wc
99136 99136 4433957
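The four counts above are per source, so they are not directly comparable to a combined total. A rough sketch of how the distinct PMIDs across all four sources could be counted is below. The function names `csv_pmids` and `nt_pmids` are made up for illustration. It assumes the CSV fields may be double-quoted and contain no embedded commas (as the `awk -F","` calls above already assume), that the key column holds `protein_xref_pubmed`, and that numeric citation IDs in the `.nt` file are PMIDs. Non-numeric citation IDs are skipped, so the `.nt` count here can come out lower than the 99136 above.

```shell
# Hypothetical helper: print distinct numeric IDs from ID_COL of a CSV file
# for rows whose KEY_COL contains protein_xref_pubmed.
csv_pmids() {
    awk -F',' -v k="$2" -v i="$3" '
        $k ~ /protein_xref_pubmed/ {
            id = $i
            gsub(/"/, "", id)              # drop surrounding quotes, if any
            if (id ~ /^[0-9]+$/) print id  # keep purely numeric IDs only
        }
    ' "$1" | sort -u
}

# Hypothetical helper: print distinct numeric citation IDs from an N-Triples file.
nt_pmids() {
    grep '<http://purl.uniprot.org/core/citation>' "$1" |
        awk '{
            n = split($3, parts, "/")      # last path segment of the object URI
            id = parts[n]
            sub(/>$/, "", id)              # strip the closing angle bracket
            if (id ~ /^[0-9]+$/) print id
        }' | sort -u
}

# Usage with the files from the commands above (column numbers taken from them):
# {
#     csv_pmids unreviewed/human_protein_site_annotation_uniprotkb.csv 10 11
#     csv_pmids unreviewed/human_protein_ptm_annotation_uniprotkb.csv 2 3
#     csv_pmids unreviewed/human_protein_function_uniprotkb.csv 2 3
#     nt_pmids downloads/ebi/current/uniprot-proteome-homo-sapiens.nt
# } | sort -u | wc -l
```

The final `sort -u` removes PMIDs that appear in more than one source, which is why the combined number can be smaller than the sum of the four counts.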
I see this is all good then
The file human_protein_citations_uniprotkb.csv in /data/shared/glygen/releases/data/v-2.4.1/reviewed, created on Mar 6, is 204025 lines long, while the same file in /data/projects/glygen/generated/datasets/reviewed, created one day later on Mar 7, is only 142448 lines long. Do you know why this discrepancy exists? It seems to be causing issues while checking the datasets for 2.5. I talked to Karina, and she suggested it might be a timeout on the SPARQL endpoint.
Interestingly, the human_protein_citations_uniprotkb.stat.csv file in both directories is the same.
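One way to narrow this down would be to list the lines that appear in the Mar 6 file but not in the Mar 7 one. The sketch below is only a suggestion: `only_in_first` is a made-up helper name, and it compares whole lines, so it assumes both files use the same column order and quoting.

```shell
# Hypothetical helper: print lines present in file $1 but absent from file $2.
# Duplicates within each file are collapsed by sort -u before comparing.
only_in_first() {
    a_sorted=$(mktemp) && b_sorted=$(mktemp) || return 1
    LC_ALL=C sort -u "$1" > "$a_sorted"
    LC_ALL=C sort -u "$2" > "$b_sorted"
    LC_ALL=C comm -23 "$a_sorted" "$b_sorted"   # column 1 only: unique to $1
    rm -f "$a_sorted" "$b_sorted"
}

# Usage with the two directories from this issue:
# only_in_first \
#     /data/shared/glygen/releases/data/v-2.4.1/reviewed/human_protein_citations_uniprotkb.csv \
#     /data/projects/glygen/generated/datasets/reviewed/human_protein_citations_uniprotkb.csv \
#     | head
```

Comparing `wc -l` with `sort -u | wc -l` on each file separately would also show whether duplicate rows account for part of the difference.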
Screenshots showing this discrepancy: