Discrepancies in human_protein_citations_uniprotkb.csv

Luke-Johnson-5 commented 1 month ago

The file human_protein_citations_uniprotkb.csv in the directory /data/shared/glygen/releases/data/v-2.4.1/reviewed, created on Mar 6, is 204025 lines long, while the same file in the directory /data/projects/glygen/generated/datasets/reviewed, created one day later on Mar 7, is only 142448 lines long. Do you know why this discrepancy exists in the data? It seems to be causing issues while checking the datasets for 2.5. I talked to Karina, and she suggested it might be a timeout on the SPARQL endpoint.

Interestingly, the human_protein_citations_uniprotkb.stat.csv file in both directories is the same.
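One way to see which rows account for a line-count gap like this is to list the rows that appear in one copy of the file but not the other. The following is a minimal sketch, not the project's own tooling; the function name `rows_only_in_first` is hypothetical, and it assumes both files are plain line-oriented CSVs small enough to sort.

```shell
# Hypothetical helper: print rows present in the first file but missing
# from the second. comm requires sorted input, so both files are sorted
# (and de-duplicated) into temporary files first.
rows_only_in_first() {
    rof_a=$(mktemp) || return 1
    rof_b=$(mktemp) || { rm -f "$rof_a"; return 1; }
    LC_ALL=C sort -u "$1" > "$rof_a"
    LC_ALL=C sort -u "$2" > "$rof_b"
    # -2 suppresses rows unique to the second file, -3 suppresses common rows.
    LC_ALL=C comm -23 "$rof_a" "$rof_b"
    rm -f "$rof_a" "$rof_b"
}

# Example call (paths taken from the report above):
#   rows_only_in_first \
#     /data/shared/glygen/releases/data/v-2.4.1/reviewed/human_protein_citations_uniprotkb.csv \
#     /data/projects/glygen/generated/datasets/reviewed/human_protein_citations_uniprotkb.csv | head
```

Swapping the two arguments shows rows that exist only in the newer file, which helps distinguish a truncated file from one whose contents actually changed.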

(Screenshots showing this discrepancy were attached here.)

rykahsay commented 1 month ago

The PMIDs come from four sources, and as the counts below show, 142448 lines seems to be the right size:

unreviewed/human_protein_function_uniprotkb.csv
unreviewed/human_protein_ptm_annotation_uniprotkb.csv
unreviewed/human_protein_site_annotation_uniprotkb.csv
downloads/ebi/current/uniprot-proteome-homo-sapiens.nt

$ cat unreviewed/human_protein_site_annotation_uniprotkb.csv | awk -F"," '{print $10,$11}'  | grep protein_xref_pubmed |sort -u |wc
  30423   60846  997755

$ cat unreviewed/human_protein_ptm_annotation_uniprotkb.csv | awk -F"," '{print $2,$3}'  | grep protein_xref_pubmed |sort -u  |wc
   7544   15088  248046

$ cat unreviewed/human_protein_function_uniprotkb.csv | awk -F"," '{print $2,$3}'  | grep protein_xref_pubmed |sort -u  |wc
  29467   58934  969231

$ cat downloads/ebi/current/uniprot-proteome-homo-sapiens.nt  | grep "<http://purl.uniprot.org/core/citation>" | awk '{print $3}' |sort -u |wc
99136   99136 4433957
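The counts above are per source, so any pairs shared between sources are counted once in each. A combined, de-duplicated count across the CSV sources could be sketched as follows. This is only an illustration under assumptions: the function names `unique_pubmed_pairs` and `count_union_pairs` are hypothetical, the `FILE:COL_A:COL_B` argument format is invented here, and the column numbers are simply those used in the commands above. The `.nt` file has a different layout and is not handled.

```shell
# Hypothetical helper: print the unique values of two comma-separated
# columns for rows that mention protein_xref_pubmed, mirroring the
# awk | grep | sort -u pipeline used above.
# Usage: unique_pubmed_pairs FILE COL_A COL_B
unique_pubmed_pairs() {
    awk -F"," -v a="$2" -v b="$3" '{print $a, $b}' "$1" \
        | grep protein_xref_pubmed | LC_ALL=C sort -u
}

# Hypothetical helper: count the union of pairs across several sources,
# each given as FILE:COL_A:COL_B (file paths must not contain a colon).
count_union_pairs() {
    for spec in "$@"; do
        file=${spec%%:*}      # text before the first colon
        cols=${spec#*:}       # "COL_A:COL_B"
        unique_pubmed_pairs "$file" "${cols%%:*}" "${cols#*:}"
    done | LC_ALL=C sort -u | wc -l | tr -d ' '
}

# Example call, using the columns from the commands above:
#   count_union_pairs \
#     unreviewed/human_protein_site_annotation_uniprotkb.csv:10:11 \
#     unreviewed/human_protein_ptm_annotation_uniprotkb.csv:2:3 \
#     unreviewed/human_protein_function_uniprotkb.csv:2:3
```

Like the original commands, this splits on every comma, so it assumes the selected columns are not preceded by quoted fields that themselves contain commas.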

Luke-Johnson-5 commented 1 month ago

I see, this is all good then.

glygener / glygen-issues

Discrepancies in human_protein_citations_uniprotkb.csv #1247