glygener / glygen-issues

Repository for public GlyGen tickets
GNU General Public License v3.0

PMID duplicate mapping #133

Open kmartinez834 opened 1 year ago

kmartinez834 commented 1 year ago

Start using generated/misc/dup_pmid_mapping.csv when creating datasets.

Some of the Medline files indicate that the PMID is a duplicate:

$ cat /data/projects/glygen/downloads/ncbi/medline/pmid.31088862.txt
id: 31088862 Error occurred: PMID 31088862 is a duplicate of PMID 31267715
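A minimal sketch of how such an error line could be detected and split into its two PMIDs. The regular expression is based only on the single message shown above, so the wording in other files is an assumption, and `parse_duplicate` is a hypothetical helper name, not part of the existing GlyGen code.

```python
import re

# Matches the error text shown above. Assumption: other affected files
# use the same "PMID X is a duplicate of PMID Y" wording.
DUP_RE = re.compile(r"PMID\s+(\d+)\s+is a duplicate of PMID\s+(\d+)")


def parse_duplicate(text):
    """Return (duplicate_pmid, canonical_pmid) as strings, or None if no match."""
    match = DUP_RE.search(text)
    if match is None:
        return None
    return match.group(1), match.group(2)


line = "id: 31088862 Error occurred: PMID 31088862 is a duplicate of PMID 31267715"
print(parse_duplicate(line))  # ('31088862', '31267715')
```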

Issue continued from https://github.com/glygener/glygen-issues/issues/105

kmartinez834 commented 3 months ago

FYI @ubhuiyan and @katewarner ...

Not planning to address this for now, but be aware that some pmid txt files have errors.

kmartinez834 commented 3 months ago

I used /software/glygen/medline_dup_parser.py to print a list of files that have errors. There are currently only 20 out of 294034. I opened each of those files and added the ones reporting a duplicate to generated/misc/dup_pmid_mapping.csv.

Not the most sophisticated solution, so if you decide to start mapping PMIDs, you (or Robel) could write a script to generate the dup_pmid_mapping file.
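One possible shape for such a script, as a rough sketch only. It assumes the Medline files are named `pmid.<id>.txt` (as in the example path in this issue) and that duplicates are reported with the "PMID X is a duplicate of PMID Y" wording. The function names and the CSV column headers are hypothetical, since the actual layout of dup_pmid_mapping.csv is not shown here, and this is not the logic of the existing medline_dup_parser.py.

```python
import csv
import glob
import os
import re

# Assumption: duplicate errors always use this wording.
DUP_RE = re.compile(r"PMID\s+(\d+)\s+is a duplicate of PMID\s+(\d+)")


def collect_duplicates(medline_dir):
    """Scan pmid.*.txt files and return (duplicate_pmid, canonical_pmid) pairs."""
    rows = []
    for path in sorted(glob.glob(os.path.join(medline_dir, "pmid.*.txt"))):
        with open(path, encoding="utf-8", errors="replace") as handle:
            match = DUP_RE.search(handle.read())
        if match:
            rows.append((match.group(1), match.group(2)))
    return rows


def write_mapping(rows, out_path):
    """Write the pairs to a CSV file; the header names are hypothetical."""
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["duplicate_pmid", "canonical_pmid"])
        writer.writerows(rows)


# Example usage (paths taken from this issue; adjust as needed):
# rows = collect_duplicates("/data/projects/glygen/downloads/ncbi/medline")
# write_mapping(rows, "generated/misc/dup_pmid_mapping.csv")
```

Files with other kinds of errors are skipped silently in this sketch, so they would still need a manual look.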