waldronlab / bugsigdbr

R-side access to published microbial signatures from BugSigDB
https://bioconductor.org/packages/bugsigdbr
GNU General Public License v3.0

import PMID column as numeric #38

Closed: lwaldron closed this issue 1 year ago

lwaldron commented 1 year ago

We have one corner case (https://bugsigdb.org/Study_731) where a curator entered a leading zero for the PMID. The PubMed website ignores leading zeros (for example, try https://pubmed.ncbi.nlm.nih.gov/00000031682463/), so it works normally on bugsigdb.org. We should ignore it too by importing the PMID column as numeric. I noticed this because the PMID column was numeric before the exports broke 3 weeks ago, and now it is character.
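A minimal base R sketch of the effect described above, assuming the PMID arrives as text. The values are illustrative only: they reuse the number from the PubMed URL example, and the variable names are hypothetical.

```r
# Illustrative PMID strings; the second has a leading zero,
# like the corner case described in the comment above.
pmid_chr <- c("31682463", "031682463")

# Coercing to numeric discards the leading zero.
pmid_num <- as.numeric(pmid_chr)

# As character the two entries differ; as numeric they are equal.
same_as_character <- pmid_chr[1] == pmid_chr[2]
same_as_numeric <- pmid_num[1] == pmid_num[2]

print(same_as_character)
print(same_as_numeric)
```

This only shows why a numeric column hides the leading zero; it says nothing about how the package actually reads the export.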

lgeistlinger commented 1 year ago

Sorry, but it doesn't seem like a good idea to support that on bugsigdb.org itself as it leads to study duplication:

https://bugsigdb.org/Study_580 (PMID entered without leading 0)
https://bugsigdb.org/Study_731 (PMID entered with leading 0)

I would actually say this should be prevented directly on bugsigdb.org. I'll open an issue if you agree.
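A rough sketch of how such duplicates could be spotted on the R side, assuming a table with `Study` and `PMID` columns. The column names and all values here are made up for illustration and are not taken from the actual export.

```r
# Toy table; column names and PMID values are hypothetical.
studies <- data.frame(
  Study = c("Study 1", "Study 2", "Study 3"),
  PMID = c("12345", "012345", "67890"),
  stringsAsFactors = FALSE
)

# Strip leading zeros, keeping the result as character.
studies$PMID_norm <- sub("^0+", "", studies$PMID)

# Flag every row whose normalized PMID occurs more than once.
is_dup <- duplicated(studies$PMID_norm) |
  duplicated(studies$PMID_norm, fromLast = TRUE)
dup_studies <- studies[is_dup, c("Study", "PMID")]

print(dup_studies)
```

A check like this would only report the problem after the fact; preventing the leading zero at entry time on the wiki remains the cleaner fix.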

lgeistlinger commented 1 year ago

As for representing PMIDs as numeric or character: for type safety, I would rather make the case that they should be character, by the same argument we use for representing Entrez Gene IDs or NCBI Taxon IDs as character. They are identifiers, not numbers.
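To illustrate the difference at import time, here is a small sketch using base R's `read.csv` on an inline toy CSV. The file content and column names are assumptions, and this is not necessarily how bugsigdbr reads its export.

```r
# Toy CSV text; content and column names are hypothetical.
csv_text <- "Study,PMID\nStudy 1,012345\nStudy 2,67890\n"

# Default type guessing turns the PMID column into integers.
by_default <- read.csv(text = csv_text)

# Declaring the column as character keeps the identifier as entered.
as_char <- read.csv(text = csv_text, colClasses = c(PMID = "character"))

print(class(by_default$PMID))
print(as_char$PMID)
```

Keeping the column as character leaves any leading zero visible, so a malformed entry can be caught rather than silently normalized.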

lwaldron commented 1 year ago

Now that you've pointed out the duplication problem on the wiki, I agree about both keeping the type as character and preventing leading zeros in PMIDs on the wiki. Note that the PMID type had changed from numeric to character since the last working GitHub Actions (GHA) run.