Open nicklamiller opened 2 weeks ago
I agree it's confusing. The background is that EuropePMC uses all these identifiers internally and does not necessarily distinguish them by the corpus they come from. All IDs are treated the same.
Some options that come to mind:
- Rename the pmid field to something more generic like documentId.
- Create a more complex object with id and source.
We need to think about this and also about the downstream consequences.
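As a rough illustration of the second option (an object with id and source), here is a minimal Python sketch. The names DocumentRef and split_document_id are hypothetical, the prefix list contains only prefixes mentioned in this thread, and the "PMID" and "UNKNOWN" source labels are assumptions rather than EuropePMC terminology.

```python
import re
from dataclasses import dataclass

# Prefixes mentioned in this thread; what each one denotes is an assumption.
KNOWN_PREFIXES = ("PMC", "PPR", "IND")


@dataclass(frozen=True)
class DocumentRef:
    """Hypothetical record pairing an identifier with its source."""
    id: str
    source: str


def split_document_id(raw: str) -> DocumentRef:
    """Classify a raw identifier by its alphabetic prefix, if any."""
    raw = raw.strip()
    # Purely numeric values are treated as true PubMed IDs.
    if raw.isascii() and raw.isdigit():
        return DocumentRef(id=raw, source="PMID")
    # Otherwise expect letters followed by digits, e.g. a PPR-prefixed ID.
    match = re.fullmatch(r"([A-Za-z]+)([0-9]+)", raw)
    if match and match.group(1).upper() in KNOWN_PREFIXES:
        return DocumentRef(id=raw, source=match.group(1).upper())
    # Anything else is flagged instead of being guessed at.
    return DocumentRef(id=raw, source="UNKNOWN")
```

Keeping the full original string in id avoids losing information if the prefix turns out to be part of the identifier itself.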
> Rename the pmid field to something more generic like documentId.
That sounds prudent since documents can be things other than PubMed records. If backwards compatibility is needed, we could always keep true PubMed IDs in pmid and use documentId for the mixed-source IDs.
> Create a more complex object with id and source
I wonder if it's possible to keep as a single field but adopt CURIEs as per https://bioregistry.io/ or https://identifiers.org/ prefixes. @nicklamiller can you provide an example of each type of identifier?
Specifically are PPR and IND records singular databases? And if so which ones?
Or possibly EuropePMC has a source type that can be combined with the identifier for less ambiguous identification. I can't quite find EuropePMC docs on the different identifier/document types. SRC:PMC and SRC:PPR both filter search results.
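To make the single-field CURIE idea concrete, here is a minimal Python sketch. The function name to_curie is hypothetical. The pubmed and pmc prefixes are, to my knowledge, the ones Bioregistry uses, but that should be checked against the registry. No prefix is assumed for PPR or IND identifiers, so the sketch raises an error for them until a prefix is agreed on.

```python
import re


def to_curie(raw: str) -> str:
    """Convert a raw identifier into a prefix:local_id CURIE string."""
    raw = raw.strip()
    # Purely numeric values are treated as true PubMed IDs.
    if re.fullmatch(r"[0-9]+", raw):
        return f"pubmed:{raw}"
    # PMC identifiers keep the PMC part in the local identifier (assumption).
    if re.fullmatch(r"PMC[0-9]+", raw):
        return f"pmc:{raw}"
    # Other prefixes (e.g. PPR, IND) have no agreed CURIE prefix in this
    # thread, so fail loudly instead of inventing one.
    raise ValueError(f"no CURIE prefix known for identifier {raw!r}")
```

Raising on unmapped identifiers makes the open question in this thread explicit: each remaining prefix needs a decision before it can be encoded.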
@d0choa thank you for the prompt reply and @dhimmel thanks for the helpful links. A couple of example IDs:
> keep as a single field but adopt CURIEs
That plus SRC:PPR formatting (if possible) sounds like a great approach.
Describe the bug
There are rows that contain non-PMID IDs in the PMID field of the literature/matches data.

Observed behaviour
Specifically, there are IDs with the following prefixes (and what I suspect they mean):
In a single parquet file (used in the reprex below) from the literature/matches data, ~1.5% of distinct values in the PMID field are non-PMID IDs.

Expected behaviour
I believe the PMID field should only contain integer values corresponding to true PMIDs.

To Reproduce
Additional context
I've used a single file in the above reprex; however, this issue persists across all data corresponding to literature/matches, with ~2.6% of PMID values being non-PMID IDs.
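For readers who want to run a similar check, here is a minimal Python sketch of how such a percentage could be computed over distinct values. It is not the original reprex. The function name non_pmid_share is hypothetical, and it assumes the PMID column has already been loaded into a plain Python iterable.

```python
def non_pmid_share(ids) -> float:
    """Return the fraction of distinct values that are not purely numeric."""
    # Compare distinct values only, after trimming surrounding whitespace.
    distinct = {str(value).strip() for value in ids}
    if not distinct:
        return 0.0
    # A true PMID is assumed to consist of ASCII digits only.
    non_pmid = {v for v in distinct if not (v.isascii() and v.isdigit())}
    return len(non_pmid) / len(distinct)
```

Multiplying the result by 100 gives a percentage comparable to the figures quoted above.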