Non-PMID IDs in `PMID` field of `literature/matches` data

nicklamiller commented 2 weeks ago

Describe the bug: There are rows that contain non-PMID IDs in the PMID field of the literature/matches data.

Observed behaviour: Specifically, there are IDs with the following prefixes (and what I suspect they mean):

PMC - PMC IDs
PPR - pre-prints
IND - investigational new drug

In a single parquet file from the literature/matches data (the one used in the reprex below), ~1.5% of distinct values in the PMID field are non-PMID IDs.

Expected behaviour: I believe the PMID field should only contain integer values corresponding to true PMIDs.

To Reproduce

from pyspark.sql import SparkSession, functions as F

# Notebook cell: the leading "!" runs gsutil as a shell command.
!gsutil -u <project-name> cp \
    gs://open-targets-data-releases/24.03/output/etl/parquet/literature/matches/part-00003-ce7bd6be-7ab2-44aa-ae3f-53e0640fd1b9-c000.snappy.parquet \
    /tmp/

spark = SparkSession.builder.appName("test").getOrCreate()
sdf = spark.read.parquet("/tmp/part-00003-ce7bd6be-7ab2-44aa-ae3f-53e0640fd1b9-c000.snappy.parquet")
# Flag each distinct PMID value that contains an ASCII letter.
(
    sdf.select("PMID").distinct().withColumn(
        "has_non_pmid_id",
        F.when(F.col("PMID").rlike("[A-Za-z]"), 1).otherwise(0)
    )
    .select(
        # Alias the full expression so the output column is named pct_non_pmid_ids.
        (100 * F.mean("has_non_pmid_id")).alias("pct_non_pmid_ids"),
        # Summing the 0/1 flags gives the number of non-PMID IDs.
        F.sum("has_non_pmid_id").alias("num_non_pmid_ids")
    ).show()
)
# Show the rows whose PMID value contains a letter.
sdf.filter(F.col("PMID").rlike("[A-Za-z]")).show()
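
To see which prefixes account for that percentage, the distinct IDs can be tallied by their leading letters. The following is a minimal sketch in plain Python rather than Spark, assuming the distinct IDs have already been collected into a list. `prefix_counts` is a hypothetical helper, the "PMID" and "OTHER" bucket names are assumptions, and the sample IDs are made up for illustration, not taken from the parquet file.

```python
import re
from collections import Counter

# Leading run of ASCII letters, e.g. "PPR" in "PPR100001".
PREFIX_RE = re.compile(r"^[A-Za-z]+")


def prefix_counts(ids):
    """Count IDs per alphabetic prefix; all-digit IDs are bucketed as 'PMID'."""
    counts = Counter()
    for raw in ids:
        value = str(raw).strip()
        match = PREFIX_RE.match(value)
        if match:
            counts[match.group(0).upper()] += 1
        elif value.isdigit():
            counts["PMID"] += 1
        else:
            # Anything that is neither letter-prefixed nor purely numeric.
            counts["OTHER"] += 1
    return counts


# Illustrative values only (hypothetical, not real records).
sample_ids = ["31234567", "PMC7000001", "PPR100001", "IND600000001", "28000000"]
counts = prefix_counts(sample_ids)
print(counts)
```

On the full dataset, the same grouping could probably be done in Spark by extracting the prefix with `F.regexp_extract(F.col("PMID"), "^[A-Za-z]+", 0)` and grouping on the result.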

Additional context: I've used a single file in the above reprex; however, this issue persists across all of the literature/matches data, where ~2.6% of the values in the PMID field are non-PMID IDs.

d0choa commented 2 weeks ago

I agree it's confusing. The background is that EuropePMC uses all of these identifiers internally and doesn't necessarily separate them by the corpus they come from; all IDs are treated the same.

Some options that come to mind:

Rename the pmid field to something more generic like documentId.
Separate the ids into different fields
Create a more complex object with id and source
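
As a rough illustration of the third option, the sketch below pairs each identifier with a source label. It is not the Open Targets schema: `DocumentRef`, `parse_document_id`, and the source labels are hypothetical names, and the source is inferred only from the three prefixes reported in this issue.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class DocumentRef:
    """Hypothetical record pairing an identifier with a source label."""
    id: str
    source: str


# Optional known prefix followed by digits. Only the prefixes reported in
# this issue are recognised; what IND denotes is still unresolved.
_ID_RE = re.compile(r"^(PMC|PPR|IND)?(\d+)$")


def parse_document_id(raw: str) -> DocumentRef:
    """Split a raw ID into (id, source); unrecognised shapes get 'unknown'."""
    value = raw.strip()
    match = _ID_RE.match(value)
    if match is None:
        return DocumentRef(id=value, source="unknown")
    prefix = match.group(1)
    # A bare integer is assumed to be a true PubMed ID.
    return DocumentRef(id=value, source=prefix if prefix else "PMID")


print(parse_document_id("PPR138073"))
print(parse_document_id("12345678"))
```

The same function could also feed the second option, since separate fields can be populated by branching on the returned `source`.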

We need to think about this and also about the downstream consequences.

dhimmel commented 2 weeks ago

> Rename the pmid field to something more generic like documentId.

That sounds prudent, since documents can be things other than PubMed records. If backwards compatibility is needed, we could always keep true PubMed IDs in pmid and use documentId for the mixed-source IDs.

> Create a more complex object with id and source

I wonder if it's possible to keep a single field but adopt CURIEs, using https://bioregistry.io/ or https://identifiers.org/ prefixes. @nicklamiller can you provide an example of each type of identifier?

Specifically, do PPR and IND records each come from a single database? And if so, which ones?
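
For illustration, a single-field CURIE conversion might look like the sketch below. The prefix names (`pubmed`, `pmc`, `ppr`) and the choice to keep the letters inside the local identifier are assumptions that would need checking against whichever registry is adopted. `to_curie` is a hypothetical helper, and IND IDs are deliberately left unmapped because their source has not been identified.

```python
import re

# Raw-ID pattern -> CURIE prefix. The prefix names are assumptions to be
# verified against the chosen registry. IND has no rule: its source is unknown.
CURIE_RULES = [
    (re.compile(r"^\d+$"), "pubmed"),
    (re.compile(r"^PMC\d+$"), "pmc"),
    (re.compile(r"^PPR\d+$"), "ppr"),
]


def to_curie(raw):
    """Return 'prefix:local_id' for a recognised ID, or None otherwise."""
    value = raw.strip()
    for pattern, prefix in CURIE_RULES:
        if pattern.match(value):
            return f"{prefix}:{value}"
    return None


for example in ["12345678", "PMC7000001", "PPR138073", "IND607760035"]:
    print(example, "->", to_curie(example))
```

Returning `None` for unrecognised IDs keeps unmapped records visible, so they can be counted and reviewed rather than silently given a guessed prefix.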

dhimmel commented 2 weeks ago

Or possibly EuropePMC has a source type that can be combined with the identifier for less ambiguous identification. I can't quite find EuropePMC docs on the different identifier/document types, but SRC:PMC and SRC:PPR both filter search results.
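
A sketch of how a source filter could be combined with an identifier in a search query is shown below. It only builds the URL and makes no request. The REST endpoint and the `EXT_ID` search field are assumptions based on my understanding of the EuropePMC web service and should be checked against its documentation; `build_search_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

# Assumed EuropePMC REST search endpoint; verify against the current API docs.
SEARCH_URL = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_search_url(source, ext_id):
    """Build a search URL that restricts results to one source and one ID."""
    # SRC: is the filter mentioned above; EXT_ID: is an assumed field name.
    query = f"SRC:{source} AND EXT_ID:{ext_id}"
    return SEARCH_URL + "?" + urlencode({"query": query, "format": "json"})


url = build_search_url("PPR", "PPR138073")
print(url)
```

If the response reports the source alongside the ID, that pair could serve as the less ambiguous identifier.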

nicklamiller commented 2 weeks ago

@d0choa thank you for the prompt reply, and @dhimmel thanks for the helpful links. A couple of example IDs:

PPR138073
- according to bioregistry, the PPR prefix can come from several preprint servers e.g. bioRxiv, ChemRxiv
IND607760035 (not actually finding ID here, just used title to find paper)
- I'm not finding a database/source that IND denotes on either https://bioregistry.io/ or https://identifiers.org/

> keep as a single field but adopt CURIEs

That plus SRC:PPR formatting (if possible) sounds like a great approach.

Non-PMID IDs in `PMID` field of `literature/matches` data #3341