opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Non-PMID IDs in `PMID` field of `literature/matches` data #3341

Open nicklamiller opened 2 weeks ago

nicklamiller commented 2 weeks ago

Describe the bug: There are rows that contain non-PMID IDs in the PMID field of the literature/matches data.

Observed behaviour: Specifically, there are IDs with the following prefixes (and what I suspect they mean):

In a single parquet file from the literature/matches data (the one used in the reprex below), ~1.5% of the distinct values in the PMID field are non-PMID IDs.

Expected behaviour: I believe the PMID field should only contain integer values corresponding to true PMIDs.

To Reproduce:

from pyspark.sql import SparkSession, functions as F

!gsutil -u <project-name> cp \
    gs://open-targets-data-releases/24.03/output/etl/parquet/literature/matches/part-00003-ce7bd6be-7ab2-44aa-ae3f-53e0640fd1b9-c000.snappy.parquet \
    /tmp/

spark = SparkSession.builder.appName("test").getOrCreate()
sdf = spark.read.parquet("/tmp/part-00003-ce7bd6be-7ab2-44aa-ae3f-53e0640fd1b9-c000.snappy.parquet")
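# Share of distinct PMID values that contain a letter, plus how many there are.
(
    sdf.select("PMID").distinct().withColumn(
        "has_non_pmid_id",
        F.when(F.col("PMID").rlike("[A-Za-z]"), 1).otherwise(0)
    )
    .select(
        # Parenthesise before aliasing so the alias applies to the scaled mean.
        (100 * F.mean("has_non_pmid_id")).alias("pct_non_pmid_ids"),
        # Sum the 0/1 flag; count() would return the number of all distinct IDs.
        F.sum("has_non_pmid_id").alias("num_non_pmid_ids")
    ).show()
)
# Show the offending rows. "[A-Za-z]" is used here too, because "[A-z]" also
# matches the punctuation between "Z" and "a" in ASCII (such as "_" and "^").
sdf.filter(F.col("PMID").rlike("[A-Za-z]")).show()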

Additional context: I've used a single file in the above reprex; however, the issue persists across all of the literature/matches data, where ~2.6% of the values in the PMID field are non-PMID IDs.
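To see which prefixes account for the non-PMID values, the IDs could be tallied by their leading letters. This is a minimal sketch in plain Python, assuming the IDs have already been collected into a list of strings; `classify_id`, `tally_id_types`, and the sample values are hypothetical placeholders, not rows from the dataset.

```python
import re
from collections import Counter

# Leading run of ASCII letters, e.g. "PPR" in "PPR000001".
_PREFIX_RE = re.compile(r"[A-Za-z]+")


def classify_id(identifier):
    """Label an ID as "PMID" (digits only), by its letter prefix, or "OTHER"."""
    identifier = identifier.strip()
    if re.fullmatch(r"[0-9]+", identifier):
        return "PMID"
    match = _PREFIX_RE.match(identifier)
    return match.group(0).upper() if match else "OTHER"


def tally_id_types(identifiers):
    """Count how many identifiers fall under each label."""
    return Counter(classify_id(i) for i in identifiers)


# Made-up placeholder IDs for illustration only.
sample_ids = ["12345678", "PPR000001", "PMC0000001", "IND0000001", "23456789"]
counts = tally_id_types(sample_ids)
```

Running the same tally over the distinct values of the `PMID` column would give a per-prefix breakdown instead of a single percentage.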

d0choa commented 2 weeks ago

I agree it's confusing. The background is that EuropePMC uses all these identifiers internally and doesn't necessarily separate them by the corpus they come from. All IDs are treated the same.

Some options that come to mind:

We need to think about this and also about the downstream consequences.

dhimmel commented 2 weeks ago

> Rename the pmid field to something more generic like documentId.

That sounds prudent since documents can be things other than PubMed records. If backwards compatibility is needed, could always keep true PubMed IDs in pmid and use documentId for the mixed-source IDs.
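A rough sketch of that backwards-compatible split, assuming a "true" PubMed ID is any value made up only of digits. The function name `split_document_id` is hypothetical, and the field names simply follow the ones floated in this thread:

```python
import re


def split_document_id(raw_id):
    """Return a record with a generic documentId and a pmid that is only set
    when the raw value looks like a PubMed ID (digits only)."""
    raw_id = raw_id.strip()
    is_pmid = re.fullmatch(r"[0-9]+", raw_id) is not None
    return {"documentId": raw_id, "pmid": raw_id if is_pmid else None}


# Placeholder values, not taken from the dataset.
numeric_record = split_document_id("12345678")
preprint_record = split_document_id("PPR000001")
```

Existing consumers that only read `pmid` would then see nulls for non-PubMed documents rather than unexpected string IDs.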

> Create a more complex object with id and source

I wonder if it's possible to keep as a single field but adopt CURIEs as per https://bioregistry.io/ or https://identifiers.org/ prefixes. @nicklamiller can you provide an example of each type of identifier?

Specifically, are PPR and IND records each from a single database? And if so, which ones?
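To make the CURIE idea concrete, here is a minimal sketch of turning a raw ID into a `prefix:local_id` string. The mapping is an assumption: `pubmed` and `pmc` are intended to match registry-style prefixes but should be checked against whichever registry is chosen, while `ppr` and `ind` are pure placeholders until the question above is answered. `CURIE_PREFIXES` and `to_curie` are hypothetical names.

```python
import re

# Assumed mapping from the alphabetic ID prefix to a CURIE prefix.
# "ppr" and "ind" are placeholders, not confirmed registry prefixes.
CURIE_PREFIXES = {"PMC": "pmc", "PPR": "ppr", "IND": "ind"}


def to_curie(identifier, prefixes=CURIE_PREFIXES):
    """Format a raw ID as a CURIE; digit-only IDs are assumed to be PubMed IDs."""
    identifier = identifier.strip()
    if re.fullmatch(r"[0-9]+", identifier):
        return f"pubmed:{identifier}"
    match = re.match(r"[A-Za-z]+", identifier)
    if match and match.group(0).upper() in prefixes:
        # Keep the full original ID as the local part.
        return f"{prefixes[match.group(0).upper()]}:{identifier}"
    raise ValueError(f"Unrecognised identifier: {identifier!r}")


# Placeholder IDs for illustration only.
examples = [to_curie("12345678"), to_curie("PMC0000001")]
```

Raising on unknown prefixes, rather than guessing, would surface any new identifier type as soon as it appears in the data.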

dhimmel commented 2 weeks ago

Or possibly EuropePMC has a source type that can be combined with the identifier for less ambiguous identification. I can't quite find EuropePMC docs on the different identifier/document types. SRC:PMC and SRC:PPR both filter search results.
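For illustration, a source filter and an identifier could be combined into one search query roughly as follows. This only builds the URL and makes no request. The REST endpoint and the `EXT_ID` field are my assumptions about the EuropePMC search API and should be verified against their docs; only the `SRC:` filter comes from the observation above. `build_search_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

# Assumed EuropePMC REST search endpoint; verify before relying on it.
EUROPE_PMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_search_url(identifier, source):
    """Build a search URL that pairs an identifier with its source code."""
    query = f"EXT_ID:{identifier} AND SRC:{source}"
    return f"{EUROPE_PMC_SEARCH}?{urlencode({'query': query, 'format': 'json'})}"


# Placeholder ID for illustration only.
url = build_search_url("PPR000001", "PPR")
```

If the pairing works as assumed, storing the source code next to the ID would be enough to disambiguate records.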

nicklamiller commented 2 weeks ago

@d0choa thank you for the prompt reply and @dhimmel thanks for the helpful links. A couple of example IDs:

> keep as a single field but adopt CURIEs

That plus SRC:PPR formatting (if possible) sounds like a great approach.