opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Missing patent data in production #3571

Open ireneisdoomed opened 1 month ago

ireneisdoomed commented 1 month ago

**Describe the bug**
There are no patent references in the EPMC data in production (both in evidence and in matches). This causes a drop of at least 91,238 EPMC evidence records.

**Observed behaviour**

  1. The 23.02 release notes include a screenshot of evidence from patents (left); on the right is the EPMC evidence for the same association in production, without the patent data. [Image]

  2. Looking more systematically, I couldn't find any patent references in the evidence or matches datasets.

    
     ```python
     # Assumes an active SparkSession named `spark`.
     from pyspark.sql import functions as f

     # Evidence: count references by prefix; purely numeric IDs are labelled "PMID".
     epmc_refs = spark.read.parquet("gs://open-targets-pre-data-releases/24.09/output/etl/parquet/evidence/sourceId=europepmc").select(f.explode("literature").alias("ref"))
     epmc_refs_prefix = epmc_refs.withColumn('prefix', f.regexp_replace('ref', r"\d+\b", '')).withColumn("prefix", f.when(f.length("prefix") == 0, f.lit("PMID")).otherwise(f.col("prefix")))
     epmc_refs_prefix.groupBy(f.col("prefix")).count().orderBy(f.col("count").desc()).show()
     +------+--------+
     |prefix|   count|
     +------+--------+
     |  PMID|11516373|
     |   PPR|  117879|
     |   PMC|   25371|
     |   IND|    3828|
     +------+--------+
     ```

     ```python
     # Matches
     matches = spark.read.parquet("gs://open-targets-pre-data-releases/24.09/output/etl/parquet/literature/matches")
     matches_prefix = matches.select("pmid").withColumn('prefix', f.regexp_replace('pmid', r"\d+\b", '')).withColumn("prefix", f.when(f.length("prefix") == 0, f.lit("PMID")).otherwise(f.col("prefix")))
     matches_prefix.groupBy(f.col("prefix")).count().orderBy(f.col("count").desc()).show()
     +------+---------+
     |prefix|    count|
     +------+---------+
     |  PMID|526707236|
     |   PPR|  2639130|
     |   PMC|   664915|
     |   IND|   646238|
     |     c|        9|
     +------+---------+
     ```
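
The prefix logic used in the Spark snippets above can also be checked locally without a cluster. Below is a minimal sketch in plain Python that mirrors the `regexp_replace` step. The function name `reference_prefix` is illustrative and not part of the ETL, and two of the sample IDs (`PPR123456`, `PMC7654321`) are hypothetical; the rest are copied from the 23.02 output in this issue.

```python
import re
from collections import Counter


def reference_prefix(ref: str) -> str:
    """Strip the trailing run of digits; an all-digit ID is labelled 'PMID'."""
    prefix = re.sub(r"\d+\b", "", ref)
    return prefix if prefix else "PMID"


# IDs copied from the 23.02 output in this issue, plus two hypothetical
# IDs (PPR123456, PMC7654321) shaped like preprint and PMC identifiers.
refs = [
    "15905428",
    "EP1994936",
    "KR20080034212",
    "US2005271641",
    "PPR123456",
    "PMC7654321",
]

prefix_counts = Counter(reference_prefix(r) for r in refs)
for prefix, count in prefix_counts.most_common():
    print(prefix, count)
```

Under this scheme, patent references would show up as two-letter prefixes such as `EP`, `KR`, `US`, `CA` or `WO`. None of these appear in the 24.09 counts above.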


**Expected behaviour**
This is the evidence for the association mentioned above in 23.02:
```python
old_epmc_refs = spark.read.parquet("gs://open-targets-data-releases/23.02/output/etl/parquet/evidence/sourceId=europepmc").select(f.explode("literature").alias("ref"))
old_epmc_refs.filter(f.col("targetId") == "ENSG00000092969").filter(f.col("diseaseId") == "EFO_1001034").show()
+-------------+
|          ref|
+-------------+
|    EP1994936|
|KR20080034212|
| US2005271641|
|    EP1758604|
|     15905428|
| US2010129336|
|    CA2560472|
| WO2005117532|
|KR20090038493|
+-------------+
```

ireneisdoomed commented 1 month ago

The problem has been present since 23.06, meaning we only had patent data in 23.02.

```python
spark.read.parquet("gs://open-targets-data-releases/23.06/output/etl/parquet/evidence/sourceId=europepmc").select(f.explode("literature").alias("ref")).filter(f.col("targetId") == "ENSG00000092969").filter(f.col("diseaseId") == "EFO_1001034").show()
+--------+
|     ref|
+--------+
|35079053|
|15905428|
+--------+
```
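
To make the difference between the two releases explicit, the reference lists printed in this issue can be compared as sets. This is a small plain-Python sketch that uses only the values shown above. The variable names are illustrative, and the "two capital letters followed by digits" check for patent-like IDs is an assumption based on the 23.02 examples, not a documented rule.

```python
import re

# References for ENSG00000092969 / EFO_1001034, copied from the outputs above.
refs_23_02 = {
    "EP1994936", "KR20080034212", "US2005271641", "EP1758604", "15905428",
    "US2010129336", "CA2560472", "WO2005117532", "KR20090038493",
}
refs_23_06 = {"35079053", "15905428"}

# Assumption: a patent reference is a two-letter office code followed by digits.
PATENT_LIKE = re.compile(r"^[A-Z]{2}\d+$")

missing = sorted(refs_23_02 - refs_23_06)   # present in 23.02, absent in 23.06
missing_patents = [r for r in missing if PATENT_LIKE.match(r)]
added = sorted(refs_23_06 - refs_23_02)     # new in 23.06

print(f"missing in 23.06: {missing}")
print(f"patent-like among missing: {len(missing_patents)} of {len(missing)}")
print(f"new in 23.06: {added}")
```

For this association, every reference lost between 23.02 and 23.06 is patent-like under that assumption, and the only reference the two releases share is 15905428.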