Open ireneisdoomed opened 1 month ago
The problem has been present since 23.06, meaning we only had patent data in 23.02.
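# Assumes an active SparkSession named `spark`
from pyspark.sql import functions as f

# Literature references for one target-disease pair in the 23.06 EPMC evidence
(
    spark.read.parquet("gs://open-targets-data-releases/23.06/output/etl/parquet/evidence/sourceId=europepmc")
    .select(f.explode("literature").alias("ref"))
    .filter(f.col("targetId") == "ENSG00000092969")
    .filter(f.col("diseaseId") == "EFO_1001034")
    .show()
)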
+--------+
|     ref|
+--------+
|35079053|
|15905428|
+--------+
Describe the bug
There are no patent references in the EPMC data in production (in either the evidence or the matches datasets). This causes a drop of at least 91,238 in the number of EPMC evidence records.
Observed behaviour
Left: a screenshot from the 23.02 release notes showing evidence from patents. Right: the EPMC evidence for the same association in production, without the patent data.
Looking more systematically, I couldn't find any patent references in the evidence or matches datasets.
Matches
matches = spark.read.parquet("gs://open-targets-pre-data-releases/24.09/output/etl/parquet/literature/matches")
matches_prefix = (
    matches.select("pmid")
    .withColumn('prefix', f.regexp_replace('pmid', r"\d+\b", ''))
    .withColumn("prefix", f.when(f.length("prefix") == 0, f.lit("PMID")).otherwise(f.col("prefix")))
)
matches_prefix.groupBy(f.col("prefix")).count().orderBy(f.col("count").desc()).show()
+------+---------+
|prefix|    count|
+------+---------+
|  PMID|526707236|
|   PPR|  2639130|
|   PMC|   664915|
|   IND|   646238|
|     c|        9|
+------+---------+
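For anyone who wants to sanity-check the prefix logic without a Spark session, here is a minimal pure-Python sketch of the same idea. It assumes Python's `re` treats the pattern `\d+\b` the same way Spark's `regexp_replace` does for plain ASCII identifiers. The helper name `id_prefix` and the sample identifiers are made up for illustration and are not taken from the dataset.

```python
import re
from collections import Counter


def id_prefix(ref_id: str) -> str:
    """Remove every run of digits that ends at a word boundary.

    If nothing is left, the identifier was purely numeric and is
    labelled "PMID", mirroring the when/otherwise step in the Spark query.
    """
    prefix = re.sub(r"\d+\b", "", ref_id)
    return prefix if prefix else "PMID"


# Hypothetical identifiers, for illustration only.
sample_ids = ["35079053", "15905428", "PPR123456", "PMC664915", "IND601234"]

prefix_counts = Counter(id_prefix(ref_id) for ref_id in sample_ids)
for prefix, count in prefix_counts.most_common():
    print(prefix, count)
```

With this grouping, an identifier carrying any other prefix would show up as its own key in `prefix_counts`, which is the same property the Spark aggregation relies on when looking for patent references.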