Open ireneisdoomed opened 1 year ago
From the dataset described here, I have defined whether a mentioned entity is a disease or a phenotype by looking at the ancestors for each ID.
If "phenotype" or "measurement" is an ancestor of the mentioned ID (or is the ID itself), the term is tagged as a phenotype; otherwise it is considered a disease.
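To make the rule concrete, here is a minimal sketch in plain Python, independent of Spark. The term IDs and the ancestors map are hypothetical placeholders I made up for illustration; they are not taken from the ontology or the dataset.

```python
# Toy illustration of the tagging rule; every ID below is a hypothetical placeholder.
PHENOTYPE_ROOTS = {"TOY_PHENOTYPE", "TOY_MEASUREMENT"}

# Term ID -> list of ancestor IDs (hand-written, not taken from the ontology).
ANCESTORS = {
    "TOY_FEVER": ["TOY_PHENOTYPE"],
    "TOY_BMI": ["TOY_MEASUREMENT"],
    "TOY_ASTHMA": ["TOY_RESPIRATORY_DISEASE"],
    "TOY_PHENOTYPE": [],
}


def tag_term(term_id: str) -> str:
    """Return "phenotype" if the term or any of its ancestors is a phenotype root, else "disease"."""
    # Include the term itself so that the root terms are tagged as phenotypes too.
    lineage = {term_id, *ANCESTORS.get(term_id, [])}
    return "phenotype" if lineage & PHENOTYPE_ROOTS else "disease"


tags = {term: tag_term(term) for term in ANCESTORS}
print(tags)
```

Adding the term to its own lineage mirrors the array_union step in the Spark code, so a mention of the root term itself is also counted as a phenotype.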
In the end, out of the 136_814_572 strings, 11_416_088 (~8%) have been labeled as phenotypes.
Code to reproduce:
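from pyspark.sql import functions as f

# Assumed EFO IDs for "phenotype" and "measurement"; the original snippet does not show this definition.
phenotype_ids = ["EFO_0000651", "EFO_0001444"]

diseases = (
    spark.read.parquet("gs://open-targets-pre-data-releases/23.02/output/etl/parquet/diseases")
    # Add each term to its own ancestor list so that the root terms themselves are matched.
    .withColumn("ancestors", f.array_union(f.array(f.col("id")), f.col("ancestors")))
    .withColumn(
        "isPhenotype",
        f.when(
            f.array_contains(f.col("ancestors"), phenotype_ids[0])
            | f.array_contains(f.col("ancestors"), phenotype_ids[1]),
            f.lit(True),
        ).otherwise(f.lit(False)),
    )
    # Every term that is not a phenotype is treated as a disease.
    .withColumn("isDisease", ~f.col("isPhenotype"))
)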
matched_diseases = matched_diseases.join(diseases.withColumnRenamed("id", "efo_id"), on="efo_id", how="inner")
matched_diseases.write.option("compression", "gzip").mode("overwrite").parquet("gs://ot-team/irene/matches_diseases")
matched_diseases.printSchema()
root
 |-- efo_id: string (nullable = true)
 |-- pmid: string (nullable = true)
 |-- pmcid: string (nullable = true)
 |-- text: string (nullable = true)
 |-- label: string (nullable = true)
 |-- ancestors: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- isPhenotype: boolean (nullable = true)
 |-- isDisease: boolean (nullable = true)
Now we have a dataset to build our vector space.