Definition of disease/phenotype linkage

Starting from the dataset described here, I classified each mentioned entity as either a disease or a phenotype by looking at the ancestors of its ID.

If "phenotype" or "measurement" is the mentioned ID itself or one of its ancestors, the term is tagged as a phenotype; otherwise it is considered a disease.
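The rule above can be sketched in plain Python, independent of Spark. This is only an illustration: the root IDs and term names below are hypothetical placeholders, since the actual ontology IDs for "phenotype" and "measurement" are not given in this issue.

```python
# Hypothetical placeholder IDs standing in for the ontology terms
# "phenotype" and "measurement" (the real IDs are not shown in this issue).
PHENOTYPE_ROOTS = {"PHENOTYPE_ROOT_ID", "MEASUREMENT_ROOT_ID"}


def is_phenotype(term_id, ancestors):
    """Return True if the term itself or any ancestor is a phenotype root."""
    # Treat a missing ancestor list as empty, so such terms fall back to "disease".
    lineage = {term_id, *(ancestors or [])}
    return not lineage.isdisjoint(PHENOTYPE_ROOTS)


def classify(term_id, ancestors):
    """Apply the rule: phenotype if a root is in the lineage, otherwise disease."""
    return "phenotype" if is_phenotype(term_id, ancestors) else "disease"


# Made-up terms for illustration only.
print(classify("TERM_A", ["MEASUREMENT_ROOT_ID", "TOP"]))
print(classify("TERM_B", ["DISEASE_ROOT_ID"]))
print(classify("PHENOTYPE_ROOT_ID", []))
```

Including the term's own ID in the lineage means the root terms themselves are also tagged as phenotypes.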

In the end, out of the 136_814_572 strings, 11_416_088 (~8%) have been labeled as phenotypes.
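As a quick sanity check, the phenotype share can be recomputed from the two counts quoted above. The variable names are only illustrative:

```python
# Counts taken from the sentence above.
total_strings = 136_814_572
phenotype_strings = 11_416_088

# Fraction of matched strings labeled as phenotypes.
phenotype_share = phenotype_strings / total_strings
disease_strings = total_strings - phenotype_strings

print(f"phenotypes: {phenotype_share:.1%}")
print(f"diseases:   {disease_strings} strings")
```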

Code to reproduce:

import pyspark.sql.functions as f

# `spark` is an active SparkSession; `phenotype_ids` holds the IDs of the
# "phenotype" and "measurement" terms (both are defined outside this snippet).
diseases = (
    spark.read.parquet("gs://open-targets-pre-data-releases/23.02/output/etl/parquet/diseases")
    # Add the term's own ID to its ancestors so the root terms are tagged too.
    .withColumn("ancestors", f.array_union(f.array(f.col("id")), f.col("ancestors")))
    .withColumn(
        "isPhenotype",
        f.when(
            f.array_contains(f.col("ancestors"), phenotype_ids[0])
            | f.array_contains(f.col("ancestors"), phenotype_ids[1]),
            f.lit(True),
        ).otherwise(f.lit(False)),
    )
    .withColumn("isDisease", ~f.col("isPhenotype"))
)

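# `matched_diseases` is the existing dataset of matched strings, keyed by `efo_id`.
# The inner join drops matches whose ID is not present in the disease index.
matched_diseases = matched_diseases.join(diseases.withColumnRenamed("id", "efo_id"), on="efo_id", how="inner")
matched_diseases.write.option("compression", "gzip").mode("overwrite").parquet("gs://ot-team/irene/matches_diseases")
matched_diseases.printSchema()

root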
 |-- efo_id: string (nullable = true)
 |-- pmid: string (nullable = true)
 |-- pmcid: string (nullable = true)
 |-- text: string (nullable = true)
 |-- label: string (nullable = true)
 |-- ancestors: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- isPhenotype: boolean (nullable = true)
 |-- isDisease: boolean (nullable = true)

Now we have a dataset to build our vector space.
