The genetics ETL aggregates l2g predictions and study data to build disease/target evidence. The platform ETL picks up this dataset and integrates it with other evidence sources to build disease/target associations. To evaluate the performance of l2g predictions, it is desirable to work with the association data. However, producing it requires a full platform ETL run, which makes l2g iteration very slow.
The aim of this issue is to add a step to gentropy, and subsequently add one more task to the ETL orchestration, to build direct and indirect association datasets.
Direct associations: take the evidence, group by target/disease, and apply a harmonic sum on the l2g scores.
Indirect associations: take the evidence, join it with the disease index, explode the parent terms, group by target/parent disease, and apply a harmonic sum on the l2g scores.
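The two aggregations can be sketched in plain Python to make the intended logic concrete. This is only an illustration, not the gentropy implementation: the field names (`targetId`, `diseaseId`, `score`), the `disease_parents` mapping standing in for the disease index, and the function names are all hypothetical. The harmonic sum is assumed here to be the sum of `score_i / i^2` over scores sorted in descending order, without any normalisation; the real step may define it differently.

```python
from collections import defaultdict


def harmonic_sum(scores):
    """Assumed definition: sum of score_i / i^2, scores sorted descending, i from 1."""
    ordered = sorted(scores, reverse=True)
    return sum(s / (i ** 2) for i, s in enumerate(ordered, start=1))


def direct_associations(evidence):
    """Group evidence by (target, disease) and harmonic-sum the l2g scores."""
    grouped = defaultdict(list)
    for row in evidence:
        grouped[(row["targetId"], row["diseaseId"])].append(row["score"])
    return {key: harmonic_sum(scores) for key, scores in grouped.items()}


def indirect_associations(evidence, disease_parents):
    """Explode each evidence row to its parent terms, then group by (target, parent)."""
    grouped = defaultdict(list)
    for row in evidence:
        # disease_parents is a hypothetical stand-in for the disease index join.
        for parent in disease_parents.get(row["diseaseId"], []):
            grouped[(row["targetId"], parent)].append(row["score"])
    return {key: harmonic_sum(scores) for key, scores in grouped.items()}


# Toy data, invented for illustration only.
evidence = [
    {"targetId": "T1", "diseaseId": "D1", "score": 0.8},
    {"targetId": "T1", "diseaseId": "D1", "score": 0.4},
    {"targetId": "T1", "diseaseId": "D2", "score": 0.5},
]
disease_parents = {"D1": ["P"], "D2": ["P"]}

direct = direct_associations(evidence)
indirect = indirect_associations(evidence, disease_parents)
```

In this sketch the indirect dataset contains only the parent terms, as described above. Whether the disease itself should also be included in the indirect grouping is left open here.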
These two datasets need to be saved as parquet files together with the other ETL outputs. Important: these datasets are not ingested by the platform ETL.