ireneisdoomed opened this issue 2 years ago
The code has been adapted to follow the mentioned requirements (PR #19).
Instructions to run the code are available in the docs: https://github.com/opentargets/ot-release-metrics/blob/il-2614/docs/metric-calculation.md
We probably want to scale up the machine, since the run currently takes a significant amount of time (~45 min).
I've been able to generate the metrics for the latest ETL outputs in the dev bucket: gs://open-targets-pre-data-releases/development/output/etl/parquet
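Since the inputs now live in a bucket rather than on the VM, here is a minimal sketch of how the pipeline could accept `gs://` locations alongside local paths (the helper name and behaviour are assumptions for illustration, not the actual implementation):

```python
from urllib.parse import urlparse


def resolve_input(path: str) -> str:
    """Accept gs:// URIs and local paths; reject anything else.

    When inputs are gs:// URIs, Spark on a Dataproc cluster can read the
    parquet directly, so nothing needs to be copied to a VM first.
    (Hypothetical helper for illustration.)
    """
    scheme = urlparse(path).scheme
    if scheme in ("", "file", "gs"):
        return path
    raise ValueError(f"Unsupported input location: {path}")


# Example with the dev bucket mentioned above:
etl_path = resolve_input(
    "gs://open-targets-pre-data-releases/development/output/etl/parquet"
)
```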
The outputs are uploaded to 2 locations:

- `metadata/metrics/${run_id}.csv`
- `gs://otar000-evidence_input/release-metrics`
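As a sketch, the two destinations above could be derived from a run ID like so (the helper, the `release_bucket` parameter, and the filename used under the second location are all assumptions for illustration):

```python
def metric_output_paths(run_id: str, release_bucket: str) -> list[str]:
    """Build the two GCS destinations for a metrics CSV.

    The first path is relative to a release bucket; the second is the
    fixed evidence-input bucket mentioned above. (Hypothetical helper.)
    """
    filename = f"{run_id}.csv"
    return [
        f"gs://{release_bucket}/metadata/metrics/{filename}",
        f"gs://otar000-evidence_input/release-metrics/{filename}",
    ]
```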
Note: we've temporarily dropped the functionality of running the pre-ETL metrics, as PIS is a dependency at the moment and I haven't figured that out yet.
Now I guess it's a matter of plugging it into the ETL.
Please also see the link to a relevant closed ticket.
@ireneisdoomed and @prashantuniyal02, I think we should close this issue, as the metrics work is taking a completely different shape.
@ireneisdoomed Let's revisit this issue, as it's my understanding that some containerization work around the metrics pipeline is being completed by @tskir .
### Context
The OT Metrics are computed on a VM. All the necessary files are copied to the VM and then the relevant metrics are generated.
We want to migrate the logic to accept `gs://` locations for the input data, so that no local files are used and the pipeline can therefore be submitted to a Dataproc cluster.

The current workflow is explained in detail in the Metrics repo: https://github.com/opentargets/ot-release-metrics#run. It essentially entails:
1. Copy the relevant files from the `gs://open-targets-data-releases` buckets (for the post-pipeline metrics).
2. Run `metrics.py`, which will generate a CSV file with all the metrics.
3. Store the results in the `data` folder in GitHub.

### Actions
We want to streamline the above workflow. For that, we've identified 3 actions: