opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Improve the integration of the Metrics pipeline in the release #2612

Open ireneisdoomed opened 2 years ago

ireneisdoomed commented 2 years ago

Context

The OT Metrics are computed on a VM. All the necessary files are copied to the VM and then the relevant metrics are generated.

We want to migrate the logic to accept gs:// locations for the input data, so that no local files are used and therefore the pipeline can be submitted to a Dataproc cluster.
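
As a rough illustration of what accepting gs:// locations could involve, here is a minimal sketch that validates an input location and splits it into a bucket and an object prefix. The function names `split_gcs_uri` and `is_remote` are hypothetical and not taken from the metrics repo; only the standard library is used.

```python
from typing import Tuple
from urllib.parse import urlparse


def is_remote(path: str) -> bool:
    """Return True if the path points at a gs:// location (hypothetical helper)."""
    return path.startswith("gs://")


def split_gcs_uri(uri: str) -> Tuple[str, str]:
    """Split a gs:// URI into (bucket, object_prefix).

    Raises ValueError for anything that is not a gs:// location,
    so local paths are rejected early instead of failing on the cluster.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"Not a gs:// location: {uri!r}")
    # The path component keeps its leading slash; strip it to get the prefix.
    return parsed.netloc, parsed.path.lstrip("/")


bucket, prefix = split_gcs_uri("gs://open-targets-data-releases/some/prefix")
print(bucket, prefix)
```

The prefix `some/prefix` is a placeholder. If the script reads its inputs with Spark on Dataproc, which is an assumption here, gs:// paths can usually be passed straight to the reader, so a check like this would mainly serve as input validation.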

The current workflow is explained in detail in the Metrics repo (https://github.com/opentargets/ot-release-metrics#run). It essentially entails:

  1. Setting up the environment in a VM
  2. Copying the input data: we use PIS for the pre-pipeline metrics and the gs://open-targets-data-releases buckets for the post-pipeline metrics.
  3. Running metrics.py. This will generate a CSV file with all the metrics.
  4. Uploading the CSV to the data folder on GitHub.
  5. The app will be automatically redeployed to make the latest data available.

Actions

We want to streamline the above workflow. For that, we've identified 3 actions:

  1. Adapt the metrics script to work with remote data (#2613)
  2. Parametrise the metrics script with a configuration file (#2614)
  3. Make the metrics app fetch data from a Google Cloud Storage bucket (#2615)
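
To make the first two actions concrete, here is a minimal sketch of a configuration-driven run that only accepts remote inputs. Everything in it is an assumption for illustration: the `MetricsConfig` class, the `load_config` function, the keys `run_id`, `inputs` and `output_path`, and the bucket names in the example are hypothetical, and JSON is used only to stay within the standard library (the actual repo may use a different format).

```python
import json
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class MetricsConfig:
    """Hypothetical settings for one metrics run."""
    run_id: str
    inputs: Dict[str, str]   # dataset name -> gs:// location
    output_path: str         # where the metrics CSV would be written


def load_config(text: str) -> MetricsConfig:
    """Parse a JSON config and reject any input that is not a gs:// location."""
    raw = json.loads(text)
    config = MetricsConfig(
        run_id=raw["run_id"],
        inputs=dict(raw["inputs"]),
        output_path=raw["output_path"],
    )
    local = [name for name, path in config.inputs.items()
             if not path.startswith("gs://")]
    if local:
        raise ValueError(f"Inputs must be gs:// locations, got local paths for: {local}")
    return config


# Placeholder values, not real release paths.
EXAMPLE = """
{
  "run_id": "example-run",
  "inputs": {"evidence": "gs://example-bucket/etl/parquet/evidence"},
  "output_path": "gs://example-bucket/metrics/example-run.csv"
}
"""

config = load_config(EXAMPLE)
print(config.run_id, sorted(config.inputs))
```

Keeping every location in one file like this would mean that a new release only requires a config change rather than edits to the script.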

ireneisdoomed commented 1 year ago

The code has been adapted to follow the mentioned requirements (PR#19).

Instructions to run the code are available in the docs: https://github.com/opentargets/ot-release-metrics/blob/il-2614/docs/metric-calculation.md

We probably want to scale up the machine, since a run currently takes about 45 minutes. I've been able to generate the metrics for the latest ETL outputs in the dev bucket: gs://open-targets-pre-data-releases/development/output/etl/parquet

The outputs are uploaded to 2 locations:

Note: we've temporarily dropped the ability to run the pre-ETL metrics, because they currently depend on PIS and I haven't figured out how to handle that yet.

Now I guess it's a matter of plugging it into the ETL.

buniello commented 1 year ago

Please also see the link to a relevant closed ticket.

mbdebian commented 1 year ago

@ireneisdoomed and @prashantuniyal02, I think we should close this issue, as the metrics work is taking a completely different shape.

mbdebian commented 5 months ago

@ireneisdoomed Let's revisit this issue, as it's my understanding that some containerization work around the metrics pipeline is being completed by @tskir.