opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0
12 stars 2 forks source link

Merge `epmc` into the `literature` step #3099

Closed ehsanb29 closed 1 month ago

ehsanb29 commented 1 year ago

As a developer I want to include the epmc into the literature step based on @d0choa's comment here: "They are conceptually the same step"

Background

epmc step corrently runs after literature step. evidence step runs after epmc although it wasn't mentioned here

Tasks

Acceptance tests

How do we know the task is complete?

  1. When the tests was successful.
  2. When Data team approved the outputs.
prashantuniyal02 commented 11 months ago

depends on: #3111

mbdebian commented 6 months ago

@remo87 , would you mind checking with @d0choa on whether this is doable for the next release? Thanks!

d0choa commented 6 months ago

This is a refactor. These are 2 steps in the ETL that are virtually the same thing. They process the literature inputs and produce different outputs. In the past, one of them was a different repo, that was eventually migrated to the same repo but never harmonised with the existing codebase. At the moment, they have 2 configuration objects, 2 steps, etc. but as said above they are conceptually the same thing.

@remo87 if you can spend a little bit of time evaluating the complexity of the task it could help. Nothing is broken but we could clean up quite a bit of code and make things more clear.

remo87 commented 6 months ago

@d0choa @mbdebian I just checked the code and there shouldn't be any issues with merging these two steps. I think I can start working on this item.

remo87 commented 5 months ago

The merging is done and the code is located in the branch 3099-merge-epmc I ran a test of the ETL and the output that was generated is located in gs://open-targets-pre-data-releases/ricardo_24.04-2/output @ireneisdoomed could you help me checking that the output is correct and hasn't been affected by the change?

ireneisdoomed commented 5 months ago

The changes in 3099-merge-epmc consist of moving the configuration for the epmc step into the literature one.

This affects 2 outputs:

I've compared Ricardo's outputs (generated on 19/04) against the latest ETL run (generated on 23/04).

Technically my understanding is that we shouldn't see any difference in these outputs. However, this is not straightforward to prove because the inputs to the literature step change between runs and are therefore not versioned. I think numbers indicate something is off here, I'd suggest reviewing @remo87. Ideally with test input data so you can reproduce.

remo87 commented 5 months ago

Hi @ireneisdoomed yesterday I ran the literature step with and without the epmc integrated. I ran the step with a subset of the literature inputs ignoring the inputs from 2022. These inputs are located in gs://open-targets-pre-data-releases/ricardo-24.05/ml02/. The outputs for the literature step without epmc are located in gs://open-targets-pre-data-releases/ricardo-24.05/tmp/ and the outputs for the step with with epmc integrated are located in gs://open-targets-pre-data-releases/ricardo-24.05/output/etl/parquet/literature

ireneisdoomed commented 5 months ago

@remo87 can you do the same checks I did on both runs to see your changes don't have an impact? Thanks

remo87 commented 4 months ago

Hi @ireneisdoomed, I updated the outputs running the steps again using the same subset of inputs and copied the outputs to gs://open-targets-pre-data-releases/ricardo-24.05/tmp the outputs with the epmc included are in the epmc folder and the outputs for the etl that doesn't have the epmc included are in the noepmc folder. I compared both outputs and I got the same count in both the count for the epmcCoocurrences was 17270 in both cases and the evidences were 30275.

ireneisdoomed commented 4 months ago

Cool! Green flag from me then