Closed ehsanb29 closed 2 months ago
depends on: #3111
@remo87, would you mind checking with @d0choa on whether this is doable for the next release? Thanks!
This is a refactor. These are 2 steps in the ETL that are virtually the same thing. They process the literature inputs and produce different outputs. In the past, one of them was a different repo, that was eventually migrated to the same repo but never harmonised with the existing codebase. At the moment, they have 2 configuration objects, 2 steps, etc. but as said above they are conceptually the same thing.
@remo87 if you can spend a little bit of time evaluating the complexity of the task it could help. Nothing is broken but we could clean up quite a bit of code and make things more clear.
@d0choa @mbdebian I just checked the code and there shouldn't be any issues with merging these two steps. I think I can start working on this item.
The merging is done and the code is located in the branch 3099-merge-epmc
I ran a test of the ETL and the output that was generated is located in gs://open-targets-pre-data-releases/ricardo_24.04-2/output
@ireneisdoomed could you help me check that the output is correct and hasn't been affected by the change?
The changes in 3099-merge-epmc consist of moving the configuration for the epmc step into the literature one.
This affects 2 outputs:
- epmcCooccurrences. A dataset based on cooccurrences that we prepare for EPMC to ingest.
- evidence-files/epmc. The t/d evidence based on literature.
I've compared Ricardo's outputs (generated on 19/04) against the latest ETL run (generated on 23/04).
epmcCooccurrences:
- More records in ricardo_24.04-2 (5,762,312) than in 24.06dev (5,722,755).
- Some IDs in ricardo_24.04-2 are not part of the 24.06 set. The latter one, being generated later, should be a superset but it isn't. IDs that fall out are not recent; the articles date back to the nineties. Some example PMIDs: 10023798, 10022500, 10025924.
evidence/sourceId=europepmc:
- Example PMIDs: 10022500, 10029211, or 10028483.
Technically my understanding is that we shouldn't see any difference in these outputs. However, this is not straightforward to prove because the inputs to the literature step change between runs and are therefore not versioned. I think the numbers indicate something is off here; I'd suggest reviewing, @remo87. Ideally with test input data so you can reproduce.
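For reference, the superset check described above can be sketched in plain Python. This is only an illustration under assumptions: `compare_id_sets` is a hypothetical helper, the identifiers are held in memory as strings, and the toy data reuses the three example PMIDs quoted above plus made-up IDs. It is not the code that was actually used for the comparison.

```python
def compare_id_sets(old_ids, new_ids):
    """Summarise how the identifiers of two runs differ."""
    old, new = set(old_ids), set(new_ids)
    return {
        "old_count": len(old),
        "new_count": len(new),
        # IDs present in the earlier run but absent from the later one
        "missing_from_new": sorted(old - new),
        # True only if every ID of the earlier run is still present
        "new_is_superset": new >= old,
    }


# Toy data: the first three PMIDs are the examples quoted in this thread,
# the remaining IDs are made up for illustration.
earlier_run = ["10023798", "10022500", "10025924", "30000001"]
later_run = ["30000001", "30000002"]

report = compare_id_sets(earlier_run, later_run)
print(report["new_is_superset"], report["missing_from_new"])
```

On the real datasets the same idea would apply to the PMID column of each output, and listing the missing IDs makes it easier to spot patterns such as the publication year.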
Hi @ireneisdoomed, yesterday I ran the literature step with and without the epmc integrated. I ran the step with a subset of the literature inputs, ignoring the inputs from 2022. These inputs are located in gs://open-targets-pre-data-releases/ricardo-24.05/ml02/. The outputs for the literature step without epmc are located in gs://open-targets-pre-data-releases/ricardo-24.05/tmp/ and the outputs for the step with epmc integrated are located in gs://open-targets-pre-data-releases/ricardo-24.05/output/etl/parquet/literature
@remo87 can you do the same checks I did on both runs to see your changes don't have an impact? Thanks
Hi @ireneisdoomed, I updated the outputs by running the steps again with the same subset of inputs and copied them to gs://open-targets-pre-data-releases/ricardo-24.05/tmp. The outputs with the epmc included are in the epmc folder, and the outputs for the ETL that doesn't have the epmc included are in the noepmc folder. I compared both outputs and got the same counts in both cases: 17270 for epmcCooccurrences and 30275 for the evidences.
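Equal counts could be complemented by a row-level comparison. The sketch below is an assumption-laden illustration, not part of the ETL: `rows_match` is a hypothetical helper, rows are plain dictionaries with hashable values, and the column names `pmid` and `label` are invented for the example.

```python
from collections import Counter


def rows_match(rows_a, rows_b):
    """Order-insensitive comparison that also respects duplicate rows."""
    def key(row):
        # Sort the items so that column order does not matter
        return tuple(sorted(row.items()))

    return Counter(map(key, rows_a)) == Counter(map(key, rows_b))


# Toy rows with hypothetical columns; the values are made up.
with_epmc = [{"pmid": "1", "label": "a"}, {"pmid": "2", "label": "b"}]
without_epmc = [{"label": "b", "pmid": "2"}, {"pmid": "1", "label": "a"}]
same_count_other_rows = [{"pmid": "1", "label": "a"}, {"pmid": "3", "label": "c"}]

print(rows_match(with_epmc, without_epmc))
print(rows_match(with_epmc, same_count_other_rows))
```

The second comparison shows why this is stricter than a count: both inputs have two rows, yet their contents differ.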
Cool! Green flag from me then
As a developer I want to include the epmc step into the literature step, based on @d0choa's comment: "They are conceptually the same step"
Background
- The epmc step currently runs after the literature step.
- The evidence step runs after epmc, although it wasn't mentioned here.
Tasks
- Include epmc into literature.
- Include epmc in literature.scala as an inner step for literature.
Acceptance tests
How do we know the task is complete?