Merge `epmc` into the `literature` step

ehsanb29 commented 1 year ago

As a developer I want to merge the epmc step into the literature step, based on @d0choa's comment here: "They are conceptually the same step"

Background

The epmc step currently runs after the literature step. The evidence step runs after epmc, although it wasn't mentioned here.

Tasks

[x] Update the reference.conf file to include epmc in literature
[x] Update references to the epmc configuration in epmc.scala
[x] Add epmc in literature.scala here as an inner step of literature.
[x] Check if we need to change anything in API

Acceptance tests

How do we know the task is complete?

When the tests are successful.
When the Data team has approved the outputs.

prashantuniyal02 commented 1 year ago

depends on: #3111

mbdebian commented 7 months ago

@remo87 , would you mind checking with @d0choa on whether this is doable for the next release? Thanks!

d0choa commented 7 months ago

This is a refactor. These are 2 steps in the ETL that are virtually the same thing. They process the literature inputs and produce different outputs. In the past, one of them was in a different repo that was eventually migrated to the same repo but never harmonised with the existing codebase. At the moment, they have 2 configuration objects, 2 steps, etc., but as said above they are conceptually the same thing.

@remo87 if you can spend a little bit of time evaluating the complexity of the task, it would help. Nothing is broken, but we could clean up quite a bit of code and make things clearer.

remo87 commented 7 months ago

@d0choa @mbdebian I just checked the code and there shouldn't be any issues with merging these two steps. I think I can start working on this item.

remo87 commented 6 months ago

The merge is done and the code is in the branch 3099-merge-epmc. I ran a test of the ETL, and the generated output is located in gs://open-targets-pre-data-releases/ricardo_24.04-2/output. @ireneisdoomed could you help me check that the output is correct and hasn't been affected by the change?

ireneisdoomed commented 6 months ago

The changes in 3099-merge-epmc consist of moving the configuration for the epmc step into the literature one.

This affects 2 outputs:

epmcCooccurrences. A dataset based on cooccurrences that we prepare for EPMC to ingest.
evidence-files/epmc. The t/d evidence based on literature.

I've compared Ricardo's outputs (generated on 19/04) against the latest ETL run (generated on 23/04).

epmcCooccurrences:
- Despite being an earlier run (and therefore processing fewer full texts), there are more IDs in ricardo_24.04-2 (5,762,312) than in 24.06dev (5,722,755)
- Another suspicious thing is that ~3% of the IDs in ricardo_24.04-2 are not part of the 24.06 set. The latter one, being generated later, should be a superset but it isn't. IDs that fall out are not recent, articles date back to the nineties. Some example PMIDs: 10023798, 10022500, 10025924.
evidence/sourceId=europepmc:
- Total counts here make more sense. 24.06 has 11,298,386 evidence and ricardo_24.04 has 11,274,540.
- However, there are more literature IDs in ricardo_24.04 (2,912,018 vs 2,879,759), which is, again, odd. And again the difference includes older IDs like 10022500, 10029211, or 10028483.
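
A minimal sketch of the superset check described above, using plain Python sets. The function name and the toy IDs are illustrative assumptions, not taken from the ETL codebase or from the real outputs; on the actual datasets the same logic would be applied to the distinct ID columns of each run.

```python
def compare_id_sets(earlier_ids, later_ids):
    """Summarise how the ID sets of two runs relate to each other."""
    earlier, later = set(earlier_ids), set(later_ids)
    # IDs present in the earlier run but absent from the later one.
    missing_from_later = earlier - later
    return {
        "earlier_count": len(earlier),
        "later_count": len(later),
        "missing_from_later": sorted(missing_from_later),
        "pct_missing": 100 * len(missing_from_later) / len(earlier) if earlier else 0.0,
        # A later run over growing inputs is expected to contain every earlier ID.
        "later_is_superset": later >= earlier,
    }


# Toy IDs only (hypothetical placeholders, not real PMIDs).
earlier_run = ["id1", "id2", "id3", "id4"]
later_run = ["id2", "id3", "id4", "id5"]

report = compare_id_sets(earlier_run, later_run)
print(report)
```

A non-empty `missing_from_later` list is the signal flagged in the comparison: the later run is not a superset of the earlier one.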

Technically, my understanding is that we shouldn't see any difference in these outputs. However, this is not straightforward to prove because the inputs to the literature step change between runs and are not versioned. I think the numbers indicate something is off here, so I'd suggest reviewing this, @remo87. Ideally with test input data so you can reproduce.

remo87 commented 6 months ago

Hi @ireneisdoomed, yesterday I ran the literature step with and without epmc integrated. I ran the step with a subset of the literature inputs, ignoring the inputs from 2022. These inputs are located in gs://open-targets-pre-data-releases/ricardo-24.05/ml02/. The outputs for the literature step without epmc are located in gs://open-targets-pre-data-releases/ricardo-24.05/tmp/ and the outputs for the step with epmc integrated are located in gs://open-targets-pre-data-releases/ricardo-24.05/output/etl/parquet/literature

ireneisdoomed commented 6 months ago

@remo87 can you run the same checks I did on both runs to confirm your changes don't have an impact? Thanks

remo87 commented 5 months ago

Hi @ireneisdoomed, I re-ran the steps using the same subset of inputs and copied the updated outputs to gs://open-targets-pre-data-releases/ricardo-24.05/tmp. The outputs with epmc included are in the epmc folder, and the outputs for the ETL without epmc are in the noepmc folder. I compared both outputs and got the same counts in both cases: 17270 for epmcCooccurrences and 30275 for evidence.
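
Matching counts are a necessary condition for the two runs being equivalent. A stricter check, sketched here under assumptions, compares the rows themselves regardless of order. The function name and the toy rows are hypothetical and do not reflect the real output schema.

```python
from collections import Counter


def same_rows(rows_a, rows_b):
    """True when both outputs hold the same rows with the same multiplicities, in any order."""
    return Counter(rows_a) == Counter(rows_b)


# Toy rows only (hypothetical fields, not the real schema).
with_epmc = [("p1", "t1", "d1"), ("p2", "t2", "d2")]
without_epmc = [("p2", "t2", "d2"), ("p1", "t1", "d1")]
same_count_different_rows = [("p1", "t1", "d1"), ("p3", "t3", "d3")]

print(same_rows(with_epmc, without_epmc))
print(same_rows(with_epmc, same_count_different_rows))
```

The third toy dataset has the same row count as the first but different content, which is the case a count-only comparison would miss.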

ireneisdoomed commented 5 months ago

Cool! Green flag from me then
