opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

The EuroPMC entity recognition pipeline outputs empty files #3198

Closed (DSuveges closed 7 months ago)

DSuveges commented 8 months ago

Since late September 2023, all output files from the abstract pipeline have been empty (0 bytes):

         0  2023-09-26T07:42:07Z  gs://otar025-epmc/ml02/abstract/2023_09_26/NMP_patch-25-09-2023-0.jsonl
         0  2023-09-27T07:41:44Z  gs://otar025-epmc/ml02/abstract/2023_09_27/NMP_patch-26-09-2023-0.jsonl
         0  2023-09-27T05:28:02Z  gs://otar025-epmc/ml02/abstract/2023_09_27/NMP_patch-26-09-2023-1.jsonl
         0  2023-09-28T09:38:31Z  gs://otar025-epmc/ml02/abstract/2023_09_28/NMP_patch-27-09-2023-0.jsonl
         0  2023-09-28T04:54:07Z  gs://otar025-epmc/ml02/abstract/2023_09_28/NMP_patch-27-09-2023-1.jsonl
         0  2023-09-29T08:12:30Z  gs://otar025-epmc/ml02/abstract/2023_09_29/NMP_patch-28-09-2023-0.jsonl
         0  2023-09-29T08:06:33Z  gs://otar025-epmc/ml02/abstract/2023_09_29/NMP_patch-28-09-2023-1.jsonl
         0  2023-09-29T05:35:51Z  gs://otar025-epmc/ml02/abstract/2023_09_29/NMP_patch-28-09-2023-2.jsonl
         0  2023-09-30T07:58:35Z  gs://otar025-epmc/ml02/abstract/2023_09_30/NMP_patch-29-09-2023-0.jsonl
         0  2023-09-30T06:28:55Z  gs://otar025-epmc/ml02/abstract/2023_09_30/NMP_patch-29-09-2023-1.jsonl
         0  2023-09-30T05:37:20Z  gs://otar025-epmc/ml02/abstract/2023_09_30/NMP_patch-29-09-2023-2.jsonl
         0  2023-10-01T06:24:35Z  gs://otar025-epmc/ml02/abstract/2023_10_01/NMP_patch-30-09-2023-0.jsonl
         0  2023-10-01T05:07:39Z  gs://otar025-epmc/ml02/abstract/2023_10_01/NMP_patch-30-09-2023-1.jsonl
         0  2023-10-02T06:18:34Z  gs://otar025-epmc/ml02/abstract/2023_10_02/NMP_patch-01-10-2023-0.jsonl
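
A listing in this shape can be scanned for zero-byte files automatically rather than by eye. The following is only a minimal sketch: it assumes each line has exactly three whitespace-separated fields (size, timestamp, URL) as in the listing above, and `find_empty_objects` is a hypothetical helper name, not part of the pipeline.

```python
from typing import Iterable, List, Tuple


def find_empty_objects(listing_lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (timestamp, url) for every zero-byte object in a long listing.

    Assumption: each data line looks like '<size>  <timestamp>  <url>',
    matching the listing shown in this issue.
    """
    empty = []
    for line in listing_lines:
        parts = line.split()
        # Skip blank lines and anything that is not a three-field data line
        # (for example a trailing summary line).
        if len(parts) != 3 or not parts[0].isdigit():
            continue
        size, timestamp, url = parts
        if int(size) == 0:
            empty.append((timestamp, url))
    return empty


if __name__ == "__main__":
    sample = [
        "         0  2023-09-26T07:42:07Z  gs://otar025-epmc/ml02/abstract/2023_09_26/NMP_patch-25-09-2023-0.jsonl",
        "         0  2023-10-02T06:18:34Z  gs://otar025-epmc/ml02/abstract/2023_10_02/NMP_patch-01-10-2023-0.jsonl",
    ]
    for timestamp, url in find_empty_objects(sample):
        print(f"EMPTY {timestamp} {url}")
```

The listing text could be piped into this from whatever command produced it; how the result is reported (Slack message, non-zero exit code) is left open here.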

Pipeline progress is tracked on Slack. However, nothing there indicated that the jobs were failing; the only way to notice was to check the file contents manually. One example job report says:

Successfully completed.

Resource usage summary:

CPU time : 13613.93 sec.
Max Memory : 77 MB
Average Memory : 68.19 MB
Total Requested Memory : 2048.00 MB
Delta Memory : 1971.00 MB
Max Swap : -
Max Processes : 4
Max Threads : 5
Run time : 14133 sec.
Turnaround time : 14136 sec.

The expected output file for this job is gs://otar025-epmc/ml02/abstract/2024_01_19/NMP_patch-18-01-2024-35.jsonl, but it has no content:

$ gsutil cat gs://otar025-epmc/ml02/abstract/2024_01_19/NMP_patch-18-01-2024-35.jsonl
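
Because the job report reads "Successfully completed" even when the output is empty, one option is a final validation step that makes the job fail when no records were written. The sketch below is an illustration only: it assumes the job writes a local JSONL file before uploading it, and `count_json_records` and `check_output` are hypothetical names, not functions from the actual pipeline.

```python
import json
import sys
from pathlib import Path


def count_json_records(path: Path) -> int:
    """Count the non-blank lines of a JSONL file, parsing each one as JSON."""
    count = 0
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue
            json.loads(line)  # raises json.JSONDecodeError on a malformed line
            count += 1
    return count


def check_output(path: Path) -> int:
    """Return 0 if the file holds at least one valid record, otherwise 1."""
    if not path.is_file():
        print(f"missing output file: {path}", file=sys.stderr)
        return 1
    try:
        records = count_json_records(path)
    except json.JSONDecodeError as err:
        print(f"malformed record in {path}: {err}", file=sys.stderr)
        return 1
    if records == 0:
        print(f"empty output file: {path}", file=sys.stderr)
        return 1
    print(f"{path}: {records} records")
    return 0


# Hypothetical use as the last step of a job script, so that the scheduler
# sees a non-zero exit code instead of reporting success:
#     sys.exit(check_output(Path("NMP_patch-output.jsonl")))
```

Returning an exit code rather than raising keeps the check easy to call as the last command of a job, where a non-zero status is what the scheduler and any Slack notification would pick up.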

@tsantosh7, could you please take a look? Also, please let us know if you need further details for the investigation. The full-text processing pipeline seems to work fine.

tsantosh7 commented 8 months ago

Hi @DSuveges, I just checked and there are indeed empty files. I don't know the reason yet, so this needs investigation, since the process runs without reporting any problem.

DSuveges commented 8 months ago

Thank you @tsantosh7 for jumping on the issue so quickly!

tsantosh7 commented 8 months ago

Hi @DSuveges

There was a bug in the abstract pipeline, which is now fixed. I have rerun the pipeline for all the empty directories and updated the daily pipeline. The results are now reflected in Google Cloud Storage. Could you please check and let me know whether it looks OK?

DSuveges commented 7 months ago

Thanks @tsantosh7, the data is coming in! We can close this ticket.