opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

PIS stalls during openFDA step #3212

Closed jdhayhurst closed 4 months ago

jdhayhurst commented 5 months ago

Describe the bug: During the openFDA step, the log shows that one of the ZIP files (not always the same one) is being removed, but the process then hangs indefinitely.

Observed behaviour: Example logs:

Jan 29 11:05:08 pis-unset-pisvm-0rg5haj1 google_metadata_script_runner[918]: startup-script: 2024-01-29 11:05:08,708 modules.common.DownloadResource DEBUG - Start to download
Jan 29 11:05:08 pis-unset-pisvm-0rg5haj1 google_metadata_script_runner[918]: startup-script:         https://download.open.fda.gov/drug/event/2010q3/drug-event-0010-of-0010.json.zip
Jan 29 11:05:08 pis-unset-pisvm-0rg5haj1 google_metadata_script_runner[918]: startup-script: 2024-01-29 11:05:08,708 modules.common.DownloadResource INFO - [DOWNLOAD] BEGIN: 'https://download.open.fda.gov/drug/event/2010q3/drug-event-0010-of-0010.json.zip' -> '/srv/output/prod/fda-inputs/45a828ee-386d-4d94-88fe-02da46eb3ceb.zip'
Jan 29 11:05:12 pis-unset-pisvm-0rg5haj1 google_metadata_script_runner[918]: startup-script: 2024-01-29 11:05:12,418 modules.common.DownloadResource INFO - [DOWNLOAD] END, (1 attempt(s)): 'https://download.open.fda.gov/drug/event/2010q3/drug-event-0010-of-0010.json.zip' -> '/srv/output/prod/fda-inputs/45a828ee-386d-4d94-88fe-02da46eb3ceb.zip'
Jan 29 11:05:12 pis-unset-pisvm-0rg5haj1 google_metadata_script_runner[918]: startup-script: 2024-01-29 11:05:12,418 plugins.helpers.OpenfdaHelper DEBUG - Inflating event file 'drug-event-0010-of-0010.json', CRC '346048179'
Jan 29 11:05:19 pis-unset-pisvm-0rg5haj1 google_metadata_script_runner[918]: startup-script: 2024-01-29 11:05:19,649 plugins.helpers.OpenfdaHelper DEBUG - Removing processed ZIP file '/srv/output/prod/fda-inputs/45a828ee-386d-4d94-88fe-02da46eb3ceb.zip'

It will remain like this indefinitely and never continue.

Expected behaviour: The file should be removed and processing should continue; if there is an error, an exception should be raised.

To Reproduce: Steps to reproduce the behaviour:

  1. Create a VM on GCP with machine type e2-standard-4 and a 500 GB boot disk.
  2. Install dependencies:
    sudo apt-get update
    sudo apt-get install ca-certificates curl gnupg
    sudo install -m 0755 -d /etc/apt/keyrings
    curl -fsSL https://download.docker.com/linux/debian/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
    sudo chmod a+r /etc/apt/keyrings/docker.gpg
    echo \
    "deb [arch="$(dpkg --print-architecture)" signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/debian \
    "$(. /etc/os-release && echo "$VERSION_CODENAME")" stable" | \
    sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
    sudo apt-get update
    sudo apt-get install docker-ce tmux
    sudo usermod -a -G docker $USER
    newgrp docker
    mkdir -m 775 -p opentargets/credentials
    mkdir -m 775 -p opentargets/output
    mkdir -m 775 -p opentargets/log
    gsutil cp gs://open-targets-ops/credentials/pis-service_account.json opentargets/credentials/open-targets-gac.json
  3. Run the container:
    tmux new -s pisrun
    # Set image and release versions
    IMAGE_TAG="release_23-12"
    RELEASE_VERSION="devpis"
    docker run \
      -v /home/$USER/opentargets/output:/srv/output \
      -v /home/$USER/opentargets/log:/usr/src/app/log \
      -v /home/$USER/opentargets/credentials/open-targets-gac.json:/srv/credentials/open-targets-gac.json \
      quay.io/opentargets/platform-input-support:$IMAGE_TAG \
      -o /srv/output --log-level=DEBUG \
      -gkey /srv/credentials/open-targets-gac.json \
      -gb open-targets-pre-data-releases/$RELEASE_VERSION/input \
      -steps openfda

jdhayhurst commented 5 months ago

Narrowed this down to the multiprocessing pool. It was configured with twice as many workers as CPUs, which is reasonable given that this is a predominantly IO-bound process. The issue may lie with the queues the pool uses: if the data are too big, there are pickling issues (https://bugs.python.org/issue8237). Either way, I tried reducing the number of workers to the number of CPUs, and that worked. Knowing that this approach has overheads, I also tested a multithreaded approach, but saw worse performance. This will be resolved with the merging of https://github.com/opentargets/platform-input-support/tree/3195_automate_running
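
For anyone hitting something similar, here is a minimal sketch of the workaround described above: size the pool to the CPU count rather than twice that. It is not the actual PIS code. The names `worker_count`, `process_event_file` and `run_pool` are hypothetical, the CRC32 call merely stands in for the real per-file work, and the `timeout` is an extra assumption added so that a stalled pool raises `multiprocessing.TimeoutError` instead of hanging silently.

```python
import os
import zlib
from multiprocessing import Pool


def worker_count(cpus=None, per_cpu=1):
    """Return the pool size: per_cpu workers for each CPU, at least 1.

    per_cpu=2 mirrors the original "2x CPUs" setting; per_cpu=1 is the
    reduced setting that avoided the stall in this report.
    """
    cpus = cpus if cpus is not None else (os.cpu_count() or 1)
    return max(1, cpus * per_cpu)


def process_event_file(payload):
    """Hypothetical stand-in for the per-file work: CRC32 of the bytes."""
    return zlib.crc32(payload)


def run_pool(payloads, workers=None, timeout=60):
    """Process payloads in a process pool, failing loudly on a stall."""
    workers = workers or worker_count()
    with Pool(processes=workers) as pool:
        pending = pool.map_async(process_event_file, payloads)
        # get() raises multiprocessing.TimeoutError if the workers do not
        # finish within `timeout` seconds, rather than blocking forever.
        return pending.get(timeout=timeout)


if __name__ == "__main__":
    print(run_pool([b"first file", b"second file", b"third file"]))
```

Keeping the return values small (here a single integer per file) also limits how much data has to be pickled back through the pool's result queue, which is the area the linked Python issue points at.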