theiagen / public_health_bioinformatics

Bioinformatics workflows for genomic characterization, submission preparation, and genomic epidemiology of pathogens of public health concern.
GNU General Public License v3.0
34 stars 16 forks source link

theiacov_fasta wf logic change for flu #305

Closed kapsakcj closed 6 months ago

kapsakcj commented 6 months ago

This PR closes #304

๐Ÿ—‘๏ธ This dev branch should be deleted after merging to main.

:brain: Aim, Context and Functionality

Came across this bug during 1.3.0 release validation testing. This impacts the TheiaCoV_FASTA_PHB workflow only.

Mpox samples were accidentally run through the abricate task which is an Influenza-specific task. Workflow succeeded, but running this task is inappropriate for non-Flu samples.

The workflow code change is so that abricate is only run if user sets organism to flu AND flu_subtype is not set by user or unknown

:hammer_and_wrench: Impacted Workflows/Tasks & Changes Being Made

This will affect the behavior of the workflow(s) even if users donโ€™t change any workflow inputs relative to the last version: Yes

Running this workflow on different occasions could result in different results, e.g. due to use of a live database, "latest" docker image, or stochastic data processing : No

:clipboard: Workflow/Task Step Changes

๐Ÿ”„ Data Processing

Docker/software or software versions changed: N/A

Databases or database versions changed: N/A

Data processing/commands changed: N/A

File processing changed: N/A

Compute resources changed: N/A

โžก๏ธ Inputs

N/A

โฌ…๏ธ Outputs

abricate task outputs will NOT be outputted for non-Flu samples (as desired)

:test_tube: Testing

Test Dataset

Using the PHB validation dataset for testing. Includes sars-cov-2, Flu, Mpox,

Commandline Testing with MiniWDL or Cromwell (optional)

Terra Testing

Suggested Scenarios for Reviewer to Test

Test TheiaCov_FASTA on Flu and non-Flu samples (sars-cov-2, Mpox, etc.)

Theiagen Version Release Testing (optional)

-Will changes require functional or validation testing (checking outputs etc) during the release? functional

:microscope: Final Developer Checklist

๐ŸŽฏ Reviewer Checklist

๐Ÿ—‚๏ธ Associated Documentation (to be completed by Theiagen developer)

kapsakcj commented 6 months ago

theiacov_fasta being run via cromwell is failing both in the GHActions CI as well as in my local devVM.

Failing on the pangolin task for sars-cov-2, which is unexpected since this has worked previously.

tests on terra were successful, not sure why it's failing in the CI but we need to investigate prior to merging

Will follow up after I've done more digging ๐Ÿ‘€

kapsakcj commented 6 months ago

I traced the issue down to an issue with pangolin when it is running scorpio as part of it's own snakemake workflow. Here's the relevant error from the scorpio.log, which is a temporary file that I could only view after re-running pangolin with the --notemp option to keep all intermediate files and logs. After digging I found this, which basically says that the OS is having an issue with really long paths/filenames. I think it's the multiprocessing python package that is complaining, specifically?

INFO: Found file /opt/conda/envs/pangolin/lib/python3.8/site-packages/constellations/definitions/cXE.json for constellation Omicron (XE-like) containing 6 defining mutations
INFO: Rules {'default': {'min_alt': 3, 'max_ref': 0}}
Process SyncManager-1:
Traceback (most recent call last):
  File "/opt/conda/envs/pangolin/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/conda/envs/pangolin/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/envs/pangolin/lib/python3.8/multiprocessing/managers.py", line 608, in _run_server
    server = cls._Server(registry, address, authkey, serializer)
  File "/opt/conda/envs/pangolin/lib/python3.8/multiprocessing/managers.py", line 154, in __init__
    self.listener = Listener(address=address, backlog=16)
  File "/opt/conda/envs/pangolin/lib/python3.8/multiprocessing/connection.py", line 448, in __init__
    self._listener = SocketListener(address, family, backlog)
  File "/opt/conda/envs/pangolin/lib/python3.8/multiprocessing/connection.py", line 591, in __init__
    self._socket.bind(address)
OSError: AF_UNIX path too long
Traceback (most recent call last):
  File "/opt/conda/envs/pangolin/bin/scorpio", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/envs/pangolin/lib/python3.8/site-packages/scorpio/__main__.py", line 272, in main
    args.func(args)
  File "/opt/conda/envs/pangolin/lib/python3.8/site-packages/scorpio/subcommands/classify.py", line 7, in run
    classify_constellations(options.input,
  File "/opt/conda/envs/pangolin/lib/python3.8/site-packages/scorpio/scripts/type_constellations.py", line 763, in classify_constellations
    manager = mp.Manager()
  File "/opt/conda/envs/pangolin/lib/python3.8/multiprocessing/context.py", line 57, in Manager
    m.start()
  File "/opt/conda/envs/pangolin/lib/python3.8/multiprocessing/managers.py", line 583, in start
    self._address = reader.recv()
  File "/opt/conda/envs/pangolin/lib/python3.8/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/opt/conda/envs/pangolin/lib/python3.8/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/opt/conda/envs/pangolin/lib/python3.8/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError

Proposed solution 1 (fast, only change CI)

Although this is not the ideal solution, we can set up the CI to skip scorpio which should resolve this error, but the downside is that it doesn't completely reproduce the theiacov_fasta workflow defaults (really all theiacov workflows that call pangolin). It will allow the tests to pass which is good, but we need to ensure that pangolin runs correctly/without error in Terra on sars-cov-2 samples where scorpio is run by default.

Proposed solution 2 (take a little bit longer, require more testing & validation in Terra, change pangolin WDL task file)

Another option is to use edit the WDL task for pangolin and always use the option --tempdir . so that all temporary files are written to the PWD . , instead of $TMP which often has long paths given by cromwell. For example, here's an automatically chosen directory for pangolin when --tempdir option from pangolin is NOT used:

resources: tmpdir=/cromwell-executions/theiacov_fasta/6060a451-4e8d-4ad1-8169-b76f5a3abbee/call-pangolin4/tmp.f21e9e7a`
kapsakcj commented 6 months ago

I implemented my proposed solution 1 to get the CI to pass in my last commit: 59bdce8

EDIT: also needed the 2 subsequent commits to get CI to pass completely:

Can un-do this change if necessary.

But the longer-term solution 2 may be the better option, but I would like more thorough testing before bringing a change like that into main

kapsakcj commented 6 months ago

Seems like others using cromwell have experienced this issue in the past. Specifically with python tools that use the mutliprocessing python package https://github.com/broadinstitute/cromwell/issues/3647

kapsakcj commented 6 months ago

Option 2 could be implemented with these changes (see below). I was able to circumvent the "pangolin/scorpio/python multiprocessing PATH too long" error with these changes.

It allowed my local cromwell testing on theiacov_fasta workflow to run successfully, without having to use the pangolin --skip-scorpio option.

## NEW LINE BELOW ##
export TMPDIR=/tmp

    pangolin "~{fasta}" \
       ~{'--analysis-mode ' + analysis_mode} \
       ~{'--min-length ' + min_length} \
       ~{'--max-ambig ' + max_ambig} \
       ~{true='--expanded-lineage' false='' expanded_lineage} \
       ~{true='--skip-scorpio' false='' skip_scorpio} \
       ~{true='--skip-designation-cache' false='' skip_designation_cache} \
       --outfile "~{samplename}.pangolin_report.csv" \
       --verbose \
       --tempdir /tmp \ ## THIS LINE IS ALSO NEW
       ~{pangolin_arguments}
cimendes commented 6 months ago

Given our discussions outside of GitHub, I'm going to approve the PR as is, with the changes reflected in the CI and not the pangolin task itself ๐Ÿ‘๐Ÿป