phac-nml / irida

Canada’s Integrated Rapid Infectious Disease Analysis Platform for Genomic Epidemiology
https://irida.ca
Apache License 2.0

Extract RefSeqMasher Pipeline into plugin #1016

Open innovate-invent opened 3 years ago

innovate-invent commented 3 years ago

What needs changed?

RefSeqMasher Pipeline needs to be pulled out into a plugin. The workflow needs to be imported and then re-exported from Galaxy 21.01 or later as it currently does not function with recent versions.

https://github.com/phac-nml/irida/tree/development/src/main/resources/ca/corefacility/bioinformatics/irida/model/workflow/analysis/type/workflows/RefSeqMasherOnPairedReads/0.1

innovate-invent commented 3 years ago

@apetkau you had asked for more information about the issue I am having with this pipeline.

I am getting the following error:

2021-06-17 19:19:12,395 INFO: Grouped 2 fastqs into 1 groups [in /usr/local/lib/python3.7/site-packages/refseq_masher/utils.py:174]
2021-06-17 19:19:12,395 INFO: Collected 0 FASTA inputs and 1 read sets [in /usr/local/lib/python3.7/site-packages/refseq_masher/utils.py:185]
2021-06-17 19:19:12,395 INFO: Running Mash Screen with NCBI RefSeq sketch database against sample "SRR3028776" with inputs: ['SRR3028776_1.fastq', 'SRR3028776_2.fastq'] [in /usr/local/lib/python3.7/site-packages/refseq_masher/mash/screen.py:44]
Loading /usr/local/lib/python3.7/site-packages/refseq_masher/data/RefSeqSketches.msh...
Traceback (most recent call last):
  File "/usr/local/bin/refseq_masher", line 10, in <module>
    sys.exit(cli())
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/refseq_masher/cli.py", line 136, in contains
    parallelism=parallelism)
  File "/usr/local/lib/python3.7/site-packages/refseq_masher/mash/screen.py", line 46, in vs_refseq
    df = mash_screen_output_to_dataframe(stdout)
  File "/usr/local/lib/python3.7/site-packages/refseq_masher/mash/parser.py", line 117, in mash_screen_output_to_dataframe
    df = pd.read_table(StringIO(mash_out))
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 685, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 457, in _read
    parser = TextFileReader(fp_or_buf, **kwds)
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 895, in __init__
    self._make_engine(self.engine)
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 1135, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 1917, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 545, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file

apetkau commented 3 years ago

Thanks @innovate-invent. I am still not sure why this is causing an issue, since it does work for me.

However, I do notice it referring to Python in /usr/local/lib/python3.7. Is it using a system Python? Or are you using a Singularity container? That could be one place to look into, since I am using a conda environment to install the software.

innovate-invent commented 3 years ago

I am running it in docker using https://quay.io/repository/biocontainers/refseq_masher

I think it is this issue: https://github.com/phac-nml/refseq_masher/issues/2

Looks like the Galaxy wrapper points at 0.1.1 but 0.1.2 is available.

apetkau commented 3 years ago

Okay, thanks.

Do you know which Docker container tag you are using (https://quay.io/repository/biocontainers/refseq_masher?tab=tags)? I suspect it's 0.1.1--py_2 since that's the tag where I see it using Python 3.7.

Also, what do the input fastq files for SRR3028776 look like? As in, are they very small datasets for testing, or are they full-sized fastq files? The error you are getting is EmptyDataError, so it may just be that the fastq files you are testing are too small and produce no results.
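
For reference, the final frames of the traceback can be reproduced outside the pipeline: pandas raises EmptyDataError whenever the text handed to `pd.read_table` is empty. The sketch below is only an illustration, not refseq_masher's code; `parse_mash_output` is a hypothetical helper name, and the sample line is made-up tab-separated data rather than real Mash Screen output.

```python
from io import StringIO

import pandas as pd
from pandas.errors import EmptyDataError


def parse_mash_output(mash_out: str) -> pd.DataFrame:
    """Parse tab-separated text; return an empty DataFrame if there is nothing to parse.

    Hypothetical helper for illustration only, not part of refseq_masher.
    """
    try:
        # header=None: treat the first line as data, not as column names
        return pd.read_table(StringIO(mash_out), header=None)
    except EmptyDataError:
        # Same exception as in the traceback: the input contained no columns
        return pd.DataFrame()


# Empty stdout (the failing case) versus one made-up result line
empty = parse_mash_output("")
hit = parse_mash_output("0.99\t990/1000\t12\t0\tgenome.fna\n")
```

An empty result here only says that the parser received no text; it does not say whether the tool found no matches or failed to produce output at all.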

innovate-invent commented 3 years ago

quay.io/biocontainers/refseq_masher:0.1.1--py_2

The input fastq files are 200 MB each and were used to test earlier versions of IRIDA.

innovate-invent commented 3 years ago

Looks like the issue was that the tool wrapper version was bumped but not the package version: https://github.com/phac-nml/galaxy_tools/pull/213
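
A mismatch of this kind can be spotted by comparing the `version` attribute on a Galaxy tool's `<tool>` element with the `version` of its package `<requirement>` entries. The following is a rough sketch under stated assumptions: the XML is a hypothetical stand-in, not the actual refseq_masher wrapper, `version_mismatches` is an invented helper name, and it assumes the wrapper version is meant to track the package version, which is a common convention but not a rule.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper XML for illustration; NOT the real refseq_masher wrapper.
TOOL_XML = """
<tool id="example_tool" name="Example" version="0.1.2">
  <requirements>
    <requirement type="package" version="0.1.1">refseq_masher</requirement>
  </requirements>
</tool>
"""


def version_mismatches(tool_xml: str) -> list:
    """Return (package, package_version, tool_version) for each package
    requirement whose version differs from the tool wrapper version."""
    root = ET.fromstring(tool_xml.strip())
    # Drop a build suffix such as "+galaxy0" before comparing
    tool_version = root.get("version", "").split("+")[0]
    mismatches = []
    for req in root.iter("requirement"):
        if req.get("type") == "package" and req.get("version") != tool_version:
            name = (req.text or "").strip()
            mismatches.append((name, req.get("version"), tool_version))
    return mismatches


result = version_mismatches(TOOL_XML)
```

With the stand-in XML above, `result` holds one entry for `refseq_masher`, pairing the pinned package version with the wrapper version.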