phac-nml / refseq_masher

Mash MinHash search your nucleotide sequences against a NCBI RefSeq genomes database
Apache License 2.0
39 stars 4 forks source link

Crashes when there are very few input reads #3

Closed emarinier closed 3 years ago

emarinier commented 3 years ago

I'm getting a crash when I run RefSeq Masher on very few input reads.

I can run Mash on the command line using my data and the Mash sketch provided from RefSeq Masher (the output is the same with or without the -w flag):

> mash screen -w -p 4 RefSeqSketches.msh tiny.fastq

Loading RefSeqSketches.msh...
   4669418 distinct hashes.
Streaming from tiny.fastq...
   Estimated distinct k-mers in mixture: 68
Summing shared...
Reallocating to winners...
Computing coverage medians...
Writing output...
0.687656    1/400   3   6.33297e-06 ./rcn/refseq-NC-765187-PRJNA50279-.-.-.-Chimaera_fulva.fna  
0.689638    1/382   2   6.04799e-06 ./rcn/refseq-NZ-670-PRJNA224116-SAMN03349604-GCF_000960665.1-unnamed-Vibrio_parahaemolyticus.fna

However, when I run RefSeq Masher:

> refseq_masher contains tiny.fastq

Loading miniconda3/envs/proksee-dev/lib/python3.7/site-packages/refseq_masher/data/RefSeqSketches.msh...
   4669418 distinct hashes.
Streaming from tiny.fastq...
   Estimated distinct k-mers in mixture: 68
Summing shared...
Computing coverage medians...
Writing output...
Traceback (most recent call last):
  File "miniconda3/envs/proksee-dev/bin/refseq_masher", line 10, in <module>
    sys.exit(cli())
  File "miniconda3/envs/proksee-dev/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "miniconda3/envs/proksee-dev/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "miniconda3/envs/proksee-dev/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "miniconda3/envs/proksee-dev/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "miniconda3/envs/proksee-dev/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "miniconda3/envs/proksee-dev/lib/python3.7/site-packages/refseq_masher/cli.py", line 136, in contains
    parallelism=parallelism)
  File "miniconda3/envs/proksee-dev/lib/python3.7/site-packages/refseq_masher/mash/screen.py", line 46, in vs_refseq
    df = mash_screen_output_to_dataframe(stdout)
  File "miniconda3/envs/proksee-dev/lib/python3.7/site-packages/refseq_masher/mash/parser.py", line 117, in mash_screen_output_to_dataframe
    df = pd.read_table(StringIO(mash_out))
  File "miniconda3/envs/proksee-dev/lib/python3.7/site-packages/pandas/io/parsers.py", line 765, in read_table
    return read_csv(**locals())
  File "miniconda3/envs/proksee-dev/lib/python3.7/site-packages/pandas/io/parsers.py", line 686, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "miniconda3/envs/proksee-dev/lib/python3.7/site-packages/pandas/io/parsers.py", line 452, in _read
    parser = TextFileReader(fp_or_buf, **kwds)
  File "miniconda3/envs/proksee-dev/lib/python3.7/site-packages/pandas/io/parsers.py", line 946, in __init__
    self._make_engine(self.engine)
  File "miniconda3/envs/proksee-dev/lib/python3.7/site-packages/pandas/io/parsers.py", line 1178, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "miniconda3/envs/proksee-dev/lib/python3.7/site-packages/pandas/io/parsers.py", line 2008, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 540, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file

I cannot upload my mock reads (tiny.fastq), so I will post them as text here:

@READ1
ACGTAACCGGTTAAACCCGGGTTTAAAACCCCGGGGTTTTAAAAACCCCCGGGGGTTTTTAAAAAACCCCCCGGGGGGTTTTTTAAAAAAACCCCCCCGGG
+
CCCCCCCCCCCCCCCCCCCCC;CCCCCCCCCCCCCC;CCCCCCCCCCCCCCCCC-CCCCCCCCCCCCCCCCCCCCCCCCCC;CCCCCCCCCCCCC-CCCCC
@READ2
ACGTAACCGGTTAAACCCGGGTTTAAAACCCCGGGGTTTTAAAAACCCCCGGGGGTTTTTAAAAAACCCCCCGGGGGGTTTTTTAAAAAAACCCCCCCG
+
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
@READ3
ACGTAACCGGTTAAACCCGGGTTTAAAACCCCGGGGTTTTAAAAACCCCCGGGGGTTTTTAAAAAACCCCCCGGGGGGTTTTTTAAAAAAACCCCCCCGGG
+
CCCCCCCCCCCCCCCCCCCCCCC;CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC;-CCCCCCCCCCCCC;CCCCCCCC-CC;CCCCC
emarinier commented 3 years ago

It appears that when running Mash screen on the above test file with the same default values RefSeq Masher uses (mash screen -v 0.01 -p 1 -i 0.9), Mash screen reports no output. That is, the output file is empty. RefSeq Masher then tries to process this file with the assumption that there will be a certain number of columns in the Mash output, but there are no columns because the file is empty.

I've opened a pull request (#4) that handles empty Mash output by returning None instead of a dataframe is some places. I've added logic to check for None dataframes.

peterk87 commented 3 years ago

I think the winner-takes-all strategy (mash screen -w ...) was added to Mash screen soon after I wrote this tool and I was using certain options to filter out most of the spurious looking results. Those options could probably be replaced with the -w flag, but that might require some testing.

emarinier commented 3 years ago

Resolved in #4.