phac-nml / biohansel

Rapidly subtype microbial genomes using single-nucleotide variant (SNV) subtyping schemes
Apache License 2.0

Failing on custom TB dataset #35

Closed Takadonet closed 6 years ago

Takadonet commented 6 years ago

Fatal error: Exit code 1 ()
2018-04-30 17:39:08,306 DEBUG: Namespace(files=[], force=False, input_directory=None, input_fasta_genome_name=None, json=False, keep_tmp=False, low_cov_depth_freq=20, low_cov_warning=20, max_intermediate_tiles=0.05, max_kmer_freq=1000, max_missing_tiles=0.05, min_ambiguous_tiles=3, min_kmer_freq=8, output_simple_summary='tech_results.tab', output_summary='results.tab', output_tile_results='match_results.tab', paired_reads=[['180029859336_1.fastq', '180029859336_2.fastq']], scheme='TBhanselcoll2014withbov.txt', scheme_name=None, slow=False, threads=1, tmp_dir='/tmp', verbose=3) [in /Drives//_conda/envs/mulled-v1-e54006e9d4e461310040891496d1e8e6bb0b9afb65e2e47957fb378d2c44bf08/lib/python3.5/site-packages/bio_hansel/main.py:215]
2018-04-30 17:39:08,308 INFO: Serial single threaded run mode on 1 input genomes [in /deps/_conda/envs/mulled-v1-e54006e9d4e461310040891496d1e8e6bb0b9afb65e2e47957fb378d2c44bf08/lib/python3.5/site-packages/bio_hansel/subtyper.py:495]
2018-04-30 17:39:08,309 INFO: genome_name 180029859336 [in /deps/_conda/envs/mulled-v1-e54006e9d4e461310040891496d1e8e6bb0b9afb65e2e47957fb378d2c44bf08/lib/python3.5/site-packages/bio_hansel/subtyper.py:407]
2018-04-30 17:39:27,328 WARNING: No subtyping tile matches for input "['1800298__59336_1.fastq', '180029859336_2.fastq']" for scheme "TBhanselcoll2014withbov.txt" [in /deps/_conda/envs/mulled-v1-e54006e9d4e461310040891496d1e8e6bb0b9afb65e2e47957fb378d2c44bf08/lib/python3.5/site-packages/bio_hansel/subtyper.py:431]
2018-04-30 17:39:29,214 INFO: Wrote subtyping output summary to results.tab [in /deps/_conda/envs/mulled-v1-e54006e9d4e461310040891496d1e8e6bb0b9afb65e2e47957fb378d2c44bf08/lib/python3.5/site-packages/bio_hansel/main.py:280]
Traceback (most recent call last):
  File "/deps/_conda/envs/mulled-v1-e54006e9d4e461310040891496d1e8e6bb0b9afb65e2e47957fb378d2c44bf08/bin/hansel", line 11, in load_entry_point('bio-hansel==1.3.0', 'console_scripts', 'hansel')()
  File "/deps/_conda/envs/mulled-v1-e54006e9d4e461310040891496d1e8e6bb0b9afb65e2e47957fb378d2c44bf08/lib/python3.5/site-packages/bio_hansel/main.py", line 286, in main
    dfall = pd.concat(dfs) # type: pd.DataFrame
  File "/deps/_conda/envs/mulled-v1-e54006e9d4e461310040891496d1e8e6bb0b9afb65e2e47957fb378d2c44bf08/lib/python3.5/site-packages/pandas/core/reshape/concat.py", line 212, in concat
    copy=copy)
  File "/deps/_conda/envs/mulled-v1-e54006e9d4e461310040891496d1e8e6bb0b9afb65e2e47957fb378d2c44bf08/lib/python3.5/site-packages/pandas/core/reshape/concat.py", line 245, in __init__
    raise ValueError('No objects to concatenate')
ValueError: No objects to concatenate

peterk87 commented 6 years ago

It seems no results are generated, so I'll add a check that dfs actually contains something; if it doesn't, no tile results output file will be written. Is the output summary written to results.tab? If so, what does it contain?
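
A check of this kind could look like the sketch below. It assumes `dfs` is a list of per-genome pandas DataFrames, as the traceback's `pd.concat(dfs)` call suggests. The helper name `concat_tile_results` is hypothetical and is not taken from the bio_hansel source.

```python
from typing import List, Optional


def concat_tile_results(dfs: List) -> Optional["pandas.DataFrame"]:
    """Concatenate per-genome tile results, or return None if there are none.

    Hypothetical helper: it guards the pd.concat call that raised
    "ValueError: No objects to concatenate" in the traceback above.
    """
    # Drop missing or empty results before concatenating.
    non_empty = [df for df in dfs if df is not None and not df.empty]
    if not non_empty:
        # Nothing matched; the caller can skip writing the tile results file.
        return None
    # Imported here so the empty-input path does not depend on pandas.
    import pandas as pd

    return pd.concat(non_empty, ignore_index=True)


if __name__ == "__main__":
    result = concat_tile_results([])
    print("no tile results to write" if result is None else result)
```

With a guard like this, the caller can write the tile results file only when the return value is not `None`.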

What's the custom scheme TBhanselcoll2014withbov.txt? Is it in FASTA format? Does each FASTA header contain the tile position, negative control status and hierarchical subtype?

For example, a negative control tile at position 3187428 with subtype 2.2.3.1.1 would look like this:

>negative3187428-2.2.3.1.1
CTTTATCAGCGCGCAGTGTCCCATTCCATCATC
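
To make the header layout concrete, here is a minimal sketch that splits a header of this shape into position, negative control status and subtype. The regular expression and the names `TileHeader` and `parse_tile_header` are assumptions based only on the example above; they are not bio_hansel's own parser.

```python
import re
from typing import NamedTuple


class TileHeader(NamedTuple):
    position: int
    is_negative: bool
    subtype: str


# Assumed layout: optional ">" and "negative" prefix, position, "-", dotted subtype.
_HEADER_RE = re.compile(r"^>?(negative)?(\d+)-(\d+(?:\.\d+)*)$")


def parse_tile_header(header: str) -> TileHeader:
    """Parse a header such as '>negative3187428-2.2.3.1.1' (hypothetical helper)."""
    match = _HEADER_RE.match(header.strip())
    if match is None:
        raise ValueError(f"unrecognised tile header: {header!r}")
    negative, position, subtype = match.groups()
    return TileHeader(int(position), negative is not None, subtype)


if __name__ == "__main__":
    print(parse_tile_header(">negative3187428-2.2.3.1.1"))
```

A header that does not follow this layout raises `ValueError`, which is one quick way to spot a malformed custom scheme before running it.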

dankein commented 6 years ago

Hi Peter,

Those failures were from samples that I ran. I made the scheme for Mycobacterium tuberculosis from the set of SNPs described in: Coll F, McNerney R, Guerra-Assunção JA, Glynn JR, Perdigão J, Viveiros M, et al. A robust SNP barcode for typing Mycobacterium tuberculosis complex strains. Nat Commun. 2014;5: 4812. doi:10.1038/ncomms5812

It's in the proper format and works with M. tuberculosis samples. The failures come from running the scheme on sequences from other Mycobacteria, not M. tuberculosis; those samples likely don't contain any of the scheme's k-mers.

Here's a sample of the TBhanselcoll2014withbov.txt scheme. Let me know if you'd like the whole thing.

>615938-1
CCGGCCTGCTCTCCGAAGCACTGACGGATGCCG
>negative615938-1
CCGGCCTGCTCTCCGAGGCACTGACGGATGCCG
>4404247-1.1
ACGATCGTGGGATGCTAGTCTCAACGCAGACGC
>negative4404247-1.1
ACGATCGTGGGATGCTGGTCTCAACGCAGACGC
>3021283-1.1.1
CCACCTTGGGCTTGCGAGTCTACCTCGCGTGGA
>negative3021283-1.1.1
CCACCTTGGGCTTGCGGGTCTACCTCGCGTGGA
>3216553-1.1.1.1
CCCCCGTCGCCGTCTGAACCATAAGCCCCACCA
>negative3216553-1.1.1.1
CCCCCGTCGCCGTCTGGACCATAAGCCCCACCA
>2622402-1.1.2
CGACATCCTCGATACGAGCCCCCTCGCGGATTG
>negative2622402-1.1.2
CGACATCCTCGATACGGGCCCCCTCGCGGATTG
>1491275-1.1.3
ACGCGTCCTTCGGGAAATGCGCTGGGACCCAAT
>negative1491275-1.1.3
ACGCGTCCTTCGGGAAGTGCGCTGGGACCCAAT
>3479545-1.2.1
CCGCAGTTTCAGTCGCAGCCTTGACTATCTACG
>negative3479545-1.2.1
CCGCAGTTTCAGTCGCCGCCTTGACTATCTACG
>3470377-1.2.2
TGGCATCGTCATAGGCTTGCTGGCGGTTAAGGA
>negative3470377-1.2.2
TGGCATCGTCATAGGCCTGCTGGCGGTTAAGGA
>497491-2
AGGGCTGGTCGGCCATATCGGGCCCGACGATAT
>negative497491-2
AGGGCTGGTCGGCCATGTCGGGCCCGACGATAT
>1881090-2.1
GTGCCGCGCTGGCCGGTGCTGGTGCTGCGCTAC
>negative1881090-2.1
GTGCCGCGCTGGCCGGCGCTGGTGCTGCGCTAC
>2505085-2.2
CGGGCAGCGAGTCATCAGCCAACGATTGCGGCT
>negative2505085-2.2
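
As a quick sanity check on a sample like this, the sketch below compares a tile with its negative counterpart and reports where the two sequences differ. The function name `differing_offsets` is hypothetical, and only the first two pairs from the sample are included, so it says nothing about the rest of the scheme.

```python
from typing import List


def differing_offsets(tile: str, negative_tile: str) -> List[int]:
    """Return the 0-based offsets at which two equal-length tiles differ."""
    if len(tile) != len(negative_tile):
        raise ValueError("tiles must have the same length")
    return [i for i, (a, b) in enumerate(zip(tile, negative_tile)) if a != b]


# First two positive/negative pairs copied from the scheme sample above.
pairs = {
    "615938-1": (
        "CCGGCCTGCTCTCCGAAGCACTGACGGATGCCG",
        "CCGGCCTGCTCTCCGAGGCACTGACGGATGCCG",
    ),
    "4404247-1.1": (
        "ACGATCGTGGGATGCTAGTCTCAACGCAGACGC",
        "ACGATCGTGGGATGCTGGTCTCAACGCAGACGC",
    ),
}

if __name__ == "__main__":
    for name, (tile, negative_tile) in pairs.items():
        print(name, len(tile), differing_offsets(tile, negative_tile))
```

For these two pairs, each tile is 33 bases long and differs from its negative counterpart only at offset 16, the central base.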

mgopez commented 6 years ago

Original issue fixed in #36.

peterk87 commented 6 years ago

Hi @dankein

The issue should be fixed as of 880db7232541bd4d4e65e388c6da7fa44bf5f3b4 and available from PyPI as version 1.3.2 (https://pypi.org/project/bio-hansel/1.3.2/).

It should be updated on BioConda and Galaxy soon.

Thanks for letting us know about the issue!