Working through a dataset, I found that most of the resulting alignments only included 100K-200K sequence identifiers from the input dataset, even though most of my samples have >1M sequences. Unsure of what was going on, I tried running bowtie2 manually (according to the command call here). That's when I noticed my OS was killing bowtie2 with signal 9.

After this happened, I checked the exit code (using `echo $?`) and saw error code 1. As best as I can tell, there is nowhere in the SHOGUN code that checks the exit code of bowtie2. While it is being returned here:

https://github.com/knights-lab/SHOGUN/blob/24109b719463e7797af116b819e1adf89e38815f/shogun/aligners/bowtie2_aligner.py#L32-L38

there are no checks for it in the align method calls:

https://github.com/knights-lab/SHOGUN/blob/24109b719463e7797af116b819e1adf89e38815f/shogun/__main__.py#L75
https://github.com/knights-lab/SHOGUN/blob/24109b719463e7797af116b819e1adf89e38815f/shogun/__main__.py#L78

The worst thing about this error is that, since SHOGUN won't fail or catch it, you can "successfully" process a dataset and generate incomplete contingency tables. The resulting SAM file is written to disk but is obviously incomplete; unfortunately, `shogun assign_taxonomy` doesn't know this, so it just processes the dataset as expected.

In my case, running on a 32GB system, my samples were missing around 60-80% of their reads.
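For reference, the same check I did with `echo $?` can be reproduced from Python by running the bowtie2 command and inspecting the return code. This is only a sketch: the index and FASTA paths are placeholders, and the flags mirror a typical bowtie2 call rather than SHOGUN's exact invocation.

```python
import subprocess

# Hypothetical paths -- substitute the database index and the query FASTA
# from the failing SHOGUN run. The flags mirror a typical bowtie2 call, not
# necessarily SHOGUN's exact command line.
cmd = ["bowtie2", "-x", "shogun_db", "-f", "-U", "combined_seqs.fna", "-S", "alignment.sam"]

proc = subprocess.run(cmd)

# Anything other than 0 means bowtie2 did not finish cleanly; in the run
# described above the exit code was 1 after the OS killed the aligner.
print("bowtie2 exit code:", proc.returncode)
if proc.returncode != 0:
    raise RuntimeError(f"bowtie2 failed with exit code {proc.returncode}")
```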
Good call and a thorough investigation. This is indeed a nightmare situation where there is a silent bug. We should open up a PR and handle exit codes from the aligners.
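As a starting point for that PR, here is a minimal sketch of what the check could look like at the call site. It assumes, following the linked lines, that the aligner's `align()` method returns the bowtie2 exit code; the function and parameter names below are placeholders, not SHOGUN's actual API.

```python
import sys

def run_alignment(aligner, input_fasta, outdir):
    """Run an aligner and fail loudly if the underlying process did not exit cleanly.

    `aligner` is assumed to expose align(input_fasta, outdir) returning the
    subprocess exit code, as the linked bowtie2_aligner.py appears to do.
    """
    exit_code = aligner.align(input_fasta, outdir)
    if exit_code != 0:
        # Abort instead of letting a truncated SAM file flow into
        # `shogun assign_taxonomy` and the contingency tables.
        sys.exit(f"{type(aligner).__name__} exited with code {exit_code}; "
                 "the alignment output is likely incomplete.")
    return exit_code
```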