qiime2 / q2-dada2

QIIME 2 plugin wrapping DADA2
BSD 3-Clause "New" or "Revised" License
19 stars 36 forks source link

Segfault errors in 2017.12 plugin version #80

Closed benjjneb closed 6 years ago

benjjneb commented 6 years ago

Adding this as a tracking issue for a seemingly related group of errors that popped up after the plugin was updated to use the current version of the R package (1.6) in Q2 2017.12

https://forum.qiime2.org/t/dada2-error-caught-segfault/2322 https://forum.qiime2.org/t/dada2-exit-code-11/2333 https://forum.qiime2.org/t/dada2-errors-return-code-1-and-11/2380 https://forum.qiime2.org/t/dada2-11-exit-code-paired-end-reads/2360

In every case, these users are seeing the following:

Plugin error from dada2:
  An error was encountered while running DADA2 in R (return code -11), please inspect stdout and stderr to learn more.

And in the log file...

2) Learning Error Rates
Initializing error rates to maximum possible estimate.
Sample 1 - 19785 reads in 10633 unique sequences.
*** caught segfault ***
address 0x8, cause ‘memory not mapped’

This indicates the segfault error is happening the first time that the dada function in the R package is called, and that some other functions (such as filtering) are working OK.

Also in all cases reported thus far, the affected users are running OSX, although several different versions have been reported so far:

I am using macOS High Sierra Version 10.13.1
I am using OX X Yosemite Version 10.10.5
I have macOS High Sierra Version 10.13.2.
I am running High Sierra 10.13.1.
ebolyen commented 6 years ago

There is also Martin who has Debian 4.9.0-4-amd64.

I found this topic, which makes it look like Rcpp sets 0x8 as a null pointer, but not 100% sure. Is it possible the new struct fields on Raw, kmer8 and kord are getting freed too soon?

We've also seen that our test data doesn't have this issue on affected hosts so I'm thinking it isn't a linking/conda issue. Moving pictures seems to be enough to trigger it though, so there must be some kind of branching. Additionally we can't reproduce this on our machines.

benjjneb commented 6 years ago

There is also Martin who has Debian 4.9.0-4-amd64.

Good catch, so not just an OSX issue.

Additionally we can't reproduce this on our machines.

I can't on either of my Mac laptops either (using moving pictures). Also, on the R side I haven't yet seen any reports of this issue cropping up.

Question: Is there a way that one of the affected users could edit the Q2 installed dada2 R script? If so, it would be worth checking if turning off the explicit SSE code makes this go away.

ebolyen commented 6 years ago

Question: Is there a way that one of the affected users could edit the Q2 installed dada2 R script? If so, it would be worth checking if turning off the explicit SSE code makes this go away.

We could probably create some throw-away instructions to patch the included script. Should I put something like that together as a next-step? I'm guessing we want SSE=0 to make it infer (that's what I think the code does...)?

ebolyen commented 6 years ago

Also as more information, everyone seems to have AVX2 available on their host.

benjjneb commented 6 years ago

I'm guessing we want SSE=0 to make it infer (that's what I think the code does...)?

Yep, exactly. That will revert back to the non-explicit-SSE implementation used in 1.4.

Dunno if that is what this is, but definitely worth trying.

ebolyen commented 6 years ago

Sounds good, I'll put something together!

benjjneb commented 6 years ago

Sounds good, I'll put something together!

Let me know if that doesn't work. I'm definitely motivated to help fix this, but right now am feeling stuck without the ability to reproduce the issue.

ebolyen commented 6 years ago

@benjjneb, great news! We were able to get a hold of a dataset and can reproduce the issue locally. Also patching the script to set SSE=0 seems to fix the problem.

We're hoping to release a patch for q2-dada2 with these fixes. Do you think it makes sense to parametrize the SSE flag, or should we just force it to be 0 for now?

ebolyen commented 6 years ago

Data is here

benjjneb commented 6 years ago

We were able to get a hold of a dataset and can reproduce the issue locally.

Great! Will try reproducing locally with this data too. What is the exact command you guys are running to reproduce the segfault?

We're hoping to release a patch for q2-dada2 with these fixes. Do you think it makes sense to parametrize the SSE flag, or should we just force it to be 0 for now?

Assuming SSE=0 is confirmed to fix the issue, my opinion would be that (sadly) it should just be forced to 0. That kills most of the performance gains from the update, as that is the part of the code specifically rewritten to speed up the bioconda build, but even if crashes are a minority it is probably worth prioritizing that.

Long term I don't think making an SSE flag makes sense. The SSE setting is a purely internal thing, that should not affect output in anyway. I'll keep it R side for dev purposes, but seems like clutter on the plugin.

benjjneb commented 6 years ago

Hooray, I can reproduce it on my end! Running the following (from the thread data was posted in):

qiime dada2 denoise-paired --i-demultiplexed-seqs 3_2_demux-paired-end.qza --o-table run2_table.qza --o-representative-sequences run2_rep-seqs.qza --p-trim-left-f 0 --p-trim-left-r 0 --p-trunc-len-f 300 --p-trunc-len-r 240 --verbose

Will look into this more tomorrow. My first guess is that I introduced some edge case memory bug into the C code.

benjjneb commented 6 years ago

Quick update: I can reproduce the crash with SSE=2, and can confirm that SSE=0 fixes the crash, but it also appears that SSE=1 fixes the crash.

I'm going to run some more comprehensive tests, but if I can confirm that SSE=1 works, then SSE=1 the right patch to make, as SSE=1 still preserves the explicit SSE speedups. Will try to confirm this by the end of the day.

SSE=2 is a "packed" 1-byte implementation, while SSE=1 is a simpler two-byte implementation. SSE=2 is a bit faster, but SSE=1 will still be way faster than the non-explicit-vectorized SSE=0 for the bioconda build

benjjneb commented 6 years ago

The underlying bug is finally squashed on the R side https://github.com/benjjneb/dada2/commit/4a89b96cee0b4cce6b16d53608cf7f631e498369

So next release cycle (1.8) can move back to SSE=2