mourisl / Lighter

Fast and memory-efficient sequencing error corrector
GNU General Public License v3.0

Major behaviour change from 1.0.7 to 1.1.1 #24

Open tseemann opened 8 years ago

tseemann commented 8 years ago

Today I upgraded from lighter 1.0.7 to 1.1.1. I first noticed a problem when 1.1.1 wrote a different number of reads to the two output files, and then noticed it was also passing far fewer reads overall.

This is the command line:

lighter -od . -r R1.fq.gz -r R2.fq.gz -K 32 4000000 -t 72 -maxcor 2

This is the difference in read counts:

Files   R1.fq.gz
Reads   3747457  # original reads
Files   R2.fq.gz
Reads   3747457

Files   1.0.7-R1.cor.fq.gz
Reads   3747457  # none missing
Files   1.0.7-R2.cor.fq.gz
Reads   3747457

Files   1.1.1-R1.cor.fq.gz
Reads   2511489  # lots missing
Files   1.1.1-R2.cor.fq.gz
Reads   2511506  # has 17 more reads!

Any ideas?
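For anyone wanting to reproduce the comparison above, here is a minimal sketch of a read-count check. It assumes standard gzipped FASTQ with exactly four lines per record (no wrapped sequences). The function names `count_fastq_reads` and `compare_pair` are hypothetical helpers, not part of Lighter.

```python
import gzip


def count_fastq_reads(path):
    """Count records in a gzipped FASTQ file, assuming 4 lines per record."""
    with gzip.open(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        # A partial record usually means a truncated or malformed file.
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4


def compare_pair(r1_path, r2_path):
    """Return (R1 count, R2 count, whether the two counts match)."""
    n1 = count_fastq_reads(r1_path)
    n2 = count_fastq_reads(r2_path)
    return n1, n2, n1 == n2
```

Calling `compare_pair("R1.cor.fq.gz", "R2.cor.fq.gz")` on the corrected files from each build would show whether the pairs are still in sync. With the counts reported above, the 1.1.1 output would come back as mismatched.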

mourisl commented 8 years ago

I tested again on my data sets and could not reproduce the bug you encountered. Is there a way for me to access the data set you used? If not, can you show me the correction summary that Lighter prints to the screen? Thanks.

tseemann commented 8 years ago

I found the issue. If you compile with the default -O2 option, it works. In Linuxbrew, I used the system CXXFLAGS, which set -Os (optimize for size), and that causes the bug!
CC: @sjackman

See the output messages below:

Files   R1.fq.gz
Reads   3747457

This is g++ -O2 (which works)

$ ./lighter-1.1.1-O2 -od 1.1.1-O2 -r R1.fq.gz -r R2.fq.gz -K 32 4000000 -t 72 -maxcor 2
[2016-08-17 00:11:57] =============Start====================
[2016-08-17 00:11:57] Scanning the input files to infer alpha(sampling rate)
[2016-08-17 00:12:04] Average coverage is 141.346 and alpha is 0.050
[2016-08-17 00:12:05] Bad quality threshold is "B"
[2016-08-17 00:12:15] Finish sampling kmers
[2016-08-17 00:12:15] Bloom filter A's false positive rate: 0.006326
[2016-08-17 00:12:24] Finish storing trusted kmers
[2016-08-17 00:12:56] Finish error correction
Processed 7494914 reads:
        7042749 are error-free
        Corrected 579197 bases(1.280942 corrections for reads with errors)
        Trimmed 0 reads with average trimmed bases 0.000000
        Discard 0 reads

This is g++ -Os with missing reads!

$ ./lighter-1.1.1-Os -od 1.1.1-Os -r R1.fq.gz -r R2.fq.gz -K 32 4000000 -t 72 -maxcor 2
[2016-08-17 00:13:38] =============Start====================
[2016-08-17 00:13:38] Scanning the input files to infer alpha(sampling rate)
[2016-08-17 00:13:46] Average coverage is 141.346 and alpha is 0.050
[2016-08-17 00:13:47] Bad quality threshold is "B"
[2016-08-17 00:13:57] Finish sampling kmers
[2016-08-17 00:13:57] Bloom filter A's false positive rate: 0.006326
[2016-08-17 00:14:06] Finish storing trusted kmers
[2016-08-17 00:14:32] Finish error correction
Processed 5022995 reads:
        4719925 are error-free
        Corrected 388478 bases(1.281809 corrections for reads with errors)
        Trimmed 0 reads with average trimmed bases 0.000000
        Discard 0 reads
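
The two summaries can also be compared numerically. The sketch below parses only the three summary lines shown in the logs above; the regular expressions are assumptions based on that output, and `parse_summary` is a hypothetical helper, not something Lighter provides.

```python
import re

# Summary lines copied from the two runs above.
O2_SUMMARY = """Processed 7494914 reads:
        7042749 are error-free
        Corrected 579197 bases(1.280942 corrections for reads with errors)"""

OS_SUMMARY = """Processed 5022995 reads:
        4719925 are error-free
        Corrected 388478 bases(1.281809 corrections for reads with errors)"""


def parse_summary(text):
    """Extract read and base counts from a Lighter summary as shown above."""
    processed = int(re.search(r"Processed (\d+) reads", text).group(1))
    error_free = int(re.search(r"(\d+) are error-free", text).group(1))
    corrected = int(re.search(r"Corrected (\d+) bases", text).group(1))
    with_errors = processed - error_free
    return {
        "processed": processed,
        "with_errors": with_errors,
        "error_fraction": with_errors / processed,
        "corrections_per_read": corrected / with_errors,
    }


o2 = parse_summary(O2_SUMMARY)
os_build = parse_summary(OS_SUMMARY)
# Fraction of the input reads that the -Os build processed at all.
processed_fraction = os_build["processed"] / o2["processed"]
```

The -Os build processed about 67% of the reads that the -O2 build did, yet both runs have about 6% of reads with errors and about 1.28 corrections per such read. That pattern is consistent with reads being dropped before correction rather than being corrected differently, though the logs alone do not prove it.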

tseemann commented 7 years ago

Ping @mourisl - any ideas?

sjackman commented 7 years ago

As a workaround, you can use ENV.O2 in the formula to build with -O2 rather than the default -Os.