mourisl / Rcorrector

Error correction for Illumina RNA-seq reads
GNU General Public License v3.0
63 stars 18 forks source link

Rcorrector dropping reads? #12

Closed macmanes closed 6 years ago

macmanes commented 6 years ago

I've uncovered a strange behavior that I'm having trouble understanding.. RCorrector installed via conda. I end up having lost a few R2 reads.

rcorrector                1.0.2                         1    bioconda

The command

perl run_rcorrector.pl -1 test.1.fq -2 test.2.fq

Output: Note the right number of reads are processed

Put the kmers into bloom filter
jellyfish bc -m 23 -s 100000000 -C -t 1 -o tmp_8946be6f4ed713dd539613da8d3abdf8.bc test.1.fq test.2.fq
Count the kmers in the bloom filter
jellyfish count -m 23 -s 100000 -C -t 1 --bc tmp_8946be6f4ed713dd539613da8d3abdf8.bc -o tmp_8946be6f4ed713dd539613da8d3abdf8.mer_counts test.1.fq test.2.fq
Dump the kmers
jellyfish dump -L 2 tmp_8946be6f4ed713dd539613da8d3abdf8.mer_counts > tmp_8946be6f4ed713dd539613da8d3abdf8.jf_dump
Error correction
/pylon5/mc3bg6p/macmanes/conda/Oyster_River_Protocol/software/anaconda/install/envs/orp_v2/bin/rcorrector   -p test.1.fq test.2.fq -c tmp_8946be6f4ed713dd539613da8d3abdf8.jf_dump
Stored 81770 kmers
Weak kmer threshold rate: 0.004975 (estimated from 0.950/1 of the chosen kmers)
Bad quality threshold is :
Processed 53898 reads
    Corrected 18524 bases.

Counting corrected reads: Note a few R2 reads lost

grep -c @DRR test.1.cor.fq
26949

grep -c @DRR test.2.cor.fq
26265

Counting original reads: 53898 total reads.

grep -c @DRR test.1.fq
26949

grep -c @DRR test.2.fq
26949

The dataset is here: https://github.com/macmanes-lab/Oyster_River_Protocol/tree/master/sampledata

I suspect some environment-specific issue here, but I can't understand what this could be.

macmanes commented 6 years ago

Further, when I reveres the order of the reads on input (incorrectly)

perl run_rcorrector.pl -1 test.2.fq -2 test.1.fq

I end up with fewer R1 reads.. Something about the -2 position/flag.

mourisl commented 6 years ago

Could you please try the newest version that has not yet been released yet? It could be fixed from the update of a2624ec from April 14, 2017.

macmanes commented 6 years ago

Yes, that fixes it. Are you willing to cut a patch release so I can get the fixed version in to bioconda? Thanks!

mourisl commented 6 years ago

Thanks! I just released the v1.0.3.

mourisl commented 5 years ago

I think this time I found the real reason causing the dropping reads from the second-mate file. And I've release v1.0.4 for this fix.