nunofonseca / irap

integrated RNA-seq Analysis Pipeline
GNU General Public License v3.0
82 stars 33 forks source link

Contamination check errors #38

Closed freekvh closed 6 years ago

freekvh commented 6 years ago

Hi Nuno,

While using iRAP to filter out possible contamination from a mouse genome, we ran into an issue (mind you we are still using 0.8.5p2 but I didn't see any indication of a fix in the mean time).

In the script irap_fastq_qc (in the scripts folder), on line 680, bowtie is used with the options --un, this allows one to generate a sam file with all reads mapped to the contaminating genome and a new fastq file with all those reads removed. The sam file is stored as a bam file (with samtool view -b), but this specific process goes wrong with an error which I don't have handy right now.

What we did to fix it is remove the sam to bam command and then the pipeline progresses without problems (the bam file with reads from the contaminating organism does not seem to be used anymore). We highly suspect that the error is due to a space put in the read name put there by Bowtie. We found online that other people also seem have the same problem with Bowtie specifically.

If you want we can supply more details but as said I don't have them handy and perhaps you already know what could be wrong. Please let me know if your want me to be more specific.

We have solved this problem for ourselves but perhaps you may also want to solve it in iRAP.

Highest regards,

Freek.

Edit: A colleague hunted down the error, this was it (following samtools view -b contamination_reads.sam):

[W::sam_read1] parse error at line 1
[main_samview] truncated file.
Error while flushing and closing output
terminate called after throwing an instance of 'int'
nunofonseca commented 6 years ago

Hi, Thank you for the detailed report. In September (version 0.8.5.p3) there was a fix in the invocation of Bowtie - I'm not sure if it would solve the above issue. If you are willing to try please let me know if it continues to fail. I made several changes to the script for the coming 0.9 version which should make the script more robust and flexible. If it is a bowtie bug an alternative is to use bowtie2 by adding cont_mapper=bowtie2 in the configuration file. Cheeers.

freekvh commented 6 years ago

Thanx for the swift reply, as always.

Depending on how our run ends in the next days, we will first test with bowtie2 as the cont_mapper and perhaps then try a newer version.

I was wondering, is there a changelog, I can't find it in the codebase anywhere.. but that could be me.

And one more small question I have: iRAP is aimed at RNAseq but in principle it could be used with DNAseq (using Bowtie or BWA as the mapper), is this recommended or are there really RNA specific steps in the steps that lead to the sorted Bam file?

Looking forward to 0.9,

Highest regards,

Freek.

nunofonseca commented 6 years ago

Hi, There is no changelog, however you may get the list of changes (in the last 6 months) by running git log --since="6 months ago" in the cloned iRAP folder. Yes, in theory iRAP could be used to align DNAseq data. The only step that may be missing is to remove duplicated reads (you may check the fastqc report to see if this is a problem in your data). Cheers.

nunofonseca commented 6 years ago

Hi! In preparation of the coming release (0.9.0) I'm going through all issues. This seems to sorted in the new version - please reopen the issue if necessary. Cheers.