karel-brinda / rnftools

RNF framework for NGS: simulation of reads, evaluation of mappers, conversion of RNF-compliant data.
http://karel-brinda.github.io/rnftools
MIT License
14 stars 5 forks source link

Any idea what this error #49

Closed astewart-twist closed 7 years ago

astewart-twist commented 8 years ago

Running a very basic example from the documentaiton:

import rnftools
import smbl

rnftools.mishmash.sample("miseq",reads_in_tuple=1)

rnftools.mishmash.ArtIllumina(
    fasta="reference.fa",
    number_of_read_tuples=10000,
    read_length_1=100,
    read_length_2=0,
)

include: rnftools.include()
rule: input: rnftools.input()

Then running snakemake I get the following.

snakemake                                                                                  
Multiple include of /usr/local/lib/python3.4/site-packages/smbl/include_all.snake ignored                                          
Provided cores: 1                                                                                                                  
Rules claiming more threads will be scaled down.                                                                                   
Job counts:                                                                                                                        
        count   jobs                                                                                                               
        1       42                                                                                                                 
        1       43                                                                                                                 
        1       45                                                                                                                 
        1       index_fasta                                                                                                        
        4                                                                                                                          
Creating index for a FASTA file (template.fa).                                                                                     
1 of 4 steps (25%) done                                                                                                            
rule 43:                                                                                                                           
        input: /root/.smbl/bin/art_illumina, /root/.smbl/bin/samtools, template.fa, template.fa.fai                                
        output: miseq1/001/_final_reads.fq, miseq1/001/tmp.1.sam, miseq1/001/tmp.1.corrected.sam, miseq1/001/tmp.1.fq              
Error in job 43 while creating output files miseq1/001/_final_reads.fq, miseq1/001/tmp.1.sam, miseq1/001/tmp.1.corrected.sam, miseq
1/001/tmp.1.fq.                                                                                                                    
RuleException:                                                                                                                     
CalledProcessError in line 52 of /usr/local/lib/python3.4/site-packages/rnftools/mishmash/mishmash.snake:                          
Command 'cat "miseq1/001/tmp.1.sam" | grep -v ^@ | "/root/.smbl/bin/samtools" view -h -T "template.fa" - > "miseq1/001/tmp.1.correc
ted.sam"' returned non-zero exit status 1                                                                                          
  File "/usr/local/lib/python3.4/site-packages/rnftools/mishmash/ArtIllumina.py", line 142, in create_fq                           
  File "/usr/local/lib/python3.4/site-packages/smbl/utils/shell.py", line 20, in shell                                             
Removing output files of failed job 43 since they might be corrupted:                                                              
miseq1/001/tmp.1.sam, miseq1/001/tmp.1.corrected.sam, miseq1/001/tmp.1.fq                                                          
Will exit after finishing currently running jobs.                                                                                  
Exiting because a job execution failed. Look above for error message        

There's no issue using another simulator like DwgSim, it seems to just be an ArtIllumina thing. reference.fa consists of about a 1000 80mers. I'm wondering if this may be the problem?

karel-brinda commented 8 years ago

Hi, thank you for reporting this problem. Unfortunately, I cannot reproduce it. Could you put here a link to the reference file? Which OS do you use? Could you call snakemake -p to see all shell commands?

astewart-twist commented 8 years ago

Thanks for the snakemake -p tip @karel-brinda , I'll try that first. (mostly just looking for a way to get more troubleshooting information here).

The reference file is just random 81mers (I mistakenly wrote read_length_1=100 above). I can post an example:

>seq_1                                                                                                                              
ACATCATGCATGACATGTACGTGAGTATACACATGTGTGATCGCTCGCGCGCATCTCGTA                                                                       
TGCCGTCTTCTGCTTGAAAAA                                                                                                              
>seq_2                                                                                                                              
TGCGCGCACTACGCTACGACTGCGCAGCACTCGTATGTGTAGAGATATATATATCTCGTA                                                                       
TGCCGTCTTCTGCTTGAAAAA                                                                                                              
>seq_3                                                                                                                              
ATGCGTGTATCGCATCATACTACGAGCAGTGCTCTGTCTGTAGACGACACAGATCTCGTA                                                                       
TGCCGTCTTCTGCTTGAAAAA                                                                                                              
>seq_4                                                                                                                              
TGCTCGACTCACTGTCATGTAGAGCAGACATCTGTCGATCGTACGTACAGAGATCTCGTA                                                                       
TGCCGTCTTCTGCTTGAAAAA                                                                                                              
>seq_5                                                                                                                              
ACACGTAGCTGTCTATGTCGCAGCTGTCACTGTGTCATGAGTAGCAGACACAATCTCGTA                                                                       
TGCCGTCTTCTGCTTGAAAAA                                                                                                              
>seq_6                                                                                                                              
ACGCACTCGATCTCTGACACGAGTATGATAGTATCTACGTAGAGTGTCGCGCATCTCGTA                                                                       
TGCCGTCTTCTGCTTGAAAAA                                                                                                              
>seq_7                                                                                                                              
CATCGACGATGTCTCTGCGACTGTGAGCGAGCTGACTGCGACACTATATATAATCTCGTA                                                                       
TGCCGTCTTCTGCTTGAAAAA                                                                                                              
>seq_8                                                                                                                              
ATCTCGTGATGATCACATGCGCGCAGCGAGATATCTGAGCTACGACGTCTATATCTCGTA                                                                       
TGCCGTCTTCTGCTTGAAAAA                                                                                                              
>seq_9                                                                                                                              
GCGCAGCTGCGATGCTCTCTATATCTGATCTCATCGATGTCACAGAGAGAGAATCTCGTA                                                                       
TGCCGTCTTCTGCTTGAAAAA                                                                                                              
>seq_10                                                                                                                             
CGTGTCTGTACTCACGAGCATAGAGTCGACTCATGTACGCATAGATCAGCGTATCTCGTA                                                                       
TGCCGTCTTCTGCTTGAAAAA                                          

By the way, I also created a Dockerfile to build a docker container with rnftools installed in it. Very useful easy installation and guaranteeing identical run time environments. I'll send over a pull-request if you like.

astewart-twist commented 8 years ago

I figured out what I was doing wrong with the verbose output. Thanks again!

karel-brinda commented 8 years ago

@astewart-twist What was the reason of the problems? Btw. I am working on a new big release of RNFtools and I already corrected several problems in git master.

astewart-twist commented 8 years ago

@karel-brinda I was trying to tell Art to simulate reads of length 81 from a list of reference sequences of length 80 (off-by-one error in my sequence randomization). Curious why DwgSim didn't complain though!

Sounds great re:release! I submitted a pull-request with a Dockerfile that will deploy a functioning rnftools environment based on rnftools from pip, but it could be easily modified to build the docker image from source.

astewart-twist commented 8 years ago

@karel-brinda Actually, I went back and intentionally reproduced the error (with snakemake -p) and I believe there may be an issue with the call to ArtIllumina. I set the size of my reference sequences to 52 and told ArtIlumina to simulate 52nt long reads.

Multiple include of /usr/local/lib/python3.4/site-packages/smbl/include_all.snake ignored                                          
Provided cores: 8                                                                                                                  
Rules claiming more threads will be scaled down. 
Job counts:                                                                                                                        
        count   jobs                                                                                                               
        1       20                                                                                                                 
        1       45                                                                                                                 
        1       46                                                                                                                 
        1       9                                                                                                                  
        1       basic                                                                                                              
        5                                                                                                                          
rule 9:                                                                                                                            
        input: /root/.smbl/bin/bwa, reference.fa                                                                                   
        output: reference.fa.amb, reference.fa.ann, reference.fa.bwt, reference.fa.pac, reference.fa.sa                            
rule 46:                                                                                                                           
        input: /root/.smbl/bin/art_illumina, /root/.smbl/bin/samtools, reference.fa, reference.fa.fai                              
        output: reads/001/_final_reads.fq, reads/001/tmp.1.sam, reads/001/tmp.1.corrected.sam, reads/001/tmp.1.fq                  
[bwa_index] Pack FASTA... 0.00 sec                                                                                                 
[bwa_index] Construct BWT for the packed sequence...                                                                               
[bwa_index] 0.00 seconds elapse.                                                                                                   
**[bwa_index] Update BWT... Error: the number of bases is not equal to the number of quality scores!                                 
qual size: 52,  read len: 51      **                                                                                                 
0.00 sec                                                                                                                           
[bwa_index] Pack forward-only FASTA... Error in job 46 while creating output files reads/001/_final_reads.fq, reads/001/tmp.1.sam, 
reads/001/tmp.1.corrected.sam, reads/001/tmp.1.fq.                                                                                 
RuleException:                                                                                                                     
CalledProcessError in line 52 of /usr/local/lib/python3.4/site-packages/rnftools/mishmash/mishmash.snake:                          
Command '/root/.smbl/bin/art_illumina -sam -na -i "reference.fa" -l 52 -rs 1 -f 450.63782584581253 -o "reads/001/tmp.1" -ir 0.0015 
-dr 0.0003 > /dev/null' returned non-zero exit status 1                                                                            
  File "/usr/local/lib/python3.4/site-packages/rnftools/mishmash/ArtIllumina.py", line 141, in create_fq                           
  File "/usr/local/lib/python3.4/site-packages/smbl/utils/shell.py", line 20, in shell                                             
Removing output files of failed job 46 since they might be corrupted:                                                              
reads/001/tmp.1.sam, reads/001/tmp.1.fq                                                                                            
0.00 sec                                                                                                                           
[bwa_index] Construct SA from BWT and Occ... 0.00 sec                                                                              
[main] Version: 0.7.13-r1134-dirty                                                                                                 
[main] CMD: /root/.smbl/bin/bwa index reference.fa                                                                                 
[main] Real time: 0.390 sec; CPU: 0.008 sec                                                                                        
Will exit after finishing currently running jobs.                                                                                  
1 of 5 steps (20%) done                                                                                                            
Will exit after finishing currently running jobs.                                                                                  
Exiting because a job execution failed. Look above for error message      

The key is this line: [bwa_index] Update BWT... Error: the number of bases is not equal to the number of quality scores! qual size: 52, read len: 51

Is it possible that ART is simulating 51nt long reads instead of 52nt long reads for some reason?

astewart-twist commented 8 years ago

After considerable testing, I think this issue might be related to increasing deletion rates to Art. Given the fairly short reference sequences I'm using, maybe it has something to do with deletions at the ends causing a mismatch in the number of bases and quality scores?

karel-brinda commented 8 years ago

So if I understand it correctly, it is a bug of Art, not RNFtools? In such a case, it should be reported to Weichun Huang (whduke@gmail.com). Btw. I already do some corrections of produced SAM files because they are not valid.

What we can do with this bug in RNFtools is to catch the exception and write an error message saying that the simulator failed from some reason.

astewart-twist commented 8 years ago

If this is a bug with Art, then it may already be fixed. I noticed that the version being run through smbl is rather old:

ART_Illumina (2008-2015)                                                                               
Q Version 2.3.7 (Mar 19, 2015)          

A recent release on the Art website notes the following change:

"fixed a bug related to the crash when simulating Illumina reads with a high error rate"

I can try to replace the version in the smbl directory with that latest release and see if that clears up the bug.

karel-brinda commented 8 years ago

Hello @astewart-twist , did you try it with the last version of ART (you can just rewrite the program in ˜/.smbl/bin)? If it solves the problem, I will update it in SMBL.

astewart-twist commented 8 years ago

Hi @karel-brinda. I tried to do just that but got an error.

snakemake -p                              
mkdir -p "/root/.smbl/bin" "/root/.smbl/fa" "/root/.smbl/src"                                                                       
Multiple include of /usr/local/lib/python3.4/site-packages/smbl/include_all.snake ignored                                           
Provided cores: 1                                                                                                                   
Rules claiming more threads will be scaled down.                                                                                    
Job counts:                                                                                                                         
        count   jobs                                                                                                                
        1       37                                                                                                                  
        1       45                                                                                                                  
        1       46                                                                                                                  
        1       6                                                                                                                   
        1       basic                                                                                                               
        5                                                                                                                           
rule 46:                                                                                                                            
        input: /root/.smbl/bin/art_illumina, /root/.smbl/bin/samtools, reference.fa, reference.fa.fai                               
        output: reads/001/_final_reads.fq, reads/001/tmp.1.sam, reads/001/tmp.1.corrected.sam, reads/001/tmp.1.fq                   
/root/.smbl/bin/art_illumina -sam -na -i "reference.fa" -l 51 -rs 1 -f 441.9717138103161 -o "reads/001/tmp.1" -ir 0.005 -dr 0.005 > 
/dev/null                                                                                                                           
cat "reads/001/tmp.1.sam" | grep -v ^@ | "/root/.smbl/bin/samtools" view -h -T "reference.fa" - > "reads/001/tmp.1.corrected.sam"   
Error in job 46 while creating output files reads/001/_final_reads.fq, reads/001/tmp.1.sam, reads/001/tmp.1.corrected.sam, reads/001
/tmp.1.fq.                                                                                                                          
RuleException:                                                                                                                      
TypeError in line 52 of /usr/local/lib/python3.4/site-packages/rnftools/mishmash/mishmash.snake:                                    
_open() takes at least 1 positional argument (0 given)                                                                              
  File "/usr/local/lib/python3.4/site-packages/rnftools/mishmash/ArtIllumina.py", line 153, in create_fq                            
  File "/usr/local/lib/python3.4/site-packages/rnftools/mishmash/Source.py", line 197, in recode_sam_reads                          
Removing output files of failed job 46 since they might be corrupted:                                                               
reads/001/_final_reads.fq, reads/001/tmp.1.sam, reads/001/tmp.1.corrected.sam, reads/001/tmp.1.fq                                   
Will exit after finishing currently running jobs.                                                                                   
Exiting because a job execution failed. Look above for error message

Not sure exactly what that's saying but my guess would be a slight change in the ART output files? (maybe he fixed the sam file issue?)

karel-brinda commented 8 years ago

Which version of PySam do you use? A similar error was causing problems several months ago, see https://github.com/pysam-developers/pysam/issues/182.

astewart-twist commented 8 years ago

Ah, yes, I'm intentionally using an older version of pysam (0.8.4) due to a compatibility issue with pysamstats. I just upgraded to pysam 0.9 and the above snakemake (with art_illumina 2.5) appears to be running beyond the earlier error. Fingers crossed the high error rate bug in this version of art_illumina fixes the original issue!

karel-brinda commented 8 years ago

If you have such problems with different versions of Python packages, it may be a good solution to use Conda environments: http://conda.pydata.org/docs/using/envs.html

astewart-twist commented 8 years ago

Yeah, I generally use virtualenvs, but same thing. I don't really need to have pysamstats/pysam=0.8.4 in the same environment as rnftools, so no big deal. I just removed it.

I had hoped that the new ART simulator was just taking awhile to run but it looks like it's just hanging:

$ snakemake -p
mkdir -p "/root/.smbl/bin" "/root/.smbl/fa" "/root/.smbl/src"                                                                       
Multiple include of /usr/local/lib/python3.4/site-packages/smbl/include_all.snake ignored                                           
Provided cores: 1                                                                                                                   
Rules claiming more threads will be scaled down.                                                                                    
Job counts:                                                                                                                         
        count   jobs                                                                                                                
        1       33                                                                                                                  
        1       45                                                                                                                  
        1       46                                                                                                                  
        1       7                                                                                                                   
        1       basic                                                                                                               
        1       index_fasta                                                                                                         
        6                                                                                                                           
Creating index for a FASTA file (reference.fa).                                                                                     

                        (                                                                                                           
                                "/root/.smbl/bin/samtools" faidx "reference.fa"                                                     
                        ) > /dev/null                                                                                               

1 of 6 steps (17%) done                                                                                                             
rule 46:                                                                                                                            
        input: /root/.smbl/bin/art_illumina, /root/.smbl/bin/samtools, reference.fa, reference.fa.fai                               
        output: reads/001/_final_reads.fq, reads/001/tmp.1.sam, reads/001/tmp.1.corrected.sam, reads/001/tmp.1.fq                   
/root/.smbl/bin/art_illumina -sam -na -i "reference.fa" -l 51 -rs 1 -f 441.9717138103161 -o "reads/001/tmp.1" -ir 0.005 -dr 0.005 > 
/dev/null                                                                                                                           
cat "reads/001/tmp.1.sam" | grep -v ^@ | "/root/.smbl/bin/samtools" view -h -T "reference.fa" - > "reads/001/tmp.1.corrected.sam"   
Removing output files of failed job 46 since they might be corrupted:                                                               
reads/001/_final_reads.fq, reads/001/tmp.1.sam, reads/001/tmp.1.corrected.sam, reads/001/tmp.1.fq                                   

Bummer.