Closed astewart-twist closed 7 years ago
Hi, thank you for reporting this problem. Unfortunately, I cannot reproduce it. Could you put here a link to the reference file? Which OS do you use? Could you call snakemake -p
to see all shell commands?
Thanks for the snakemake -p
tip @karel-brinda , I'll try that first. (mostly just looking for a way to get more troubleshooting information here).
The reference file is just random 81mers (I mistakenly wrote read_length_1=100
above). I can post an example:
>seq_1
ACATCATGCATGACATGTACGTGAGTATACACATGTGTGATCGCTCGCGCGCATCTCGTA
TGCCGTCTTCTGCTTGAAAAA
>seq_2
TGCGCGCACTACGCTACGACTGCGCAGCACTCGTATGTGTAGAGATATATATATCTCGTA
TGCCGTCTTCTGCTTGAAAAA
>seq_3
ATGCGTGTATCGCATCATACTACGAGCAGTGCTCTGTCTGTAGACGACACAGATCTCGTA
TGCCGTCTTCTGCTTGAAAAA
>seq_4
TGCTCGACTCACTGTCATGTAGAGCAGACATCTGTCGATCGTACGTACAGAGATCTCGTA
TGCCGTCTTCTGCTTGAAAAA
>seq_5
ACACGTAGCTGTCTATGTCGCAGCTGTCACTGTGTCATGAGTAGCAGACACAATCTCGTA
TGCCGTCTTCTGCTTGAAAAA
>seq_6
ACGCACTCGATCTCTGACACGAGTATGATAGTATCTACGTAGAGTGTCGCGCATCTCGTA
TGCCGTCTTCTGCTTGAAAAA
>seq_7
CATCGACGATGTCTCTGCGACTGTGAGCGAGCTGACTGCGACACTATATATAATCTCGTA
TGCCGTCTTCTGCTTGAAAAA
>seq_8
ATCTCGTGATGATCACATGCGCGCAGCGAGATATCTGAGCTACGACGTCTATATCTCGTA
TGCCGTCTTCTGCTTGAAAAA
>seq_9
GCGCAGCTGCGATGCTCTCTATATCTGATCTCATCGATGTCACAGAGAGAGAATCTCGTA
TGCCGTCTTCTGCTTGAAAAA
>seq_10
CGTGTCTGTACTCACGAGCATAGAGTCGACTCATGTACGCATAGATCAGCGTATCTCGTA
TGCCGTCTTCTGCTTGAAAAA
By the way, I also created a Dockerfile to build a docker container with rnftools installed in it. Very useful easy installation and guaranteeing identical run time environments. I'll send over a pull-request if you like.
I figured out what I was doing wrong with the verbose output. Thanks again!
@astewart-twist What was the reason of the problems? Btw. I am working on a new big release of RNFtools and I already corrected several problems in git master.
@karel-brinda I was trying to tell Art to simulate reads of length 81 from a list of reference sequences of length 80 (off-by-one error in my sequence randomization). Curious why DwgSim didn't complain though!
Sounds great re:release! I submitted a pull-request with a Dockerfile that will deploy a functioning rnftools environment based on rnftools from pip, but it could be easily modified to build the docker image from source.
@karel-brinda Actually, I went back and intentionally reproduced the error (with snakemake -p) and I believe there may be an issue with the call to ArtIllumina. I set the size of my reference sequences to 52 and told ArtIlumina to simulate 52nt long reads.
Multiple include of /usr/local/lib/python3.4/site-packages/smbl/include_all.snake ignored
Provided cores: 8
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 20
1 45
1 46
1 9
1 basic
5
rule 9:
input: /root/.smbl/bin/bwa, reference.fa
output: reference.fa.amb, reference.fa.ann, reference.fa.bwt, reference.fa.pac, reference.fa.sa
rule 46:
input: /root/.smbl/bin/art_illumina, /root/.smbl/bin/samtools, reference.fa, reference.fa.fai
output: reads/001/_final_reads.fq, reads/001/tmp.1.sam, reads/001/tmp.1.corrected.sam, reads/001/tmp.1.fq
[bwa_index] Pack FASTA... 0.00 sec
[bwa_index] Construct BWT for the packed sequence...
[bwa_index] 0.00 seconds elapse.
**[bwa_index] Update BWT... Error: the number of bases is not equal to the number of quality scores!
qual size: 52, read len: 51 **
0.00 sec
[bwa_index] Pack forward-only FASTA... Error in job 46 while creating output files reads/001/_final_reads.fq, reads/001/tmp.1.sam,
reads/001/tmp.1.corrected.sam, reads/001/tmp.1.fq.
RuleException:
CalledProcessError in line 52 of /usr/local/lib/python3.4/site-packages/rnftools/mishmash/mishmash.snake:
Command '/root/.smbl/bin/art_illumina -sam -na -i "reference.fa" -l 52 -rs 1 -f 450.63782584581253 -o "reads/001/tmp.1" -ir 0.0015
-dr 0.0003 > /dev/null' returned non-zero exit status 1
File "/usr/local/lib/python3.4/site-packages/rnftools/mishmash/ArtIllumina.py", line 141, in create_fq
File "/usr/local/lib/python3.4/site-packages/smbl/utils/shell.py", line 20, in shell
Removing output files of failed job 46 since they might be corrupted:
reads/001/tmp.1.sam, reads/001/tmp.1.fq
0.00 sec
[bwa_index] Construct SA from BWT and Occ... 0.00 sec
[main] Version: 0.7.13-r1134-dirty
[main] CMD: /root/.smbl/bin/bwa index reference.fa
[main] Real time: 0.390 sec; CPU: 0.008 sec
Will exit after finishing currently running jobs.
1 of 5 steps (20%) done
Will exit after finishing currently running jobs.
Exiting because a job execution failed. Look above for error message
The key is this line: [bwa_index] Update BWT... Error: the number of bases is not equal to the number of quality scores! qual size: 52, read len: 51
Is it possible that ART is simulating 51nt long reads instead of 52nt long reads for some reason?
After considerable testing, I think this issue might be related to increasing deletion rates to Art. Given the fairly short reference sequences I'm using, maybe it has something to do with deletions at the ends causing a mismatch in the number of bases and quality scores?
So if I understand it correctly, it is a bug of Art, not RNFtools? In such a case, it should be reported to Weichun Huang (whduke@gmail.com). Btw. I already do some corrections of produced SAM files because they are not valid.
What we can do with this bug in RNFtools is to catch the exception and write an error message saying that the simulator failed from some reason.
If this is a bug with Art, then it may already be fixed. I noticed that the version being run through smbl is rather old:
ART_Illumina (2008-2015)
Q Version 2.3.7 (Mar 19, 2015)
A recent release on the Art website notes the following change:
"fixed a bug related to the crash when simulating Illumina reads with a high error rate"
I can try to replace the version in the smbl directory with that latest release and see if that clears up the bug.
Hello @astewart-twist , did you try it with the last version of ART (you can just rewrite the program in ˜/.smbl/bin
)? If it solves the problem, I will update it in SMBL.
Hi @karel-brinda. I tried to do just that but got an error.
snakemake -p
mkdir -p "/root/.smbl/bin" "/root/.smbl/fa" "/root/.smbl/src"
Multiple include of /usr/local/lib/python3.4/site-packages/smbl/include_all.snake ignored
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 37
1 45
1 46
1 6
1 basic
5
rule 46:
input: /root/.smbl/bin/art_illumina, /root/.smbl/bin/samtools, reference.fa, reference.fa.fai
output: reads/001/_final_reads.fq, reads/001/tmp.1.sam, reads/001/tmp.1.corrected.sam, reads/001/tmp.1.fq
/root/.smbl/bin/art_illumina -sam -na -i "reference.fa" -l 51 -rs 1 -f 441.9717138103161 -o "reads/001/tmp.1" -ir 0.005 -dr 0.005 >
/dev/null
cat "reads/001/tmp.1.sam" | grep -v ^@ | "/root/.smbl/bin/samtools" view -h -T "reference.fa" - > "reads/001/tmp.1.corrected.sam"
Error in job 46 while creating output files reads/001/_final_reads.fq, reads/001/tmp.1.sam, reads/001/tmp.1.corrected.sam, reads/001
/tmp.1.fq.
RuleException:
TypeError in line 52 of /usr/local/lib/python3.4/site-packages/rnftools/mishmash/mishmash.snake:
_open() takes at least 1 positional argument (0 given)
File "/usr/local/lib/python3.4/site-packages/rnftools/mishmash/ArtIllumina.py", line 153, in create_fq
File "/usr/local/lib/python3.4/site-packages/rnftools/mishmash/Source.py", line 197, in recode_sam_reads
Removing output files of failed job 46 since they might be corrupted:
reads/001/_final_reads.fq, reads/001/tmp.1.sam, reads/001/tmp.1.corrected.sam, reads/001/tmp.1.fq
Will exit after finishing currently running jobs.
Exiting because a job execution failed. Look above for error message
Not sure exactly what that's saying but my guess would be a slight change in the ART output files? (maybe he fixed the sam file issue?)
Which version of PySam do you use? A similar error was causing problems several months ago, see https://github.com/pysam-developers/pysam/issues/182.
Ah, yes, I'm intentionally using an older version of pysam (0.8.4) due to a compatibility issue with pysamstats. I just upgraded to pysam 0.9 and the above snakemake (with art_illumina 2.5) appears to be running beyond the earlier error. Fingers crossed the high error rate bug in this version of art_illumina fixes the original issue!
If you have such problems with different versions of Python packages, it may be a good solution to use Conda environments: http://conda.pydata.org/docs/using/envs.html
Yeah, I generally use virtualenvs, but same thing. I don't really need to have pysamstats/pysam=0.8.4 in the same environment as rnftools, so no big deal. I just removed it.
I had hoped that the new ART simulator was just taking awhile to run but it looks like it's just hanging:
$ snakemake -p
mkdir -p "/root/.smbl/bin" "/root/.smbl/fa" "/root/.smbl/src"
Multiple include of /usr/local/lib/python3.4/site-packages/smbl/include_all.snake ignored
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 33
1 45
1 46
1 7
1 basic
1 index_fasta
6
Creating index for a FASTA file (reference.fa).
(
"/root/.smbl/bin/samtools" faidx "reference.fa"
) > /dev/null
1 of 6 steps (17%) done
rule 46:
input: /root/.smbl/bin/art_illumina, /root/.smbl/bin/samtools, reference.fa, reference.fa.fai
output: reads/001/_final_reads.fq, reads/001/tmp.1.sam, reads/001/tmp.1.corrected.sam, reads/001/tmp.1.fq
/root/.smbl/bin/art_illumina -sam -na -i "reference.fa" -l 51 -rs 1 -f 441.9717138103161 -o "reads/001/tmp.1" -ir 0.005 -dr 0.005 >
/dev/null
cat "reads/001/tmp.1.sam" | grep -v ^@ | "/root/.smbl/bin/samtools" view -h -T "reference.fa" - > "reads/001/tmp.1.corrected.sam"
Removing output files of failed job 46 since they might be corrupted:
reads/001/_final_reads.fq, reads/001/tmp.1.sam, reads/001/tmp.1.corrected.sam, reads/001/tmp.1.fq
Bummer.
Running a very basic example from the documentaiton:
Then running
snakemake
I get the following.There's no issue using another simulator like DwgSim, it seems to just be an ArtIllumina thing.
reference.fa
consists of about a 1000 80mers. I'm wondering if this may be the problem?