sivarajankumar / fluxcapacitor

Automatically exported from code.google.com/p/fluxcapacitor

error when producing FASTA / FASTQ #62

Closed: GoogleCodeExporter closed this issue 8 years ago

GoogleCodeExporter commented 8 years ago
Simulator 1.0 RC3

I am trying to simulate reads from the human genome (hg19) with gene
annotations from Ensembl (v60). The simulator works fine when generating
reads without FASTA output. However, if I use the default error model 76
and enable FASTA output, I get error messages like this:

sequencing Problems reading 20: 336933, -3> 64286035 into 52: null
check for the right species/genome version!
java.lang.IndexOutOfBoundsException
at java.io.RandomAccessFile.readBytes(Native Method)
at java.io.RandomAccessFile.read(RandomAccessFile.java:338)
at java.io.RandomAccessFile.readFully(RandomAccessFile.java:397)
at fbi.genome.model.Graph.readSequence(Graph.java:1256)
at fbi.genome.model.bed.BEDobject2.readSequence(BEDobject2.java:251)
at fbi.genome.sequencing.rnaseq.simulation.Sequencer.createQSeq(Sequencer.java:565)
at fbi.genome.sequencing.rnaseq.simulation.Sequencer.access$1000(Sequencer.java:44)
at fbi.genome.sequencing.rnaseq.simulation.Sequencer$SequenceWriter.writeRead(Sequencer.java:888)
at fbi.genome.sequencing.rnaseq.simulation.Sequencer$Processor.process(Sequencer.java:737)
at fbi.genome.sequencing.rnaseq.simulation.Sequencer.sequence(Sequencer.java:290)
at fbi.genome.sequencing.rnaseq.simulation.Sequencer.call(Sequencer.java:107)
at fbi.genome.sequencing.rnaseq.simulation.SimulationPipeline.call(SimulationPipeline.java:351)
at fbi.genome.sequencing.rnaseq.simulation.SimulationPipeline.call(SimulationPipeline.java:32)
at fbi.commons.flux.Flux.main(Flux.java:168)

The names of the chromosome FASTA files match the chromosome names in
the GTF file.
Do you have any idea what is going wrong here?
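
For readers who want to verify the name match described above, here is a minimal sketch in Java. It is not part of the Flux Simulator. The class and method names are hypothetical, and it assumes one FASTA file per chromosome named `<name>.fa` in a single directory. It only compares names and does not verify that annotation coordinates fit within the sequences.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of the Flux Simulator.
class ChromosomeNameCheck {

    // Collect the distinct values of the first GTF column (the sequence name),
    // skipping empty lines and comment lines starting with '#'.
    static Set<String> chromosomeNames(List<String> gtfLines) {
        Set<String> names = new TreeSet<>();
        for (String line : gtfLines) {
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            int tab = line.indexOf('\t');
            names.add(tab < 0 ? line : line.substring(0, tab));
        }
        return names;
    }

    // Return the names for which no "<name>.fa" file exists in genomeDir.
    // The ".fa" suffix is an assumption; adjust it to the local naming scheme.
    static Set<String> missingFasta(Set<String> names, Path genomeDir) {
        Set<String> missing = new TreeSet<>();
        for (String name : names) {
            if (!Files.isRegularFile(genomeDir.resolve(name + ".fa"))) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: ChromosomeNameCheck <annotation.gtf> <genomeDir>");
            return;
        }
        // Reads the whole annotation into memory, which is acceptable for a sketch.
        Set<String> names = chromosomeNames(Files.readAllLines(Paths.get(args[0])));
        System.out.println("Sequence names without a FASTA file: "
                + missingFasta(names, Paths.get(args[1])));
    }
}
```

An empty result only shows that the names line up; it says nothing about whether the genome build matches the annotation.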

Original issue reported on code.google.com by gmicha@gmail.com on 1 Sep 2011 at 12:22

GoogleCodeExporter commented 8 years ago
I have attached the parameter file i am using to simulate the reads.
best,
thomas

Original comment by T.Bonf...@googlemail.com on 2 Sep 2011 at 7:23

GoogleCodeExporter commented 8 years ago
The bug when generating FASTA/FASTQ sequences occurs when read identifiers are
sufficiently long. Ensembl transcript identifiers are comparatively long, and
because the transcript identifier is part of the read ID / FASTA tag, the issue
occurs in the given dataset.

Original comment by gmicha@gmail.com on 2 Sep 2011 at 2:33

GoogleCodeExporter commented 8 years ago
Hi Micha,

Thank you for your fast reply and analysis. However, I think the problem must
be somewhere else. I have mapped the Ensembl transcript IDs to a set of integer
values and replaced the original IDs with them, but this doesn't help...

Cheers,
thomas

Original comment by T.Bonf...@googlemail.com on 5 Sep 2011 at 11:41

GoogleCodeExporter commented 8 years ago
Hi Thomas,

When reproducing the issue you described, I definitely came across a bug in the
code, and removing the erroneous lines made the problem disappear. The
circumstances that provoke the error are difficult to predict in the general
case: it was related to an overflow in a buffer used during FASTA/FASTQ
creation. The length of the read identifier certainly influenced the aberrant
behavior, so the length of Ensembl identifiers seemed to me the most likely
explanation for why we had not noticed the problem before.
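
To illustrate the kind of overflow described above, here is a minimal sketch in Java. It is not the Flux Simulator code. The class name, the method name, and the layout of the buffer (read ID first, sequence bytes directly after it) are assumptions made for illustration. It shows how `RandomAccessFile.readFully(byte[], int, int)` throws an `IndexOutOfBoundsException`, as in the stack trace of the report, once the ID and the read together no longer fit into a fixed-size buffer.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical illustration, not the actual Flux Simulator implementation.
class BufferOverflowSketch {

    // Copies the read ID into a fixed-size buffer and then reads readLength
    // bytes from the genome file into the space behind it. Returns true if
    // the buffer is too small for ID plus sequence.
    static boolean overflows(String readId, int readLength, int capacity, File genome)
            throws IOException {
        byte[] buffer = new byte[capacity];
        byte[] id = readId.getBytes(StandardCharsets.US_ASCII);
        try (RandomAccessFile raf = new RandomAccessFile(genome, "r")) {
            // Throws if the ID alone is longer than the buffer.
            System.arraycopy(id, 0, buffer, 0, id.length);
            // Throws if id.length + readLength exceeds buffer.length.
            raf.readFully(buffer, id.length, readLength);
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a chromosome file: 200 bytes of arbitrary content.
        File genome = File.createTempFile("genome", ".fa");
        genome.deleteOnExit();
        Files.write(genome.toPath(), new byte[200]);

        String shortId = "read_1";
        String longId = "0123456789012345678901234567890123456789"; // 40 characters

        // Buffer capacity 64 and read length 36 are arbitrary example values.
        System.out.println("short ID overflows: " + overflows(shortId, 36, 64, genome));
        System.out.println("long ID overflows:  " + overflows(longId, 36, 64, genome));
    }
}
```

In a layout like this one, sizing the buffer from the actual ID length plus the read length, rather than from a fixed capacity, would avoid the dependence on identifier length.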

However, the example you provided works well for us with the bundle we just put 
in the download section: 

http://fluxcapacitor.googlecode.com/files/fbi.genome.simulator-1.0-RC4.tar.gz

I am therefore marking the issue as fixed; please notify me if you have
contradicting information.

cheers, micha

Original comment by gmicha@gmail.com on 5 Sep 2011 at 12:41

GoogleCodeExporter commented 8 years ago
Hi Micha,

the update works, thank you for your help! :)

best,
t

Original comment by T.Bonf...@googlemail.com on 5 Sep 2011 at 2:43