simulator fasta compiler error

GoogleCodeExporter commented 8 years ago

For which program(s) you want a new feature?
(Simulator)

Which build of the program(s)?
20090824

What operating system you use?
(unix32, unix64)

Problem description:
The fasta compiler of the simulator writes error messages like:
Complement: unknown symbol >
Complement: unknown symbol I
on the console. It seems as if the fasta compiler does not skip the fasta header
line in the chr*.fa files in the genome folder and treats it as sequence.

Original issue reported on code.google.com by mun...@gmail.com on 29 Aug 2009 at 1:26

GoogleCodeExporter commented 8 years ago

I cannot reproduce the error, neither on unix nor on windows. Usually, when the
genomic sequence is read, the fasta file with the genomic sequence is scanned
from the beginning until the first line starting with ">" is found. From there
on, the rest of the lines are interpreted as sequence lines. They all have to
use the same line break and line length (currently unchecked).

Check whether:
- you do not use a multi-fasta file
- the FASTA tag in the genomic sequence file does not contain a line break
- the sequence lines of the genomic sequence are of the same length, including
  the line break
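
The checks above could be scripted roughly as follows. This is only a sketch, not the simulator's code: the class and method names are made up for illustration, the last sequence line is assumed to be allowed to be shorter than the others, and the line break style itself is not inspected because the lines are assumed to be already split.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: checks the layout the reader expects from a chromosome file. */
final class FastaLayoutCheck {

    /** Returns null if the layout looks fine, otherwise a description of the first problem. */
    static String check(List<String> lines) {
        // Scan from the beginning until the first line starting with ">".
        int i = 0;
        while (i < lines.size() && !lines.get(i).startsWith(">")) i++;
        if (i == lines.size()) return "no header line starting with '>'";

        int width = -1;            // length of the first sequence line
        boolean shortSeen = false; // a shorter line is only allowed as the last line
        for (i = i + 1; i < lines.size(); i++) {
            String line = lines.get(i);
            if (line.startsWith(">")) return "second header at line " + (i + 1) + " (multi-fasta file)";
            if (shortSeen) return "line " + i + " is shorter than the others but is not the last line";
            if (width < 0) width = line.length();
            else if (line.length() > width) return "line " + (i + 1) + " is longer than the first sequence line";
            else if (line.length() < width) shortSeen = true;
        }
        return null;
    }

    public static void main(String[] args) {
        List<String> good = Arrays.asList(">I", "ACGT", "ACGT", "AC");
        List<String> multi = Arrays.asList(">I", "ACGT", ">II", "ACGT");
        System.out.println(check(good));
        System.out.println(check(multi));
    }
}
```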

If all that holds, attach here *the head* (first 10 lines or so) of this
chromosome's file. I implemented an additional dialog in the GUI that should
provide a bit more information about what exactly led to reading the string,
obviously a substring of the tag.

Remains open until further information.

Original comment by gmicha@gmail.com on 31 Aug 2009 at 1:21

Changed state: Started
Added labels: Simulator

GoogleCodeExporter commented 8 years ago

I have a theory about what goes wrong and, in case it does not apply, a project
that should allow error reproduction. The project is attached to this comment.

My theory:
If a transcript's chromstart is close to the beginning of the chromosome and the
TSS variation pulls the chromstart before the chromosome's beginning (<0), then
the compiler starts reading in the fasta header line.
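
To illustrate the theory: if sequence positions are mapped to file offsets without a range check, a negative start lands inside the header bytes. The sketch below is hypothetical, since the obfuscated stack trace does not show how the reader actually computes offsets; the names and the tiny test file are made up.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration of an unchecked position-to-offset mapping. */
final class NegativeStartDemo {

    /**
     * Naive mapping of a 0-based sequence position to a byte offset in a
     * single-record fasta file with fixed-width lines. No range check on pos.
     */
    static long naiveOffset(long pos, int headerBytes, int lineBases, int lineBreakBytes) {
        // Java integer division truncates toward zero, so small negative
        // positions add no line breaks and simply count backwards into the header.
        return headerBytes + pos + (pos / lineBases) * lineBreakBytes;
    }

    public static void main(String[] args) {
        String fasta = ">I dna:chromosome\nCCACACCACA\nCCCACACACC\n";
        byte[] file = fasta.getBytes(StandardCharsets.US_ASCII);
        int headerBytes = fasta.indexOf('\n') + 1; // header line including its line break

        long ok = naiveOffset(12, headerBytes, 10, 1);
        long bad = naiveOffset(-3, headerBytes, 10, 1);
        System.out.println("pos 12 -> byte " + ok + " = " + (char) file[(int) ok]);
        System.out.println("pos -3 -> byte " + bad + " = " + (char) file[(int) bad]
                + " (inside header: " + (bad < headerBytes) + ")");
    }
}
```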

the complete error message:
java.lang.RuntimeException: Problems reading BED object sequence:
        I       -115    167     I:50-500W:MARV01:3:255:-163:69:-163:69  0       -.. 
   2215,18   0,264
        Problems reading sequence I: pos 0, len 215
        Complement: unknown symbol > in >I dna:chromosome chromosome:SGD1.01:I:1:230208:1
CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACCCCACACACA
CATCCTAACACTACCCTAACACAGCCCTAATCTAACCCTGGCCAACCTGTCCTCAACTT
ACCCTCCATTACCCTGCCTCCACTCGTTACCCTGTCCCATTCAAC
        at genome.io.A.C.F(Unknown Source)
        at genome.sequencing.rnaseq.simulation.A.A(Unknown Source)
        at genome.sequencing.rnaseq.simulation.A.g(Unknown Source)
        at genome.sequencing.rnaseq.simulation.A.run(Unknown Source)
        at genome.sequencing.rnaseq.simulation.A.C.run(Unknown Source)
        at java.lang.Thread.run(Thread.java:619)
        at genome.sequencing.rnaseq.simulation.A.H$_A.run(Unknown Source)

Original comment by mun...@gmail.com on 31 Aug 2009 at 9:16

Attachments:

project.tar.bz

GoogleCodeExporter commented 8 years ago

Correct, that also matches what I suspected. Your gene is 50nt away from the
chromosome start, and the exponentially random TSS function rolled a start that
was more than 50nt upstream of the annotated start.

There are at least two possibilities to fix that: (i) forbid impossible start
positions at the time the expressed molecules are varied, (ii) fix the
sequencing of areas outside of the genomic sequence. A third one would be to add
a sophisticated promoter model, but this is currently out of the time scope.

Despite the little we know about TSS variations, a polymerase will for sure not
bind outside of the chromosome, which speaks for correcting by (i). However, we
would face similarly unrealistic scenarios if a TSS variation falls before the
TATA or other important signals of a gene that is located around the center of
the chromosome. On the other hand, albeit I do not know how widely this holds
across different species, I remember that human chromosome ends are not very
well defined, as the telomerase enzyme adds a varying number of nucleotides to
the ends. That would justify simply "extending" the chromosome by 'N' or 'n'
characters in case the TSS falls outside. Obviously, the mitochondrial genome is
actually circular, and one should take the nucleotides from the respective
sequence end. In an optimal world we will have all that in our simulations one
day.
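
The options discussed above can be sketched side by side. This is a minimal illustration with made-up names and an in-memory chromosome string, not the fix that went into the simulator: clamping corresponds to (i), padding with 'N' to (ii), and the wrap-around variant to the circular case.

```java
/** Hypothetical sketches of handling a transcript start before the chromosome start. */
final class OutOfRangeFixes {

    /** Option (i): never let a varied start fall before the chromosome start (0-based). */
    static int clampStart(int variedStart) {
        return Math.max(0, variedStart);
    }

    /** Option (ii): read [start, end) and pad everything outside the chromosome with 'N'. */
    static String readPadded(String chromosome, int start, int end) {
        StringBuilder sb = new StringBuilder(Math.max(0, end - start));
        for (int p = start; p < end; p++) {
            boolean inside = p >= 0 && p < chromosome.length();
            sb.append(inside ? chromosome.charAt(p) : 'N');
        }
        return sb.toString();
    }

    /** Circular genome: positions outside the sequence wrap around to the other end. */
    static String readCircular(String chromosome, int start, int end) {
        StringBuilder sb = new StringBuilder(Math.max(0, end - start));
        for (int p = start; p < end; p++) {
            sb.append(chromosome.charAt(Math.floorMod(p, chromosome.length())));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String chr = "ACGT";
        System.out.println(clampStart(-115));
        System.out.println(readPadded(chr, -2, 3));
        System.out.println(readCircular(chr, -2, 3));
    }
}
```

Clamping changes the simulated molecule, whereas padding keeps its length and marks the unknown bases, which is closer to the "extend the chromosome" idea.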

My current feeling would favor the correction of the chromosome ends to meet
"transcription out of range". That is most consistent with the current level of
abstraction of the whole simulation. A more refined model is desirable, but only
if promoter locations and extensions are considered generally. Please tell me
whether this is an acceptable solution for you.

In this respect, note the parameters (and default values)

DEF_POLYA_VAR= true
DEF_TSS_VAR= true
DEF_TSS_MEAN= 25.0
DEF_POLYA_SHAPE= 2.0
DEF_POLYA_SCALE= 300.0

that de/activate and control variation of the annotated start and end of the
transcript (exponential mean, respectively shape and scale of a Weibull
distribution).
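
For a feeling of what these defaults imply, the two distributions can be sampled by inverse transform as in the sketch below. How the simulator actually draws its values is not shown in this thread, so the method names and the use of java.util.Random are assumptions; only the distribution families and the default values are taken from the parameters above.

```java
import java.util.Random;

/** Hypothetical sketch: sampling with the default parameter values listed above. */
final class EndVariationSketch {

    /** Exponential sample with the given mean (compare DEF_TSS_MEAN). */
    static double sampleExponential(Random rnd, double mean) {
        // nextDouble() is in [0, 1), so 1 - u is in (0, 1] and the log is finite.
        return -mean * Math.log(1.0 - rnd.nextDouble());
    }

    /** Weibull sample with the given shape and scale (compare DEF_POLYA_SHAPE, DEF_POLYA_SCALE). */
    static double sampleWeibull(Random rnd, double shape, double scale) {
        return scale * Math.pow(-Math.log(1.0 - rnd.nextDouble()), 1.0 / shape);
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        int n = 100000;
        int beyond50 = 0;
        for (int i = 0; i < n; i++) {
            // Count how often a start would be rolled more than 50nt upstream.
            if (sampleExponential(rnd, 25.0) > 50.0) beyond50++;
        }
        System.out.println("fraction of TSS offsets > 50nt: " + (double) beyond50 / n);
        System.out.println("one Weibull(2.0, 300.0) sample: " + sampleWeibull(rnd, 2.0, 300.0));
    }
}
```

With an exponential mean of 25, the probability of an offset above 50 is exp(-2), about 13.5%, so a gene 50nt from the chromosome start would run into the problem quite regularly under this assumption.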

Original comment by gmicha@gmail.com on 31 Aug 2009 at 9:58

GoogleCodeExporter commented 8 years ago

It is absolutely acceptable, yes. Assuming that the vast majority of loci are
sufficiently far away from the chromosome start to not cause any problem, for my
project it is even sufficient to remove all transcripts at critical loci from
the annotation.

Original comment by mun...@gmail.com on 4 Sep 2009 at 2:28

GoogleCodeExporter commented 8 years ago

Problem already fixed since build 20090831.

Original comment by gmicha@gmail.com on 14 May 2010 at 6:57

Changed state: Fixed

sivarajankumar / fluxcapacitor

simulator fasta compiler error #24