Closed GoogleCodeExporter closed 8 years ago
I cannot reproduce the error, neither on Unix nor on Windows. Usually, when the
genomic sequence is read, the FASTA file with the genomic sequence is scanned
from the beginning until the first line starting with ">" is found. From there
on, the rest of the lines are interpreted as sequence lines. They all have to
use the same line break and the same line length (currently unchecked).
Check whether:
- you do not use a multi-fasta file
- the FASTA tag in the genomic sequence file does not contain a line break
- the sequence lines of the genomic sequence all have the same length, including the line break
If all that holds, attach here *the head* (first 10 lines or so) of this
chromosome file. I implemented an additional dialog in the GUI that should
provide a bit more information about what exactly led to reading the string,
which is obviously a substring of the tag.
Remains open until further information.
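The checks listed above can be sketched in a few lines of Java. This is only a minimal sketch, not the simulator's actual reader: the class `FastaCheck` and its method `problems` are hypothetical names, and it assumes the file has already been split into lines with the line terminators stripped, so it cannot detect mixed line breaks. A header that was wrapped onto a second line is caught only indirectly, as a "sequence" line containing non-letter characters.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the simulator: sanity checks for a
// single-sequence FASTA file whose lines have already been split.
class FastaCheck {

    // Returns a human-readable list of problems; an empty list means the
    // lines passed all checks.
    static List<String> problems(List<String> lines) {
        List<String> out = new ArrayList<>();
        if (lines.isEmpty() || !lines.get(0).startsWith(">")) {
            out.add("first line is not a FASTA header");
            return out;
        }
        int width = -1; // length of the first sequence line, once seen
        for (int i = 1; i < lines.size(); i++) {
            String line = lines.get(i);
            if (line.startsWith(">")) {
                out.add("line " + (i + 1) + ": second header (multi-FASTA)");
                continue;
            }
            if (!line.chars().allMatch(Character::isLetter)) {
                out.add("line " + (i + 1) + ": non-letter characters (wrapped header?)");
            }
            boolean last = (i == lines.size() - 1);
            if (width < 0) {
                width = line.length(); // first sequence line sets the width
            } else if (!last && line.length() != width) {
                out.add("line " + (i + 1) + ": length " + line.length()
                        + ", expected " + width);
            } else if (last && line.length() > width) {
                // the last line may be shorter, but never longer
                out.add("line " + (i + 1) + ": last line longer than " + width);
            }
        }
        return out;
    }
}
```

Calling `FastaCheck.problems(lines)` on the head of the chromosome file would give a quick first indication before attaching it.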
Original comment by gmicha@gmail.com
on 31 Aug 2009 at 1:21
I have a theory about what goes wrong and, in case it does not apply, a project
that should allow reproducing the error. The project is attached to this comment.
My theory:
If a transcript's chromStart is close to the beginning of the chromosome and the
TSS variation pulls the chromStart before the chromosome's beginning (<0), then
the program starts reading in the FASTA header line.
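To illustrate the theory, here is a minimal sketch of how a negative start could end up in the header. It assumes (this is a guess, not taken from the simulator's source) that the reader jumps to a position by computing a byte offset from a fixed line width; the class `FastaOffset` and all parameter names are hypothetical.

```java
// Hypothetical illustration, not the simulator's code: byte offset of a
// 0-based sequence position in a FASTA file with fixed-width lines.
class FastaOffset {

    // headerBytes: length of the header line including its line break
    // lineWidth:   residues per sequence line
    // breakBytes:  bytes per line break (1 for "\n", 2 for "\r\n")
    static long byteOffset(long pos, int headerBytes, int lineWidth, int breakBytes) {
        long fullLines = pos / lineWidth; // complete sequence lines before pos
        return headerBytes + pos + fullLines * breakBytes;
    }
}
```

Under these assumptions, position 0 maps exactly to the first byte after the header, so any negative position maps to an offset smaller than `headerBytes`, that is, into the header line or even before the start of the file.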
The complete error message:
java.lang.RuntimeException: Problems reading BED object sequence:
I -115 167 I:50-500W:MARV01:3:255:-163:69:-163:69 0 -..
2215,18 0,264
Problems reading sequence I: pos 0, len 215
Complement: unknown symbol > in >I dna:chromosome chromosome:SGD1.01:I:1:230208:1
CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACCCCACACACA
CATCCTAACACTACCCTAACACAGCCCTAATCTAACCCTGGCCAACCTGTCCTCAACTT
ACCCTCCATTACCCTGCCTCCACTCGTTACCCTGTCCCATTCAAC
at genome.io.A.C.F(Unknown Source)
at genome.sequencing.rnaseq.simulation.A.A(Unknown Source)
at genome.sequencing.rnaseq.simulation.A.g(Unknown Source)
at genome.sequencing.rnaseq.simulation.A.run(Unknown Source)
at genome.sequencing.rnaseq.simulation.A.C.run(Unknown Source)
at java.lang.Thread.run(Thread.java:619)
at genome.sequencing.rnaseq.simulation.A.H$_A.run(Unknown Source)
Original comment by mun...@gmail.com
on 31 Aug 2009 at 9:16
Correct, that also matches what I suspected. Your gene is 50 nt away from the
chromosome start, and the exponentially distributed random TSS function rolled a
start that was more than 50 nt upstream of the annotated start.
There are at least two possibilities to fix that: (i) forbid impossible start
positions at the time the expressed molecules are varied, (ii) fix the
sequencing of areas outside of the genomic sequence. A third one would be to add
a sophisticated promoter model, but there is currently no time for that.
Despite the little we know about TSS variations, a polymerase will certainly not
bind outside of the chromosome, which argues for fix (i). However, we would face
similarly unrealistic scenarios if a TSS variation falls before the TATA box or
other important signals of a gene that is located around the center of the
chromosome. On the other hand, although I do not know how widely this holds
across different species, I remember that human chromosome ends are not sharply
defined, as the telomerase enzyme adds a varying number of nucleotides to the
ends. That would justify simply "extending" the chromosome by 'N' or 'n'
characters in case the TSS falls outside. Obviously, the mitochondrial genome is
actually circular, and one should take the nucleotides from the respective
sequence end. In an optimal world we will have all that in our simulations one
day.
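As a rough illustration of the "extend by N" idea and of the circular case, here is a minimal sketch. It is not the code that went into the simulator: the class `PaddedSequence` and its `read` method are hypothetical, it assumes 0-based coordinates, and it assumes the (non-empty) chromosome is held in memory as a string.

```java
// Hypothetical sketch, not the simulator's implementation: read a window
// that may reach beyond either end of the chromosome.
class PaddedSequence {

    // start may be negative; positions outside [0, chromosome.length())
    // are filled with 'N' for linear chromosomes, or wrapped around to the
    // other end for circular ones. Assumes a non-empty chromosome.
    static String read(String chromosome, long start, int length, boolean circular) {
        StringBuilder sb = new StringBuilder(length);
        long n = chromosome.length();
        for (long i = start; i < start + length; i++) {
            if (circular) {
                // floorMod keeps the index in [0, n) even for negative i
                sb.append(chromosome.charAt((int) Math.floorMod(i, n)));
            } else if (i < 0 || i >= n) {
                sb.append('N');
            } else {
                sb.append(chromosome.charAt((int) i));
            }
        }
        return sb.toString();
    }
}
```

For example, `PaddedSequence.read("ACGT", -2, 4, false)` pads the two out-of-range positions with `N`, while the same call with `circular = true` takes them from the other end of the sequence.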
My current feeling would favor the correction of the chromosome ends to handle
"transcription out of range". That is most consistent with the current level of
abstraction of the whole simulation. A more refined model is desirable, but only
if promoter locations and extensions are considered generally. Please tell me
whether this is an acceptable solution for you.
In this respect, note the following parameters (and default values):
DEF_POLYA_VAR= true
DEF_TSS_VAR= true
DEF_TSS_MEAN= 25.0
DEF_POLYA_SHAPE= 2.0
DEF_POLYA_SCALE= 300.0
They de/activate variation of the annotated start and end of the transcript and
set the mean of the exponential distribution (TSS), respectively the shape and
scale of a Weibull distribution (poly-A).
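To give a feeling for how often the default TSS mean can push a start off the chromosome, here is a minimal sketch. The class `TssVariation` and its methods are hypothetical, and it assumes that the offset is drawn from an exponential distribution with the given mean and subtracted from the annotated start of a forward-strand transcript; how the simulator actually applies the offset is not shown in this thread. The method `variedStart` illustrates option (i), clamping at the chromosome start.

```java
import java.util.Random;

// Hypothetical sketch, not the simulator's code: exponential TSS offsets.
class TssVariation {

    static final double DEF_TSS_MEAN = 25.0; // default value quoted above

    // Inverse-transform sample from an exponential distribution with the
    // given mean; the result is always >= 0.
    static double sampleUpstreamOffset(Random rnd, double mean) {
        return -mean * Math.log(1.0 - rnd.nextDouble());
    }

    // Probability that an exponential offset exceeds the given distance.
    static double probabilityBeyond(double distance, double mean) {
        return Math.exp(-distance / mean);
    }

    // Option (i): never let the varied start fall before position 0.
    static long variedStart(long annotatedStart, double offset) {
        return Math.max(0L, annotatedStart - Math.round(offset));
    }
}
```

With a mean of 25 and a gene 50 nt from the chromosome start, `probabilityBeyond(50, DEF_TSS_MEAN)` evaluates to exp(-2), roughly 0.135, so under these assumptions about one in seven simulated molecules of such a gene would start before position 0.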
Original comment by gmicha@gmail.com
on 31 Aug 2009 at 9:58
It is absolutely acceptable, yes. Assuming that the vast majority of loci are
sufficiently far away from the chromosome start not to cause any problem, for my
project it is even sufficient to remove all transcripts at critical loci from
the annotation.
Original comment by mun...@gmail.com
on 4 Sep 2009 at 2:28
Problem already fixed since build 20090831.
Original comment by gmicha@gmail.com
on 14 May 2010 at 6:57
Original issue reported on code.google.com by
mun...@gmail.com
on 29 Aug 2009 at 1:26