GoogleCodeExporter closed 8 years ago
The user should be able to supply multiple reference files to breseq. They may
want to do this if they have features in one file and the reference sequence in
another, or if they have different features scattered across files.
Implement and test the following cases.
A duplicate cAnnotatedSequence should not be created in any of these cases.
1) PASS: .fasta + .gff3 file. The .gff3 file should only contain features.
2) PASS with warning: .fasta + .gff3 file. The .gff3 file has a sequence
identical to the one in the .fasta.
3) PASS with warning: .gff3 + .gff3 file. Both are identical except for
mislabeled seq_ids, i.e. one has REL606 and the other REL606.1.
4) PASS with warning: .gff3 + .gff3 file. Both are identical except one file has
some additional features.
5) PASS with warning: .gff3 + .gff3 file. Both are identical except one file has
some additional features.
6) PASS with warning: .gbk + .gbk file. Both are identical except one file has
some additional features.
7) FAIL: .gff3 + .gff3 file. Both have identical features and labels but
completely different sequences.
8) It goes without saying that files with different references should each
create their own cAnnotatedSequence and proceed through breseq as usual.
I imagine a complete string comparison of the reference sequences is
expensive. Some form of random fragment testing may be a better option.
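A minimal Python sketch of what such random fragment testing could look like. This is an illustration only, not breseq code: the function name, the parameters and their defaults are all assumptions. Note that a check like this can only prove two sequences different; a small difference between sampled fragments would go unnoticed, and whether it is actually cheaper than a full comparison would need measuring.

```python
import random


def sequences_probably_equal(seq_a, seq_b, n_fragments=32, fragment_len=64, seed=0):
    """Hypothetical helper: compare two sequences by sampling random fragments.

    Returns False as soon as a difference is found. A True result only means
    that no sampled fragment differed, not that the sequences are identical.
    """
    # Sequences of different length can never be identical.
    if len(seq_a) != len(seq_b):
        return False

    length = len(seq_a)

    # Short sequences are cheap to compare in full.
    if length <= fragment_len:
        return seq_a == seq_b

    rng = random.Random(seed)  # fixed seed keeps the check reproducible
    for _ in range(n_fragments):
        start = rng.randrange(length - fragment_len + 1)
        end = start + fragment_len
        if seq_a[start:end] != seq_b[start:end]:
            return False
    return True
```

With completely different sequences (case 7) every sampled fragment differs, so the check fails on the first sample. For near-identical sequences the number and length of fragments decide how likely a difference is to be caught.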
Original comment by Geoffrey...@gmail.com
on 11 Nov 2011 at 8:07
4) PASS with warning: .gbk + .gff3 file. Both are identical except one file has
some additional features.
Original comment by Geoffrey...@gmail.com
on 11 Nov 2011 at 8:10
[deleted comment]
Let's make it simpler:
It's fine to make it give an error and exit if two different files provide
either (1) sequence OR (2) features for the same seq_id. Thus, it's ok to use a
.fasta + .gff3 if the second contains only features, but a failure if the
second contains the FASTA portion. Any combination of .gbk and .gff will cause
an error.
Original comment by jeffrey....@gmail.com
on 11 Nov 2011 at 9:16
The simple version as posted in comment 4 is now implemented. The program will
give an error if any file tries to load information into a seq_id that has
already had that type of information loaded.
This is done by checking whether the features list of a sequence has grown,
or whether the sequence itself was already of non-zero length when we tried to
load a FASTA, a GFF3 with FASTA info, or a GBK with sequence info.
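The check described above could be sketched roughly as follows. This is a simplified Python illustration, not the actual breseq implementation (which is C++): the class names, the load_file signature and the pre-parsed dictionaries standing in for file contents are all hypothetical.

```python
class DuplicateReferenceError(Exception):
    """Raised when a second file supplies data a seq_id already has."""


class AnnotatedSequence:
    """Hypothetical stand-in for cAnnotatedSequence."""

    def __init__(self, seq_id):
        self.seq_id = seq_id
        self.sequence = ""
        self.features = []


class ReferenceSet:
    def __init__(self):
        self.by_id = {}  # seq_id -> AnnotatedSequence, one record per seq_id

    def _record(self, seq_id):
        # Reuse the existing record so no duplicate is created for a seq_id.
        if seq_id not in self.by_id:
            self.by_id[seq_id] = AnnotatedSequence(seq_id)
        return self.by_id[seq_id]

    def load_file(self, file_name, sequences=None, features=None):
        """Load one (already parsed) file: dicts keyed by seq_id."""
        for seq_id, seq in (sequences or {}).items():
            record = self._record(seq_id)
            # Sequence already of non-zero length: an earlier file loaded it.
            if len(record.sequence) > 0:
                raise DuplicateReferenceError(
                    f"{file_name}: sequence for {seq_id} already loaded")
            record.sequence = seq

        for seq_id, feats in (features or {}).items():
            if not feats:
                continue
            record = self._record(seq_id)
            # Features list would grow beyond what an earlier file loaded.
            if len(record.features) > 0:
                raise DuplicateReferenceError(
                    f"{file_name}: features for {seq_id} already loaded")
            record.features.extend(feats)
```

Under these assumptions, a .fasta followed by a features-only .gff3 fills a single record, while a second file carrying sequence for the same seq_id raises an error:

```python
refs = ReferenceSet()
refs.load_file("ref.fasta", sequences={"REL606": "ACGT"})
refs.load_file("ref.gff3", features={"REL606": ["geneA"]})
# A .gff3 that also contains the FASTA portion would now fail:
# refs.load_file("other.gff3", sequences={"REL606": "ACGT"})
```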
Original comment by MDStr...@gmail.com
on 15 Nov 2011 at 9:34
Original issue reported on code.google.com by
jeffrey....@gmail.com
on 22 Aug 2011 at 2:50