shane-c-lawson / breseq

Automatically exported from code.google.com/p/breseq

Protect against common command-line errors #19

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
1) Provide feedback (and fail) if the same reference sequence is provided twice 
at the command line.

Original issue reported on code.google.com by jeffrey....@gmail.com on 22 Aug 2011 at 2:50

GoogleCodeExporter commented 8 years ago
The user should be able to supply multiple reference files to breseq. They may
want to do this if they have features in one file and a reference sequence in
another, or have different features scattered across files.

Implement and test the following cases. A duplicate cAnnotatedSequence should
not be created in any of them.

1) PASS: .fasta + .gff3 file. The .gff3 file should only contain features.
2) PASS with warning: .fasta + .gff3 file. The .gff3 file has a sequence identical to the .fasta.
3) PASS with warning: .gff3 + .gff3 file. Both identical except they have mislabeled seq_ids, i.e. one has REL606 and the other REL606.1.
4) PASS with warning: .gff3 + .gff3 file. Both identical except one file has some additional features.
5) PASS with warning: .gff3 + .gff3 file. Both identical except one file has some additional features.
6) PASS with warning: .gbk + .gbk file. Both identical except one file has some additional features.
7) FAIL: .gff3 + .gff3 file. Both have identical features and labels but completely different sequences.

8) It goes without saying that files with different references should each
create their own cAnnotatedSequence and proceed through breseq as usual.

I imagine a complete string comparison of the reference sequences is
expensive. Some form of random fragment testing may be a better option.
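
The random fragment idea could be sketched roughly as follows. This is a minimal illustration in Python, not breseq code: the function name `sequences_probably_equal` and its parameters (`n_fragments`, `fragment_len`, `seed`) are hypothetical. It compares lengths first, then a fixed number of randomly placed fragments.

```python
import random


def sequences_probably_equal(seq_a, seq_b, n_fragments=32, fragment_len=64, seed=0):
    """Heuristic check: compare fragments at random offsets instead of whole strings.

    All names and defaults here are illustrative assumptions, not breseq's API.
    """
    # Sequences of different length can never be identical.
    if len(seq_a) != len(seq_b):
        return False
    # Short sequences are cheap to compare in full.
    if len(seq_a) <= fragment_len:
        return seq_a == seq_b
    rng = random.Random(seed)  # fixed seed keeps the check reproducible
    max_start = len(seq_a) - fragment_len
    for _ in range(n_fragments):
        start = rng.randint(0, max_start)
        end = start + fragment_len
        if seq_a[start:end] != seq_b[start:end]:
            return False
    # No sampled fragment differed; the sequences may still differ elsewhere.
    return True
```

A sampled check like this would catch case 7 (completely different sequences) almost immediately, but it can miss sequences that differ by only a few bases, so it only suits situations where that risk is acceptable.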

Original comment by Geoffrey...@gmail.com on 11 Nov 2011 at 8:07

GoogleCodeExporter commented 8 years ago
4) PASS with warning: .gbk + .gff3 file. Both identical except one file has some
additional features.

Original comment by Geoffrey...@gmail.com on 11 Nov 2011 at 8:10

GoogleCodeExporter commented 8 years ago
[deleted comment]

GoogleCodeExporter commented 8 years ago
Let's make it simpler:

It's fine to make it give an error and exit if two different files provide 
either (1) sequence OR (2) features for the same seq_id. Thus, it's ok to use a 
.fasta + .gff3 if the second contains only features, but a failure if the 
second contains the FASTA portion. Any combination of .gbk and .gff will cause 
an error.

Original comment by jeffrey....@gmail.com on 11 Nov 2011 at 9:16

GoogleCodeExporter commented 8 years ago
The simple version as posted in comment 4 is now implemented. The program gives
an error if any file tries to load information into a seq_id that has already
had that type of information loaded.

This is done by checking whether the features list of a sequence has grown, or
whether the sequence itself was already of non-zero length when we tried to
load a FASTA file, a GFF3 file with FASTA info, or a GBK file with sequence info.
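
For readers following along, the rule can be sketched roughly like this. It is a simplified illustration in Python, not the actual breseq implementation: `ReferenceSet`, `DuplicateReferenceError`, and the `load` signature are hypothetical names, and the sketch checks before loading rather than detecting that a list has grown.

```python
class DuplicateReferenceError(Exception):
    """Raised when a second file supplies the same kind of data for a seq_id."""


class ReferenceSet:
    """Hypothetical container tracking what has been loaded per seq_id."""

    def __init__(self):
        # seq_id -> {"sequence": str, "features": list}
        self.records = {}

    def load(self, file_name, seq_id, sequence="", features=()):
        features = list(features)
        record = self.records.setdefault(seq_id, {"sequence": "", "features": []})
        # Reject a second source of sequence for this seq_id.
        if sequence and record["sequence"]:
            raise DuplicateReferenceError(
                f"{file_name}: sequence for '{seq_id}' was already loaded")
        # Reject a second source of features for this seq_id.
        if features and record["features"]:
            raise DuplicateReferenceError(
                f"{file_name}: features for '{seq_id}' were already loaded")
        # Both checks passed, so it is safe to store the new data.
        if sequence:
            record["sequence"] = sequence
        record["features"].extend(features)
```

Under these assumptions, a .fasta followed by a features-only .gff3 loads cleanly, while a second file carrying sequence or features for the same seq_id raises an error.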

Original comment by MDStr...@gmail.com on 15 Nov 2011 at 9:34