Suffers from bizarre bug in BioPython

In rare cases, the FASTA headers in the annotated output can lack one of the fields due to a seriously bizarre bug in BioPython's SeqIO.write() function.

This occurs if the sequence's length happens to be the same as the sequence's name (for example, a contig named 1234 that is 1234 bases long). In this case the description DiscoverY generates, which starts with the length, is misinterpreted inside SeqIO.write() as already including the sequence name. And SeqIO.write() does you the 'favor' of removing that duplication, so the header ends up with one field fewer than expected.
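To make the mechanism concrete, here is a minimal sketch of the header logic as I understand it. This is a simplified model, not BioPython's actual code: the function name `fasta_title` and the description layout (`fieldA`, `fieldB`) are hypothetical, and the only assumption is that the writer drops the name when the description's first whitespace-separated token equals it.

```python
def fasta_title(seq_id, description):
    """Simplified model of building a FASTA header from a name and a description.

    Assumption: if the description's first token equals the name, the
    description is written alone, on the theory that it already contains the name.
    """
    if description and description.split(None, 1)[0] == seq_id:
        # Description appears to start with the name, so the name is not prepended.
        return description
    if description:
        return f"{seq_id} {description}"
    return seq_id


# Hypothetical description layout: "<length> <other fields>"
normal = fasta_title("17", "1234 fieldA fieldB")
collided = fasta_title("1234", "1234 fieldA fieldB")

print(normal)    # 17 1234 fieldA fieldB
print(collided)  # 1234 fieldA fieldB
```

In the second case the header has three fields instead of four, and nothing in the output shows whether the name or the length was dropped.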

This obviously can only happen if the contig names are numbers. Unfortunately for me, the output of whatever assembler created my contigs file does use numbers for names. And one of them happened to match the sequence length.

This is a problem because I was attempting to automatically convert the annotations into a table that I could process with other tools (e.g. R). But the table can't be parsed correctly, because the affected header has one field fewer than the others, thanks to the favor BioPython has done.

The only useful workaround I can see is that users should be warned (in the README) that their sequence names shouldn't be numbers.
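Until then, users could check their own input before running DiscoverY. The following is a rough sketch, not part of DiscoverY: the function name `names_colliding_with_length` is hypothetical, and it assumes a plain FASTA file where the name is the first whitespace-separated token after `>`. It only flags names that exactly equal the sequence length, which is narrower than the "no numeric names" advice above.

```python
def names_colliding_with_length(fasta_lines):
    """Return names of sequences whose name equals their length as a decimal string."""
    collisions = []
    name, length = None, 0
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            # A new record starts: check the record that just ended.
            if name is not None and name == str(length):
                collisions.append(name)
            fields = line[1:].split()
            name = fields[0] if fields else ""
            length = 0
        elif name is not None:
            # Sequence data may span several lines.
            length += len(line)
    # Check the final record.
    if name is not None and name == str(length):
        collisions.append(name)
    return collisions


example = [">1", "ACGT", ">4", "ACGT", ">8", "ACGT", "ACGT"]
print(names_colliding_with_length(example))  # ['4', '8']
```

For a real file, the lines can be passed straight from an open file handle, e.g. `names_colliding_with_length(open(path))`, where `path` is the contigs file.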

makovalab-psu / DiscoverY

Suffers from bizarre bug in BioPython #8