makovalab-psu / DiscoverY

K-mer based classifier for Y-contig identification from Whole Genome Assemblies
MIT License
11 stars 5 forks source link

Suffers from bizarre bug in BioPython #8

Open rsharris opened 4 years ago

rsharris commented 4 years ago

In rare cases, the fasta headers in the annotated output can lack one of the fields due to a seriously bizarre bug in BioPython's SeqIO.write() function.

This occurs if the sequence's length happens to be the same as the sequence's name. In this case the description DiscoverY generates, which starts with the length, is mis-interpreted inside SeqIO.write() as including the sequence name. And SeqIO.write() does you the 'favor' of removing that duplication.

This obviously can only happen if the contig names are numbers. Unfortunately for me the output of whatever assembler create my contigs file does use numbers for names. And one of them happened to match the sequence length.

Why this is a problem is I was attempting to automatically convert the annotations into a table that I could process with other tools (e.g. R). But the table can't be correctly parsed due to the favor BioPython has done.

The only useful workaround I can see is that users should be warned (in the README) that their sequence names shouldn't be numbers.

rsharris commented 4 years ago

There is a workaround for this, which discoverY (and anything else using BioPython to write fasta) should employ. record.id should be included in description, like this: description=id=record.id + " " + str(length) + " " ...

See https://github.com/biopython/biopython/issues/2270 for some discussion.