iTaxoTools / TaxI2-legacy

Calculates genetic differences between DNA sequences
GNU General Public License v3.0

Make Taxi2 more robust against deviations in input format #21

Closed mvences closed 3 years ago

mvences commented 3 years ago

I have run some tests of Taxi2, and the program seems to perform almost perfectly. I have only a few minor suggestions on how to change the output files, mainly changing or adding some text for better explanation. I will also open one further issue regarding additional variants of the graphs.

However, the main point that should be improved is making the program more robust against differently shaped input files. Specifically, I tried to run it with an input file downloaded from GenBank, and needed to do quite a few manual transformations and corrections before it would run.

I attach a ZIP file with several files that yielded an error message and no results (two in tab format, and one in GenBank format). I also attach the modified file that ran smoothly in the end.

Since I am not entirely sure at which point the problem with the files occurred, I recapitulate some important points here:

Taxi2test_notworking.zip Taxi2test3_working.tab.txt

necrosovereign commented 3 years ago

I added an instruction to drop duplicate rows and it seems to have fixed the problem.
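
For readers following along, here is a minimal sketch of what dropping duplicate rows can look like. It uses only the Python standard library and is not the actual Taxi2 code; the function name `drop_duplicate_rows` and the sample records are hypothetical.

```python
import csv
import io


def drop_duplicate_rows(rows):
    """Return rows with exact duplicates removed, keeping the first occurrence."""
    seen = set()
    unique = []
    for row in rows:
        # A dict is not hashable, so use a sorted tuple of its items as the key.
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical tab-separated input in which the first record appears twice.
SAMPLE = (
    "seqid\tspecies\tsequence\n"
    "s1\tSpecies a\tACGT\n"
    "s2\tSpecies b\tACGA\n"
    "s1\tSpecies a\tACGT\n"
)

rows = list(csv.DictReader(io.StringIO(SAMPLE), delimiter="\t"))
unique_rows = drop_duplicate_rows(rows)
```

Only rows that match in every column are treated as duplicates here; two records sharing a 'seqid' but differing in sequence would both be kept.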

The program should have no problem dealing with long names and unusual characters.

Currently, while loading, the program converts column names to lowercase, renames 'organism' to 'species', and loads the columns 'seqid', 'specimen_voucher', 'species' and 'sequence'. I will add renaming of 'specimen_identifier' to 'specimen_voucher', and of column names containing 'sequence' to 'sequence'.
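
The column handling described above could be sketched roughly as follows. This is an illustration with the standard library, not the program's real loader: `normalize_column`, `load_table`, and the sample header are hypothetical, and treating any name that contains 'sequence' as the sequence column is my reading of the rule.

```python
import csv
import io

# Columns the loader keeps, as listed in the comment above.
REQUIRED = ("seqid", "specimen_voucher", "species", "sequence")


def normalize_column(name):
    """Lowercase a column name and map known variants to the expected name."""
    name = name.lower()
    if name == "organism":
        return "species"
    if name == "specimen_identifier":
        return "specimen_voucher"
    if "sequence" in name:  # assumption: any name containing 'sequence'
        return "sequence"
    return name


def load_table(text):
    """Read tab-separated text and keep only the required columns."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = [normalize_column(col) for col in next(reader)]
    missing = [col for col in REQUIRED if col not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    # If several columns map to the same name, the first one wins.
    index = {col: header.index(col) for col in REQUIRED}
    return [{col: row[i] for col, i in index.items()} for row in reader if row]


# Hypothetical input with differently named columns and one extra column.
SAMPLE = (
    "SeqID\tSpecimen_Identifier\tOrganism\tNucleotide_Sequence\tCountry\n"
    "s1\tV001\tSpecies a\tACGT\tMadagascar\n"
)

table = load_table(SAMPLE)
```

Raising an error that names the missing columns, rather than failing later in the analysis, would also make it easier to see which part of an input file needs fixing.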