Closed mvences closed 3 years ago
I added an instruction to drop duplicate rows and it seems to have fixed the problem.
The program should have no problem dealing with long names and unusual characters.
Currently, while loading, the program converts column names to lowercase, renames 'organism' to 'species', and loads the columns 'seqid', 'specimen_voucher', 'species' and 'sequence'. I will add two more renames: 'specimen_identifier' to 'specimen_voucher', and any column whose name contains 'sequence' to 'sequence'.
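The loading rules described above can be sketched roughly as follows. This is only an illustration using the standard library, not the actual Taxi2 code: the names `ALIASES`, `WANTED`, `normalize_column` and `load_tab` are hypothetical, and the choice to keep the first matching column when several match is an assumption.

```python
import csv
import io

# Hypothetical alias table: only the renames mentioned in this thread.
ALIASES = {"organism": "species", "specimen_identifier": "specimen_voucher"}
# The four columns the loader keeps; everything else is ignored.
WANTED = ("seqid", "specimen_voucher", "species", "sequence")


def normalize_column(name):
    """Lowercase a header, apply aliases, and map '... sequence ...' to 'sequence'."""
    key = name.strip().lower()
    key = ALIASES.get(key, key)
    if "sequence" in key:
        key = "sequence"
    return key


def load_tab(text):
    """Read tab-separated text, keep only the wanted columns, drop duplicate rows."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = [normalize_column(h) for h in next(reader)]
    # Assumption: if several columns normalize to the same name, the first wins.
    index = {}
    for i, name in enumerate(header):
        if name in WANTED and name not in index:
            index[name] = i
    rows, seen = [], set()
    for fields in reader:
        if not fields:  # skip blank lines
            continue
        row = tuple(
            fields[index[c]] if c in index and index[c] < len(fields) else ""
            for c in WANTED
        )
        if row not in seen:  # exact duplicates are dropped
            seen.add(row)
            rows.append(dict(zip(WANTED, row)))
    return rows
```

With a header such as `SeqID`, `Organism`, `Country`, `16S Sequence`, `Specimen_Identifier`, this sketch would ignore `Country` and map the remaining four columns to the expected names.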
I have run some tests of Taxi2, and the program seems to perform almost perfectly. I have only a few minor suggestions for the output files, mainly changing or adding some text for better explanation. I will also open one further issue regarding additional variants of the graphs.
However, the main point that should be improved is making the program more robust against differently shaped input files. Specifically, I tried to run it with an input file downloaded from Genbank, and needed to do quite a few manual transformations and corrections before it would run.
I attach a ZIP file containing several files that yielded an error message and no results (two in tab format, and one in Genbank format). I also attach the modified file that worked smoothly in the end.
Since I am not totally sure at which point the problem with the files occurred, here is a summary of the important points:
In a tab file, the program should simply ignore all columns/fields that are not needed or recognized, whether they come before or after the "sequence" field. It should just look for "seqid", "species" or "organism", "specimen_identifier" and "sequence", and ignore everything else. (Probably it is already doing this?)
Likewise, when dealing with a Genbank file, the program should extract only the information from these fields and ignore everything else. (Probably it is already doing this?)
Similar to "DNAconvert", it would be good to allow (in a tab file) not only different combinations of upper and lower case but also some variation in column names. For instance, if there is no column "sequence" but there is a column "16S sequence", the program should interpret this column as being equivalent to "sequence" and try to work with it.
The main point, which is probably what is causing the problems here: as you will see, in the problematic tab files the "seqid" column contains extremely long and complex names, and once I replaced these with simple names such as "specimen1" it worked fine. So what is probably needed is an autocorrect routine that simplifies the "seqID" by replacing all problematic characters, cutting it to a reasonable length of maybe 30 characters at most, and making the resulting seqids unique by adding a number at the end, similar to what happens on conversion to Phylip in DNAconvert.
Finally, I am not sure what the problems with the Genbank file are, but maybe here too the program will be able to process the file once the "seqid" is autocorrected and shortened.
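To make the suggested autocorrect routine concrete, here is a minimal sketch under my own assumptions; it is not how Taxi2 or DNAconvert actually implement it. The function name `simplify_seqids`, the set of characters treated as safe (letters, digits, underscore), the `seq` fallback for empty names and the `_1`, `_2` suffix format are all hypothetical choices. Only the 30-character limit comes from the suggestion above.

```python
import re


def simplify_seqids(seqids, max_len=30):
    """Return simplified, unique identifiers of at most max_len characters."""
    used = set()
    result = []
    for raw in seqids:
        # Assumption: anything other than letters, digits and '_' is a
        # "problem character"; each run of them becomes a single underscore.
        base = re.sub(r"[^A-Za-z0-9_]+", "_", raw).strip("_") or "seq"
        base = base[:max_len]
        candidate = base
        n = 1
        # On a collision, append a running number, shortening the base
        # so that the result still fits within max_len.
        while candidate in used:
            suffix = f"_{n}"
            candidate = base[: max_len - len(suffix)] + suffix
            n += 1
        used.add(candidate)
        result.append(candidate)
    return result
```

The numeric suffix is only added when two simplified names collide, so identifiers that are already short and unique pass through unchanged. In practice the mapping from original to simplified names would probably need to be kept, so that the output files can still be traced back to the input.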
Taxi2test_notworking.zip Taxi2test3_working.tab.txt