sanger-pathogens / seroba

k-mer based Pipeline to identify the Serotype from Illumina NGS reads
https://sanger-pathogens.github.io/seroba/

The "cd_cluster.tsv" created recently is different from the previous one #47

Closed abcdtree closed 4 years ago

abcdtree commented 4 years ago

cd_cluster_old.txt cd_cluster_new.txt

cd_cluster_old.txt is the previous file, and cd_cluster_new.txt is the new cd_cluster.tsv that the program created when I built another copy of the database on a different device. Could this be due to the newer version of KMC or Python 3 that I used on the new device?

The new cd_cluster.tsv seems to have some problems, and it causes SeroBA serotyping to end with errors for some serotypes. Please let me know whether you can reproduce the problem and whether there is a solution.

Josh

eppinglen commented 4 years ago

Hi Josh,

Thank you for reporting this problem. Do you know which serotypes are involved?

Best, Lennard

abcdtree commented 4 years ago

Hi Lennard,

I compared the two versions of the cd_cluster file.

New serotypes: ['35D', '39X', 'alternative_aliB_NT', 'Swiss_NT', '10X', '11E', '06G', '06F']

Old serotypes no longer covered: ['07F', '17F', '24B', '24F', '22F', '07A', '41F', '22A', '15F', '31', '23B1', '18F', '19C', '18B', '17A', '18C', '45', '16A']

I found this error when I ran an analysis on an isolate with serotype "17F". It worked well with the previous version but raised an error with the new version.
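For anyone who wants to repeat this kind of comparison, here is a minimal sketch. It assumes the cluster file is plain tab-separated text in which every non-empty field is a serotype or cluster name; that layout is an assumption and not confirmed in this thread. The function names `read_names` and `compare_clusters` are made up for illustration.

```python
import csv
import sys


def read_names(path):
    """Collect every non-empty tab-separated field of a file into a set.

    Assumption: each field in the file is a serotype or cluster name.
    """
    names = set()
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            names.update(field.strip() for field in row if field.strip())
    return names


def compare_clusters(old_path, new_path):
    """Return (names only in the new file, names only in the old file)."""
    old = read_names(old_path)
    new = read_names(new_path)
    return sorted(new - old), sorted(old - new)


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python compare_clusters.py cd_cluster_old.txt cd_cluster_new.txt
    only_new, only_old = compare_clusters(sys.argv[1], sys.argv[2])
    print("Only in new file:", only_new)
    print("Only in old file:", only_old)
```

Using sets means the order of lines and the number of times a name appears do not affect the result; only presence or absence is compared.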

Josh

eppinglen commented 4 years ago

Hi Josh, I have been able to reproduce the error. It is related to new default parameter settings in newer versions of ARIBA. ARIBA now limits the length of non-coding sequences to 20 kb by default, so some serotypes were discarded during the database build process. I have adapted these settings for SeroBA, and I hope it will now work for you. For more information, please have a look at #49.
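To check which reference sequences a 20 kb length limit would affect, something along these lines may help. It is only a sketch: it does not call ARIBA or SeroBA, the FASTA path is whatever file you pass in, and the helper names `fasta_lengths` and `too_long` are hypothetical. It simply lists records longer than the threshold.

```python
import sys

MAX_LENGTH = 20_000  # the 20 kb limit mentioned above


def fasta_lengths(path):
    """Yield (record_name, sequence_length) for each record in a FASTA file."""
    name, length = None, 0
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    yield name, length
                header = line[1:].split()
                name, length = (header[0] if header else ""), 0
            elif name is not None:
                length += len(line)
    if name is not None:
        yield name, length


def too_long(path, max_length=MAX_LENGTH):
    """Return the records whose sequence is longer than max_length."""
    return [(n, l) for n, l in fasta_lengths(path) if l > max_length]


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage: python check_lengths.py references.fasta
    for record, size in too_long(sys.argv[1]):
        print(f"{record}\t{size}")
```

Any record printed here is longer than 20 kb and so is a candidate for being dropped by a 20 kb length filter.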

Best, Lennard

abcdtree commented 4 years ago

Thanks, Lennard. I will try the new version.

Josh