katholt / srst2

Short Read Sequence Typing for Bacterial Pathogens

EcOH.fasta has 47 duplicate sequences & some inconsistencies #64

Closed tseemann closed 8 years ago

tseemann commented 8 years ago

I ran this command to check the database:

cd-hit-est -d 0 -i EcOH.fasta -c 1 -g 1 -o cdhit

and discovered 47 duplicate sequences (plus some sequences that are exact subsequences of others), including some that are assigned conflicting O/H types. Here are some examples:

0       1251nt, >9__wzy__wzy-O123-Gp5__370... *
1       1251nt, >9__wzy__wzy-O123-Gp5__371... at +/100.00%
2       1251nt, >9__wzy__wzy-O123-Gp5__372... at +/100.00%
3       1251nt, >9__wzy__wzy-O186-Gp5__454... at +/100.00%

0       1230nt, >8__wzx__wzx-O153-Gp11__191... *
1       1230nt, >8__wzx__wzx-O178-Gp11__223... at +/100.00%

0       1071nt, >8__wzx__wzx-O28ac-Gp2__250... at +/100.00%
1       1071nt, >8__wzx__wzx-O42-Gp2__266... at +/100.00%
2       1230nt, >8__wzx__wzx-O42-Gp2__267... *

0       1221nt, >8__wzx__wzx-O118-Gp3__140... *
1       1221nt, >8__wzx__wzx-O118-Gp3__141... at +/100.00%
2       1221nt, >8__wzx__wzx-O151-Gp3__189... at +/100.00%

0       1149nt, >9__wzy__wzy-O129-Gp10__383... *
1       1149nt, >9__wzy__wzy-O13-Gp10__384... at +/100.00%
2       1149nt, >9__wzy__wzy-O135-Gp10__390... at +/100.00%
3       1149nt, >9__wzy__wzy-O135-Gp10__391... at +/100.00%

0       1053nt, >9__wzy__wzy-O111__351... *
1       1053nt, >9__wzy__wzy-O7__524... at +/100.00%

0       1074nt, >9__wzy__wzy-O153-Gp11__412... at +/100.00%
1       1098nt, >9__wzy__wzy-O178-Gp11__444... *

0       1380nt, >9__wzy__wzy-O28ac-Gp2__471... *
1       1380nt, >9__wzy__wzy-O42-Gp2__487... at +/100.00%
2       1380nt, >9__wzy__wzy-O42-Gp2__488... at +/100.00%

0       1332nt, >9__wzy__wzy-O17-Gp9__434... *
1       1332nt, >9__wzy__wzy-O44-Gp9__490... at +/100.00%
2       1332nt, >9__wzy__wzy-O77-Gp9__531... at +/100.00%

0       1191nt, >9__wzy__wzy-O18-Gp12__446... *
1       1146nt, >9__wzy__wzy-O18ac__457... at +/100.00%
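
For anyone wanting to re-check the database without cd-hit, the exact-duplicate part of this check can be sketched in a few lines of Python. This is only a rough illustration, not what srst2 or cd-hit-est does internally. It assumes the headers follow the `cluster__gene__allele__seqID` layout seen in the listing above, and the function names (`read_fasta`, `duplicate_groups`, `allele_name`, `conflicting`) are made up for this example. It treats a sequence and its reverse complement as the same, and it only finds full-length identical sequences, so the subsequence hits that cd-hit-est reports (e.g. the 1071nt entries inside a 1230nt cluster) would not show up here.

```python
import sys
from collections import defaultdict

# Translation table for complementing unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            parts = line[1:].split()
            # Keep only the ID (text up to the first whitespace).
            header, chunks = (parts[0] if parts else ""), []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def canonical(seq):
    """Return a strand-independent key: the smaller of seq and its reverse complement."""
    seq = seq.upper()
    revcomp = seq.translate(_COMPLEMENT)[::-1]
    return min(seq, revcomp)


def duplicate_groups(records):
    """Group headers whose sequences are identical (on either strand)."""
    groups = defaultdict(list)
    for header, seq in records:
        groups[canonical(seq)].append(header)
    return [headers for headers in groups.values() if len(headers) > 1]


def allele_name(header):
    """Assumed layout: cluster__gene__allele__seqID; return the allele field."""
    parts = header.split("__")
    return parts[2] if len(parts) >= 3 else header


def conflicting(group):
    """True if identical sequences carry different allele names."""
    return len({allele_name(h) for h in group}) > 1


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as handle:
        for group in duplicate_groups(read_fasta(handle)):
            flag = "CONFLICT" if conflicting(group) else "same allele"
            print(flag, *group, sep="\t")
```

Saved under a name of your choice and run with the path to EcOH.fasta as its only argument, it prints one line per group of identical sequences, marked CONFLICT when the allele names within the group differ.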

rrwick commented 8 years ago

Done! On the master branch, anyway... a new release with the fix included should be coming soon.