phac-nml / sistr_cmd

SISTR (Salmonella In Silico Typing Resource) command-line tool
Apache License 2.0

Duplicated records in fliC DB. #31

Closed: andersgs closed this issue 4 years ago

andersgs commented 6 years ago

Hello.

I have a sample that I expected to be IV_44:z4z32, but sistr identified it as z4,z23,z32. When I checked the fliC.fasta DB, I noticed two entries with identical sequences, one labeled z4,z23,z32 and the other labeled IV_44:z4z32.
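For readers who want to reproduce this kind of check by hand, here is a minimal sketch in Python. It is not how db-check or sistr works internally; it only groups exactly identical sequences and flags those carrying more than one label. It assumes the label is the last `|`-separated field of each FASTA header (inferred from the IDs in the report below), and the sequences in `toy` are made-up placeholders, not real fliC alleles.

```python
from collections import defaultdict
from io import StringIO


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Normalise case so identical sequences compare equal.
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def conflicting_labels(handle):
    """Return {sequence: [headers]} for sequences with more than one distinct label."""
    by_seq = defaultdict(list)
    for header, seq in read_fasta(handle):
        by_seq[seq].append(header)
    conflicts = {}
    for seq, headers in by_seq.items():
        # Assumption: the label is the last '|'-separated field of the header.
        labels = {h.rsplit("|", 1)[-1] for h in headers}
        if len(labels) > 1:
            conflicts[seq] = headers
    return conflicts


# Toy input: real header names from the report, placeholder sequences.
toy = StringIO(
    ">47|z4,z23,z32\n"
    "ACGTACGT\n"
    ">AY353506.1|IV_44:z4z32:-|z4,z32\n"
    "ACGTACGT\n"
    ">54|k\n"
    "TTTTGGGG\n"
)
result = conflicting_labels(toy)
for headers in result.values():
    print("identical sequence, different labels:", headers)
```

To run it against a real database, pass an open file instead of `toy`, for example `conflicting_labels(open("fliC.fasta"))`. Exact matching will miss near-identical or partially overlapping alleles, which a clustering-based tool can catch.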

While examining the DB, I found a few other potential issues. The full report is below, with the cluster of sequences mentioned above detailed at the bottom.

Cheers. A.

db-check Report fliC

By agoncalves on 2018-11-22


Summary

Possible issues.

Breakdown


Cluster size distribution

Table: Distribution of cluster sizes (i.e., number of sequences).

count mean std min 25% 50% 75% max
622 1.15273 0.688636 1 1 1 1 9

Summary of duplicated IDs

Duplicate ID 11-2580|c

Table: Sequences with ID 11-2580|c

clusterid seqid length is_centroid match category
32 11-2580|c 1602 True 1 c
32 11-2580|c 1506 False 1 c
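A duplicated ID like the one above can be spotted with a simple count over the sequence IDs. The snippet below is only an illustrative sketch, not db-check's implementation; the list `seq_ids` is a hand-typed sample using IDs that appear in this report.

```python
from collections import Counter

# Sample of sequence IDs as they would appear in FASTA headers.
seq_ids = ["11-2580|c", "11-2580|c", "54|k"]

# Keep every ID that occurs more than once.
duplicates = sorted(i for i, n in Counter(seq_ids).items() if n > 1)
print(duplicates)
```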

Category report

Distribution of categories by cluster of sequences.

Table: Distribution of categories by cluster of sequences.

count mean std min 25% 50% 75% max
622 1.01125 0.119858 1 1 1 1 3

Clusters with more than one category

Cluster 6

Title: Sequences in cluster 6

clusterid seqid length is_centroid match category
6 11|[f],g,[t] 1527 False 1 [f],g,[t]
6 355|[f],g,[t] 1527 False 1 [f],g,[t]
6 356|[f],g,[t] 1527 False 1 [f],g,[t]
6 386|f,g,t 1876 True 1 f,g,t

Cluster 24

Title: Sequences in cluster 24

clusterid seqid length is_centroid match category
24 147|g,m 1518 False 1 g,m
24 178|g,m,s 1518 False 1 g,m,s
24 211|g,m 1518 False 1 g,m
24 212|g,m 1518 False 1 g,m
24 213|g,m 1518 False 1 g,m
24 214|g,m 1518 False 1 g,m
24 388|g,m 1867 True 1 g,m
24 390|- 1518 False 1 -
24 393|- 1518 False 1 -

Cluster 151

Title: Sequences in cluster 151

clusterid seqid length is_centroid match category
151 143|f,g,s 1518 True 1 f,g,s
151 208|f,g,s 1518 False 1 f,g,s
151 209|f,g,s 1518 False 1 f,g,s
151 210|[f],g,[t] 1518 False 1 [f],g,[t]

Cluster 163

Title: Sequences in cluster 163

clusterid seqid length is_centroid match category
163 217|g,m,q 1518 True 1 g,m,q
163 228|g,q 1518 False 1 g,q

Cluster 475

Title: Sequences in cluster 475

clusterid seqid length is_centroid match category
475 54|k 1485 True 1 k
475 262|(k) 1485 False 1 (k)
475 363|(k) 1485 False 1 (k)
475 365|(k) 1485 False 1 (k)

Cluster 582

Title: Sequences in cluster 582

clusterid seqid length is_centroid match category
582 47|z4,z23,z32 1269 True 1 z4,z23,z32
582 AY353506.1|IV_44:z4z32:-|z4,z32 1269 False 1 z4,z32
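The "clusters with more than one category" listing can be derived from a per-sequence table by collecting the set of categories seen in each cluster. The following is a rough sketch under that assumption, not the code db-check actually runs; `rows` holds a few (clusterid, seqid, category) triples copied from the tables above, and `mixed_category_clusters` is a hypothetical helper name.

```python
from collections import defaultdict

# (clusterid, seqid, category) triples taken from the report tables.
rows = [
    (32, "11-2580|c", "c"),
    (32, "11-2580|c", "c"),
    (163, "217|g,m,q", "g,m,q"),
    (163, "228|g,q", "g,q"),
    (582, "47|z4,z23,z32", "z4,z23,z32"),
    (582, "AY353506.1|IV_44:z4z32:-|z4,z32", "z4,z32"),
]


def mixed_category_clusters(rows):
    """Return {clusterid: sorted categories} for clusters with more than one category."""
    categories = defaultdict(set)
    for cluster_id, _seqid, category in rows:
        categories[cluster_id].add(category)
    return {cid: sorted(cats) for cid, cats in categories.items() if len(cats) > 1}


mixed = mixed_category_clusters(rows)
for cid, cats in mixed.items():
    print(f"cluster {cid}: {cats}")
```

Cluster 32 is not reported here because both of its entries share the category c; it only shows up in the duplicated-ID check.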

Generated using db-check v0.1.4

db-check is on GitHub. Please submit issues

jrober84 commented 4 years ago

Thanks, @andersgs, for filing the issue. I used https://github.com/andersgs/db-check to clean up the allele database; it is a great tool! There are overlapping alleles in the database, but they all belong to a single type.