lskatz commented 2 years ago

Coming off of #11, I expanded the database 10x to see if I could create any collisions by accident. I did this by introducing 1 SNP ten different times per allele. I also kept the original allele. Any identical sequence that was found was disregarded since I am looking at collisions between different sequences.

for i in ~/GWA/projects/validation/mlstComparison/MLST.db/*.chewbbaca; do
  seqtk seq -l 0 <(cat $i/*.fasta) | \
  paste - - | \
  perl -MDigest::MD5=md5_base64 -MDigest::SHA=sha1_base64,sha256_base64 -MString::CRC32=crc32 -lane '
    BEGIN{@NT=qw(A C G T);} 
    next if($seen{$F[1]}++); 
    # Do this 11 times, but one of them will have no mutation
    for my $i(1..11){ 
      $seq=$F[1]; 
      # Don't even bother mutating this allele if we've seen it before
      next if($seen{$seq}++); 
      if($i != 1) {
        substr($seq,rand(length($seq)),1) = $NT[rand(4)];
      } 
      # Don't bother using this mutation of an allele if we've seen it before
      next if($seen{$seq}++); 
      # Print the ID, sequence, hashsums
      print join("\t", $F[0], $seq, md5_base64($seq), sha1_base64($seq), sha256_base64($seq), crc32($seq));
    }
  ' | gzip -c > mutations/$(basename $i .chewbbaca).hashes.tsv.gz & done

lskatz commented 2 years ago

Testing for duplicates in rows 2 through 6:

seqid
sequence
md5
sha1
sha256
crc32

for file in *.tsv.gz; do 
  for i in `seq 2 6`; do 
  # parallelize looking at each column of a single file
  (
    # get the count of times a column has duplications
    count=$(zcat $file | cut -f $i | sort --parallel 24 | uniq -d | wc -l); 
    echo -e "$file $i: $count"
  ) & 
  done; 
  wait; 
done;

lskatz commented 2 years ago

With the A. baumannii scheme, CRC32 has collisions but other hashsum algorithms do not:

Acinetobacter_baumannii.hashes.tsv.gz 3: 0
Acinetobacter_baumannii.hashes.tsv.gz 4: 0
Acinetobacter_baumannii.hashes.tsv.gz 6: 3200
Acinetobacter_baumannii.hashes.tsv.gz 5: 0
Acinetobacter_baumannii.hashes.tsv.gz 2: 0

lskatz commented 2 years ago

Collision with CRC32 again in this scheme

Arcobacter_butzleri.hashes.tsv.gz 6: 44
Arcobacter_butzleri.hashes.tsv.gz 3: 0
Arcobacter_butzleri.hashes.tsv.gz 4: 0
Arcobacter_butzleri.hashes.tsv.gz 5: 0
Arcobacter_butzleri.hashes.tsv.gz 2: 0

This scheme is unscathed

Brucella_melitensis.hashes.tsv.gz 6: 0
Brucella_melitensis.hashes.tsv.gz 3: 0
Brucella_melitensis.hashes.tsv.gz 4: 0
Brucella_melitensis.hashes.tsv.gz 5: 0
Brucella_melitensis.hashes.tsv.gz 2: 0

lskatz commented 2 years ago

Campy has CRC32 collisions

Campylobacter_jejuni.hashes.tsv.gz 6: 589
Campylobacter_jejuni.hashes.tsv.gz 3: 0
Campylobacter_jejuni.hashes.tsv.gz 4: 0
Campylobacter_jejuni.hashes.tsv.gz 5: 0
Campylobacter_jejuni.hashes.tsv.gz 2: 0

lskatz commented 2 years ago

Escherichia has CRC32 collisions

Escherichia_coli.hashes.tsv.gz 3: 0
Escherichia_coli.hashes.tsv.gz 4: 0
Escherichia_coli.hashes.tsv.gz 6: 57161
Escherichia_coli.hashes.tsv.gz 5: 0
Escherichia_coli.hashes.tsv.gz 2: 0

lskatz commented 2 years ago

Listeria has CRC32 collisions

Listeria_monocytogenes.hashes.tsv.gz 3: 0
Listeria_monocytogenes.hashes.tsv.gz 6: 888
Listeria_monocytogenes.hashes.tsv.gz 4: 0
Listeria_monocytogenes.hashes.tsv.gz 5: 0
Listeria_monocytogenes.hashes.tsv.gz 2: 0

lskatz commented 2 years ago

The rest also seem to have an issue with CRC32 but not other hashing algorithms

Salmonella_enterica.hashes.tsv.gz 3: 0
Salmonella_enterica.hashes.tsv.gz 6: 54292
Salmonella_enterica.hashes.tsv.gz 4: 0
Salmonella_enterica.hashes.tsv.gz 5: 0
Salmonella_enterica.hashes.tsv.gz 2: 0

Streptococcus_agalactiae.hashes.tsv.gz 3: 0
Streptococcus_agalactiae.hashes.tsv.gz 6: 385
Streptococcus_agalactiae.hashes.tsv.gz 4: 0
Streptococcus_agalactiae.hashes.tsv.gz 5: 0
Streptococcus_agalactiae.hashes.tsv.gz 2: 0

Streptococcus_pyogenes.hashes.tsv.gz 3: 0
Streptococcus_pyogenes.hashes.tsv.gz 6: 892
Streptococcus_pyogenes.hashes.tsv.gz 4: 0
Streptococcus_pyogenes.hashes.tsv.gz 5: 0
Streptococcus_pyogenes.hashes.tsv.gz 2: 0

Yersinia_enterocolitica.hashes.tsv.gz 6: 138
Yersinia_enterocolitica.hashes.tsv.gz 3: 0
Yersinia_enterocolitica.hashes.tsv.gz 4: 0
Yersinia_enterocolitica.hashes.tsv.gz 5: 0
Yersinia_enterocolitica.hashes.tsv.gz 2: 0

lskatz commented 2 years ago

These are the numbers of alleles per schema after mutating each allele 10x:

$ for i in *.tsv.gz; do echo -ne "$i\t"; zcat $i | wc -l; done;
Acinetobacter_baumannii.hashes.tsv.gz   5244059
Arcobacter_butzleri.hashes.tsv.gz       613209
Brucella_melitensis.hashes.tsv.gz       83958
Campylobacter_jejuni.hashes.tsv.gz      2273102
Escherichia_coli.hashes.tsv.gz  22182678
Listeria_monocytogenes.hashes.tsv.gz    2797807
Salmonella_enterica.hashes.tsv.gz       21647330
Streptococcus_agalactiae.hashes.tsv.gz  1829511
Streptococcus_pyogenes.hashes.tsv.gz    2776379
Yersinia_enterocolitica.hashes.tsv.gz   1155678

E.g., 22M alleles across all loci in the E. coli scheme

lskatz commented 2 years ago

To answer how many alleles there are per locus in each "10x" database with mutations, I made a histogram in some of the larger databases. I am also reading from a file .noseqs.tsv.gz to indicate I replaced the actual allele sequence with - to reduce run times and because I already demonstrated that all alleles are unique in the files.

$ zcat Listeria_monocytogenes.noseqs.tsv.gz | cut -f 1 | \
  perl -lane '
    # grab $locus as everything up until the last underscore
    ($locus, $allele) = split(/_([^_]+)$/, $F[0]); 
    # Ignore custom loci that have shovil or spades or .fasta in their names
    next if($locus =~ /shovill|spades|\.fasta/i); 
    # count the loci in the %locus hash
    $locus{$locus}++; 
    # last step: print locus \t count and round to the nearest 1000
    END{ while(my($locus, $count) = each(%locus)){ $count=int($count/1000)*1000; print "$locus\t$count"; } };
  ' | \
  sort -k2,2n | \
  cut -f 2 | \
  uniq -dc

lskatz commented 2 years ago

I went ahead and ran it for all schemes

Escherichia

Salmonella

Acinobacter

Streptococcus

Campylobacter

Yersinia

   6342 0

Arcobacter

   7453 0

Brucella

   2762 0

lskatz / mlst-hash-template

Testing for even more collisions #13

Escherichia

Salmonella

Acinobacter

Streptococcus

Campylobacter

Yersinia

Arcobacter

Brucella