lskatz / mlst-hash-template

This is a template for any new hash-based MLST database
GNU General Public License v3.0
5 stars 2 forks source link

Testing for even more collisions #13

Closed lskatz closed 2 years ago

lskatz commented 2 years ago

Coming off of #11, I expanded the database 10x to see if I could create any collisions by accident. I did this by introducing 1 SNP ten different times per allele. I also kept the original allele. Any identical sequence that was found was disregarded since I am looking at collisions between different sequences.

for i in ~/GWA/projects/validation/mlstComparison/MLST.db/*.chewbbaca; do
  seqtk seq -l 0 <(cat $i/*.fasta) | \
  paste - - | \
  perl -MDigest::MD5=md5_base64 -MDigest::SHA=sha1_base64,sha256_base64 -MString::CRC32=crc32 -lane '
    BEGIN{@NT=qw(A C G T);} 
    next if($seen{$F[1]}++); 
    # Do this 11 times, but one of them will have no mutation
    for my $i(1..11){ 
      $seq=$F[1]; 
      # Don't even bother mutating this allele if we've seen it before
      next if($seen{$seq}++); 
      if($i != 1) {
        substr($seq,rand(length($seq)),1) = $NT[rand(4)];
      } 
      # Don't bother using this mutation of an allele if we've seen it before
      next if($seen{$seq}++); 
      # Print the ID, sequence, hashsums
      print join("\t", $F[0], $seq, md5_base64($seq), sha1_base64($seq), sha256_base64($seq), crc32($seq));
    }
  ' | gzip -c > mutations/$(basename $i .chewbbaca).hashes.tsv.gz & done
lskatz commented 2 years ago

Testing for duplicates in rows 2 through 6:

  1. seqid
  2. sequence
  3. md5
  4. sha1
  5. sha256
  6. crc32
for file in *.tsv.gz; do 
  for i in `seq 2 6`; do 
  # parallelize looking at each column of a single file
  (
    # get the count of times a column has duplications
    count=$(zcat $file | cut -f $i | sort --parallel 24 | uniq -d | wc -l); 
    echo -e "$file $i: $count"
  ) & 
  done; 
  wait; 
done;
lskatz commented 2 years ago

With the A. baumannii scheme, CRC32 has collisions but other hashsum algorithms do not:

Acinetobacter_baumannii.hashes.tsv.gz 3: 0
Acinetobacter_baumannii.hashes.tsv.gz 4: 0
Acinetobacter_baumannii.hashes.tsv.gz 6: 3200
Acinetobacter_baumannii.hashes.tsv.gz 5: 0
Acinetobacter_baumannii.hashes.tsv.gz 2: 0
lskatz commented 2 years ago

Collision with CRC32 again in this scheme

Arcobacter_butzleri.hashes.tsv.gz 6: 44
Arcobacter_butzleri.hashes.tsv.gz 3: 0
Arcobacter_butzleri.hashes.tsv.gz 4: 0
Arcobacter_butzleri.hashes.tsv.gz 5: 0
Arcobacter_butzleri.hashes.tsv.gz 2: 0

This scheme is unscathed

Brucella_melitensis.hashes.tsv.gz 6: 0
Brucella_melitensis.hashes.tsv.gz 3: 0
Brucella_melitensis.hashes.tsv.gz 4: 0
Brucella_melitensis.hashes.tsv.gz 5: 0
Brucella_melitensis.hashes.tsv.gz 2: 0
lskatz commented 2 years ago

Campy has CRC32 collisions

Campylobacter_jejuni.hashes.tsv.gz 6: 589
Campylobacter_jejuni.hashes.tsv.gz 3: 0
Campylobacter_jejuni.hashes.tsv.gz 4: 0
Campylobacter_jejuni.hashes.tsv.gz 5: 0
Campylobacter_jejuni.hashes.tsv.gz 2: 0
lskatz commented 2 years ago

Escherichia has CRC32 collisions

Escherichia_coli.hashes.tsv.gz 3: 0
Escherichia_coli.hashes.tsv.gz 4: 0
Escherichia_coli.hashes.tsv.gz 6: 57161
Escherichia_coli.hashes.tsv.gz 5: 0
Escherichia_coli.hashes.tsv.gz 2: 0
lskatz commented 2 years ago

Listeria has CRC32 collisions

Listeria_monocytogenes.hashes.tsv.gz 3: 0
Listeria_monocytogenes.hashes.tsv.gz 6: 888
Listeria_monocytogenes.hashes.tsv.gz 4: 0
Listeria_monocytogenes.hashes.tsv.gz 5: 0
Listeria_monocytogenes.hashes.tsv.gz 2: 0
lskatz commented 2 years ago

The rest also seem to have an issue with CRC32 but not other hashing algorithms

Salmonella_enterica.hashes.tsv.gz 3: 0
Salmonella_enterica.hashes.tsv.gz 6: 54292
Salmonella_enterica.hashes.tsv.gz 4: 0
Salmonella_enterica.hashes.tsv.gz 5: 0
Salmonella_enterica.hashes.tsv.gz 2: 0

Streptococcus_agalactiae.hashes.tsv.gz 3: 0
Streptococcus_agalactiae.hashes.tsv.gz 6: 385
Streptococcus_agalactiae.hashes.tsv.gz 4: 0
Streptococcus_agalactiae.hashes.tsv.gz 5: 0
Streptococcus_agalactiae.hashes.tsv.gz 2: 0

Streptococcus_pyogenes.hashes.tsv.gz 3: 0
Streptococcus_pyogenes.hashes.tsv.gz 6: 892
Streptococcus_pyogenes.hashes.tsv.gz 4: 0
Streptococcus_pyogenes.hashes.tsv.gz 5: 0
Streptococcus_pyogenes.hashes.tsv.gz 2: 0

Yersinia_enterocolitica.hashes.tsv.gz 6: 138
Yersinia_enterocolitica.hashes.tsv.gz 3: 0
Yersinia_enterocolitica.hashes.tsv.gz 4: 0
Yersinia_enterocolitica.hashes.tsv.gz 5: 0
Yersinia_enterocolitica.hashes.tsv.gz 2: 0
lskatz commented 2 years ago

These are the numbers of alleles per schema after mutating each allele 10x:

$ for i in *.tsv.gz; do echo -ne "$i\t"; zcat $i | wc -l; done;
Acinetobacter_baumannii.hashes.tsv.gz   5244059
Arcobacter_butzleri.hashes.tsv.gz       613209
Brucella_melitensis.hashes.tsv.gz       83958
Campylobacter_jejuni.hashes.tsv.gz      2273102
Escherichia_coli.hashes.tsv.gz  22182678
Listeria_monocytogenes.hashes.tsv.gz    2797807
Salmonella_enterica.hashes.tsv.gz       21647330
Streptococcus_agalactiae.hashes.tsv.gz  1829511
Streptococcus_pyogenes.hashes.tsv.gz    2776379
Yersinia_enterocolitica.hashes.tsv.gz   1155678

E.g., 22M alleles across all loci in the E. coli scheme

lskatz commented 2 years ago

To answer how many alleles there are per locus in each "10x" database with mutations, I made a histogram in some of the larger databases. I am also reading from a file .noseqs.tsv.gz to indicate I replaced the actual allele sequence with - to reduce run times and because I already demonstrated that all alleles are unique in the files.

$ zcat Listeria_monocytogenes.noseqs.tsv.gz | cut -f 1 | \
  perl -lane '
    # grab $locus as everything up until the last underscore
    ($locus, $allele) = split(/_([^_]+)$/, $F[0]); 
    # Ignore custom loci that have shovil or spades or .fasta in their names
    next if($locus =~ /shovill|spades|\.fasta/i); 
    # count the loci in the %locus hash
    $locus{$locus}++; 
    # last step: print locus \t count and round to the nearest 1000
    END{ while(my($locus, $count) = each(%locus)){ $count=int($count/1000)*1000; print "$locus\t$count"; } };
  ' | \
  sort -k2,2n | \
  cut -f 2 | \
  uniq -dc
    441 0
    867 1000
    321 2000
     96 3000
     14 4000
      7 5000
      2 6000
lskatz commented 2 years ago

I went ahead and ran it for all schemes

Escherichia

   3571 0
    795 1000
    531 2000
    415 3000
    436 4000
    408 5000
    368 6000
    321 7000
    238 8000
    169 9000
    108 10000
     75 11000
     43 12000
     47 13000
     26 14000
     19 15000
      9 16000
      8 17000
      5 18000
      6 20000
      2 21000

Salmonella

   4385 0
    859 1000
    524 2000
    436 3000
    443 4000
    514 5000
    453 6000
    361 7000
    268 8000
    125 9000
     94 10000
     48 11000
     23 12000
     13 13000
      9 14000
      2 15000

Acinobacter

    349 0
    673 1000
    818 2000
    462 3000
     78 4000
     10 5000

Streptococcus

   1964 0
    602 1000
    312 2000
    123 3000
     29 4000
      7 5000
      3 6000

Campylobacter

   1900 0
    449 1000
    309 2000
    101 3000
     30 4000
      3 5000

Yersinia

   6342 0

Arcobacter

   7453 0

Brucella

   2762 0