Closed lskatz closed 2 years ago
Testing for duplicates in rows 2 through 6:
for file in *.tsv.gz; do
for i in `seq 2 6`; do
# parallelize looking at each column of a single file
(
# get the count of times a column has duplications
count=$(zcat $file | cut -f $i | sort --parallel 24 | uniq -d | wc -l);
echo -e "$file $i: $count"
) &
done;
wait;
done;
With the A. baumannii scheme, CRC32 has collisions but other hashsum algorithms do not:
Acinetobacter_baumannii.hashes.tsv.gz 3: 0
Acinetobacter_baumannii.hashes.tsv.gz 4: 0
Acinetobacter_baumannii.hashes.tsv.gz 6: 3200
Acinetobacter_baumannii.hashes.tsv.gz 5: 0
Acinetobacter_baumannii.hashes.tsv.gz 2: 0
Collision with CRC32 again in this scheme
Arcobacter_butzleri.hashes.tsv.gz 6: 44
Arcobacter_butzleri.hashes.tsv.gz 3: 0
Arcobacter_butzleri.hashes.tsv.gz 4: 0
Arcobacter_butzleri.hashes.tsv.gz 5: 0
Arcobacter_butzleri.hashes.tsv.gz 2: 0
This scheme is unscathed
Brucella_melitensis.hashes.tsv.gz 6: 0
Brucella_melitensis.hashes.tsv.gz 3: 0
Brucella_melitensis.hashes.tsv.gz 4: 0
Brucella_melitensis.hashes.tsv.gz 5: 0
Brucella_melitensis.hashes.tsv.gz 2: 0
Campy has CRC32 collisions
Campylobacter_jejuni.hashes.tsv.gz 6: 589
Campylobacter_jejuni.hashes.tsv.gz 3: 0
Campylobacter_jejuni.hashes.tsv.gz 4: 0
Campylobacter_jejuni.hashes.tsv.gz 5: 0
Campylobacter_jejuni.hashes.tsv.gz 2: 0
Escherichia has CRC32 collisions
Escherichia_coli.hashes.tsv.gz 3: 0
Escherichia_coli.hashes.tsv.gz 4: 0
Escherichia_coli.hashes.tsv.gz 6: 57161
Escherichia_coli.hashes.tsv.gz 5: 0
Escherichia_coli.hashes.tsv.gz 2: 0
Listeria has CRC32 collisions
Listeria_monocytogenes.hashes.tsv.gz 3: 0
Listeria_monocytogenes.hashes.tsv.gz 6: 888
Listeria_monocytogenes.hashes.tsv.gz 4: 0
Listeria_monocytogenes.hashes.tsv.gz 5: 0
Listeria_monocytogenes.hashes.tsv.gz 2: 0
The rest also seem to have an issue with CRC32 but not other hashing algorithms
Salmonella_enterica.hashes.tsv.gz 3: 0
Salmonella_enterica.hashes.tsv.gz 6: 54292
Salmonella_enterica.hashes.tsv.gz 4: 0
Salmonella_enterica.hashes.tsv.gz 5: 0
Salmonella_enterica.hashes.tsv.gz 2: 0
Streptococcus_agalactiae.hashes.tsv.gz 3: 0
Streptococcus_agalactiae.hashes.tsv.gz 6: 385
Streptococcus_agalactiae.hashes.tsv.gz 4: 0
Streptococcus_agalactiae.hashes.tsv.gz 5: 0
Streptococcus_agalactiae.hashes.tsv.gz 2: 0
Streptococcus_pyogenes.hashes.tsv.gz 3: 0
Streptococcus_pyogenes.hashes.tsv.gz 6: 892
Streptococcus_pyogenes.hashes.tsv.gz 4: 0
Streptococcus_pyogenes.hashes.tsv.gz 5: 0
Streptococcus_pyogenes.hashes.tsv.gz 2: 0
Yersinia_enterocolitica.hashes.tsv.gz 6: 138
Yersinia_enterocolitica.hashes.tsv.gz 3: 0
Yersinia_enterocolitica.hashes.tsv.gz 4: 0
Yersinia_enterocolitica.hashes.tsv.gz 5: 0
Yersinia_enterocolitica.hashes.tsv.gz 2: 0
These are the numbers of alleles per schema after mutating each allele 10x:
$ for i in *.tsv.gz; do echo -ne "$i\t"; zcat $i | wc -l; done;
Acinetobacter_baumannii.hashes.tsv.gz 5244059
Arcobacter_butzleri.hashes.tsv.gz 613209
Brucella_melitensis.hashes.tsv.gz 83958
Campylobacter_jejuni.hashes.tsv.gz 2273102
Escherichia_coli.hashes.tsv.gz 22182678
Listeria_monocytogenes.hashes.tsv.gz 2797807
Salmonella_enterica.hashes.tsv.gz 21647330
Streptococcus_agalactiae.hashes.tsv.gz 1829511
Streptococcus_pyogenes.hashes.tsv.gz 2776379
Yersinia_enterocolitica.hashes.tsv.gz 1155678
E.g., 22M alleles across all loci in the E. coli scheme
To answer how many alleles there are per locus in each "10x" database with mutations, I made a histogram in some of the larger databases. I am also reading from a file .noseqs.tsv.gz
to indicate I replaced the actual allele sequence with -
to reduce run times and because I already demonstrated that all alleles are unique in the files.
$ zcat Listeria_monocytogenes.noseqs.tsv.gz | cut -f 1 | \
perl -lane '
# grab $locus as everything up until the last underscore
($locus, $allele) = split(/_([^_]+)$/, $F[0]);
# Ignore custom loci that have shovil or spades or .fasta in their names
next if($locus =~ /shovill|spades|\.fasta/i);
# count the loci in the %locus hash
$locus{$locus}++;
# last step: print locus \t count and round to the nearest 1000
END{ while(my($locus, $count) = each(%locus)){ $count=int($count/1000)*1000; print "$locus\t$count"; } };
' | \
sort -k2,2n | \
cut -f 2 | \
uniq -dc
441 0
867 1000
321 2000
96 3000
14 4000
7 5000
2 6000
I went ahead and ran it for all schemes
3571 0
795 1000
531 2000
415 3000
436 4000
408 5000
368 6000
321 7000
238 8000
169 9000
108 10000
75 11000
43 12000
47 13000
26 14000
19 15000
9 16000
8 17000
5 18000
6 20000
2 21000
4385 0
859 1000
524 2000
436 3000
443 4000
514 5000
453 6000
361 7000
268 8000
125 9000
94 10000
48 11000
23 12000
13 13000
9 14000
2 15000
349 0
673 1000
818 2000
462 3000
78 4000
10 5000
1964 0
602 1000
312 2000
123 3000
29 4000
7 5000
3 6000
1900 0
449 1000
309 2000
101 3000
30 4000
3 5000
6342 0
7453 0
2762 0
Coming off of #11, I expanded the database 10x to see if I could create any collisions by accident. I did this by introducing 1 SNP ten different times per allele. I also kept the original allele. Any identical sequence that was found was disregarded since I am looking at collisions between different sequences.