Trying it out here: https://github.com/lskatz/mlst-hash-template/compare/master...plaintext
$ perl scripts/digestFasta.pl --hash plaintext ~/GWA/projects/validation/mlstComparison/MLST.db/Salmonella_enterica.chewbbaca/*.fasta --out salmonella_enterica.plaintext
$ cd salmonella_enterica.plaintext
$ ls -lh
total 3.4G
-rw-------. 1 gzu2 users 3.4G Jan 31 17:27 alleles.tsv
-rw-------. 1 gzu2 users 7.0M Jan 31 17:27 ref.fasta
$ sort -k2,2 alleles.tsv > sorted.tsv
$ ls -lhS
total 6.7G
-rw-------. 1 gzu2 users 3.4G Jan 31 17:27 alleles.tsv
-rw-------. 1 gzu2 users 3.4G Jan 31 20:01 sorted.tsv
-rw-------. 1 gzu2 users 7.0M Jan 31 17:27 ref.fasta
$ gzip *.tsv
$ ls -lh
total 191M
-rw-------. 1 gzu2 users 110M Jan 31 17:27 alleles.tsv.gz
-rw-------. 1 gzu2 users 7.0M Jan 31 17:27 ref.fasta
-rw-------. 1 gzu2 users 74M Jan 31 20:01 sorted.tsv.gz
So the vanilla S. enterica database compresses down to about 74M with default gzip settings if the table is sorted first, not counting ref.fasta, which is not being tested here.
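For comparison elsewhere, the sorted-vs-unsorted check can be done without keeping the intermediate files, e.g. (a sketch; it assumes you are in the output directory and that column 2 of alleles.tsv holds the allele sequence/hash):
$ gzip -c alleles.tsv | wc -c                # compressed size in bytes, original order
$ sort -k2,2 alleles.tsv | gzip -c | wc -c   # compressed size in bytes, sorted on the allele column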
$ perl scripts/digestFasta.pl --hash md5 ~/GWA/projects/validation/mlstComparison/MLST.db/Salmonella_enterica.chewbbaca/*.fasta --out salmonella_enterica.md5
$ cd salmonella_enterica.md5
$ ls -lhS
total 152M
-rw-------. 1 gzu2 users 145M Jan 31 20:09 alleles.tsv
-rw-------. 1 gzu2 users 7.0M Jan 31 20:09 ref.fasta
$ sort -k2,2 alleles.tsv > sorted.tsv
$ ls -lhS
total 297M
-rw-------. 1 gzu2 users 145M Jan 31 20:09 alleles.tsv
-rw-------. 1 gzu2 users 145M Jan 31 20:10 sorted.tsv
-rw-------. 1 gzu2 users 7.0M Jan 31 20:09 ref.fasta
$ gzip alleles.tsv & gzip sorted.tsv
[1] 20744
[1]+ Done gzip alleles.tsv
$ ls -lhS
total 117M
-rw-------. 1 gzu2 users 58M Jan 31 20:10 sorted.tsv.gz
-rw-------. 1 gzu2 users 52M Jan 31 20:09 alleles.tsv.gz
-rw-------. 1 gzu2 users 7.0M Jan 31 20:09 ref.fasta
OK, and in this case with md5, it goes from 145M down to 52M. Surprisingly, sorting on the hash column makes the compressed size go up (58M sorted vs 52M unsorted).
$ zcat alleles.tsv.gz | sort -k1,1 | gzip -c > alleles.k1.tsv.gz
$ zcat alleles.tsv.gz | sort -k1,2 | gzip -c > alleles.k12.tsv.gz
$ ls -lhS
total 220M
-rw-------. 1 gzu2 users 58M Jan 31 20:10 sorted.tsv.gz
-rw-------. 1 gzu2 users 52M Jan 31 20:09 alleles.tsv.gz
-rw-------. 1 gzu2 users 52M Jan 31 20:15 alleles.k12.tsv.gz
-rw-------. 1 gzu2 users 52M Jan 31 20:13 alleles.k1.tsv.gz
-rw-------. 1 gzu2 users 7.0M Jan 31 20:09 ref.fasta
And sorting on the first field, or on the first and then the second field, doesn't give any further gain; it's 52M either way.
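A quicker way to sweep the sort keys is a small loop (a sketch; it assumes the current directory holds alleles.tsv, and the sizes are printed in bytes):
$ for k in 1,1 2,2 1,2; do
>   printf 'sort -k%s\t' "$k"
>   sort -k"$k" alleles.tsv | gzip -c | wc -c
> done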
What about sha256?
$ perl scripts/digestFasta.pl --hash sha256 ~/GWA/projects/validation/mlstComparison/MLST.db/Salmonella_enterica.chewbbaca/*.fasta --out salmonella_enterica.sha256
$ cat alleles.tsv | sort -k1,2 | gzip -c > alleles.k12.tsv.gz &
[1] 26373
$ cat alleles.tsv | sort -k1,1 | gzip -c > alleles.k1.tsv.gz &
[2] 26432
$ cat alleles.tsv | sort -k2,2 | gzip -c > alleles.k2.tsv.gz &
[3] 26521
$ ls -lhS
total 513M
-rw-------. 1 gzu2 users 212M Jan 31 20:18 alleles.tsv
-rw-------. 1 gzu2 users 103M Jan 31 20:20 alleles.k2.tsv.gz
-rw-------. 1 gzu2 users 96M Jan 31 20:20 alleles.k12.tsv.gz
-rw-------. 1 gzu2 users 96M Jan 31 20:20 alleles.k1.tsv.gz
-rw-------. 1 gzu2 users 7.0M Jan 31 20:18 ref.fasta
In this case, sorting on the hash did not help at all; the hash-sorted file (k2) is actually the largest at 103M, versus 96M for the other sort orders. Compression still took us from 212M down to 96M.
So in the end, the largest compressed database came from sha256 hashes at 96M, then plaintext at 74M, and the smallest was md5 at 52M. That ordering is roughly what you would expect: hash output is essentially random and barely compresses, and a sha256 digest is twice as long as an md5 digest, while the plaintext alleles are hugely redundant, so they shrink from 3.4G to 74M but still end up larger than the md5 table.
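A quick sanity check on that explanation is to look at the average width of the hash/sequence column (a sketch; it assumes the digest or sequence is in field 2 of alleles.tsv):
$ awk -F'\t' '{ sum += length($2) } END { print sum/NR, "average characters in field 2" }' alleles.tsv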
See how well a database can compress if the allele is not hashed and it just uses the "plaintext" algorithm. Otherwise, it is in the same format, with one allele per line.
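For completeness, the whole comparison could be reproduced in one loop over the three --hash options (a sketch; the scheme path is a placeholder, and sorting on column 2 assumes that is where the allele sequence/hash lands):
$ for h in plaintext md5 sha256; do
>   perl scripts/digestFasta.pl --hash "$h" /path/to/scheme/*.fasta --out "db.$h"
>   sort -k2,2 "db.$h/alleles.tsv" | gzip -c > "db.$h/alleles.sorted.tsv.gz"
>   ls -lh "db.$h"/alleles*
> done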