Discrepancy in Results When Replicating Compression of 1000 Genomes Project Dataset Using GVC Compression Tool

Hello,

I have been trying to reproduce the results from your paper using your code, but my results differ significantly from the reported ones. The environment I used is as follows:

Hardware: 2 Intel Xeon E5-2620 v3 CPUs (6 double-threaded cores per CPU, 24 threads in total, each clocked at 2.4 GHz), 128 GiB RAM, and one 2 TiB SSD with a buffered read speed of 456.04 MB/sec as reported by hdparm -t.

Python version: 3.8

My replication steps were:

1. Clone the repository: git clone https://github.com/sXperfect/gvc
2. Run setup: bash setup.sh
3. Use the encoder: JBIG (jbgtopbm85)
4. Install the package: python setup.py install
5. Execute the command: python -m gvc encode 1000GP3/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz chr1.gvc

I expected chr1.gvc to be around 70 MB, but the file I obtained is 321.12 MB, roughly 4.6 times larger. This size does not include the chr1.gvc.metadata folder. Could you please suggest possible reasons for this discrepancy and any potential solutions?
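For reference, here is a minimal sketch of how the output sizes can be measured so that the numbers are comparable. It is not part of GVC; the helper names path_size_bytes and describe are my own, and it assumes chr1.gvc and chr1.gvc.metadata are in the current working directory. It prints each size in both MB (10^6 bytes) and MiB (2^20 bytes), since the unit convention alone can shift the figure by a few percent.

```python
import os
from pathlib import Path


def path_size_bytes(path):
    """Return the size of a file, or the total size of all files under a directory."""
    p = Path(path)
    if p.is_file():
        return p.stat().st_size
    # For a directory, sum every regular file found recursively.
    return sum(f.stat().st_size for f in p.rglob("*") if f.is_file())


def describe(n_bytes):
    """Format a byte count in decimal megabytes and binary mebibytes."""
    return f"{n_bytes / 1e6:.2f} MB ({n_bytes / 2**20:.2f} MiB)"


if __name__ == "__main__":
    # Assumed output names, taken from the encode command above.
    for target in ("chr1.gvc", "chr1.gvc.metadata"):
        if os.path.exists(target):
            print(target, describe(path_size_bytes(target)))
        else:
            print(target, "not found")
```

Reporting the archive and the metadata folder separately should make it clear which of the two the figure in the paper refers to.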

Repository: sXperfect/gvc, issue #1