Closed shz9 closed 6 months ago
Great catch, thanks @shz9. The problem with old phase 1 files is that the ancient version of tabix here doesn't store the record counts in the index that we use. It works fine if you re-index.
I'll add an update that either checks for these missing bits of information in the index and raises and informative error, or just prints a warning and handles the fact we can't show progress.
This issue also affects the newer NYGC WGS VCF files. Could be a bug that was introduced in recent updates? May be related to #144 .
I've been working with these without problems - can you show the backtrace please?
It's the same as above:
Explode: 1.87Mvars [57:06, 547vars/s]
total_records: 1872741
metadata.num_records: 0
0
Traceback (most recent call last):
File "/home/szabad/bio2zarr_env/bin/vcf2zarr", line 8, in <module>
sys.exit(vcf2zarr())
File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/bio2zarr/cli.py", line 178, in explode
vcf.explode(
File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/bio2zarr/vcf.py", line 1176, in explode
writer.finalise()
File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/bio2zarr/vcf.py", line 1142, in finalise
assert total_records == self.metadata.num_records
AssertionError
The VCF files are here. I think this is probably the same issue you identified for the older 1000G data, because during conversion, the hts
library was throwing the following warnings:
[W::hts_idx_load3] The index file is older than the data file: data/WGS/chr22.vcf.gz.tbi
And indeed, on the website linked to above, it seems that the .tbi
files are older than the files they are indexing. So, I will try to re-index everything before proceeding, but as you said, it'd be good to detect those issues at the scan
stage so that they can be corrected before exploding the files.
Closed in #164
I'm using
bio2zarr v0.0.6
and trying to explode 1000G genotype data (NOTE: not recent NYGC WGS data; but older genotype data from ~2013):It seems there's a mismatch between the metadata and the number of records in each of the partitions? This error still comes up when I use a single worker
-p1
, so it's not due to multiprocessing. I printedtotal_records
andself.metadata.num_records
and got the following:Any ideas what might be going on? It seems like it could be an issue in
scan
due to corrupted or outdated VCF format?UPDATE:
This issue also affects the newer NYGC WGS VCF files. Could be a bug that was introduced in recent updates? May be related to #144 .