sgkit-dev / bio2zarr

Convert bioinformatics file formats to Zarr
Apache License 2.0
28 stars 7 forks source link

VCF indexes without record counts lead to assertion #150

Closed shz9 closed 6 months ago

shz9 commented 7 months ago

I'm using bio2zarr v0.0.6 and trying to explode 1000G genotype data (NOTE: not recent NYGC WGS data; but older genotype data from ~2013):

vcf2zarr explode data/genotypes/chr22.vcf.gz data/genotypes/chr22.icf -p8
    Scan: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.00/1.00 [00:00<00:00, 14.1files/s]
 Explode: 1.10Mvars [02:45, 6.65kvars/s]
Traceback (most recent call last):
  File "/home/szabad/bio2zarr_env/bin/vcf2zarr", line 8, in <module>
    sys.exit(vcf2zarr())
  File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/bio2zarr/cli.py", line 178, in explode
    vcf.explode(
  File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/bio2zarr/vcf.py", line 1173, in explode
    writer.finalise()
  File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/bio2zarr/vcf.py", line 1139, in finalise
    assert total_records == self.metadata.num_records
AssertionError

It seems there's a mismatch between the metadata and the number of records in each of the partitions? This error still comes up when I use a single worker -p1, so it's not due to multiprocessing. I printed total_records and self.metadata.num_records and got the following:

total_records: 1103547
metadata.num_records: 0

Any ideas what might be going on? It seems like it could be an issue in scan due to corrupted or outdated VCF format?

UPDATE:

This issue also affects the newer NYGC WGS VCF files. Could be a bug that was introduced in recent updates? May be related to #144 .

jeromekelleher commented 7 months ago

Great catch, thanks @shz9. The problem with old phase 1 files is that the ancient version of tabix here doesn't store the record counts in the index that we use. It works fine if you re-index.

I'll add an update that either checks for these missing bits of information in the index and raises and informative error, or just prints a warning and handles the fact we can't show progress.

jeromekelleher commented 7 months ago

This issue also affects the newer NYGC WGS VCF files. Could be a bug that was introduced in recent updates? May be related to #144 .

I've been working with these without problems - can you show the backtrace please?

shz9 commented 7 months ago

It's the same as above:

 Explode: 1.87Mvars [57:06, 547vars/s]  
total_records: 1872741
metadata.num_records: 0
0
Traceback (most recent call last):
  File "/home/szabad/bio2zarr_env/bin/vcf2zarr", line 8, in <module>
    sys.exit(vcf2zarr())
  File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/bio2zarr/cli.py", line 178, in explode
    vcf.explode(
  File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/bio2zarr/vcf.py", line 1176, in explode
    writer.finalise()
  File "/home/szabad/bio2zarr_env/lib/python3.9/site-packages/bio2zarr/vcf.py", line 1142, in finalise
    assert total_records == self.metadata.num_records
AssertionError

The VCF files are here. I think this is probably the same issue you identified for the older 1000G data, because during conversion, the hts library was throwing the following warnings:

[W::hts_idx_load3] The index file is older than the data file: data/WGS/chr22.vcf.gz.tbi

And indeed, on the website linked to above, it seems that the .tbi files are older than the files they are indexing. So, I will try to re-index everything before proceeding, but as you said, it'd be good to detect those issues at the scan stage so that they can be corrected before exploding the files.

jeromekelleher commented 6 months ago

Closed in #164