Thanks for the great bug report @sofroniewn! I've managed to recreate it locally, and am looking for a workaround or fix.
Ah! This VCF has no samples in! What were you planning to do with it in sgkit?
I'll write up an issue for sgkit to deal with this situation more gracefully.
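To make "no samples" concrete: a VCF whose `#CHROM` header line stops at the eighth fixed column (`INFO`) describes variant sites only, with no `FORMAT` column and no per-sample genotype columns. The sketch below checks that header line; the function name `vcf_sample_names` and the toy header lines are illustrative assumptions, not part of sgkit.

```python
def vcf_sample_names(lines):
    """Return the sample names declared on the #CHROM header line.

    `lines` is any iterable of text lines from a VCF file.
    """
    for line in lines:
        if line.startswith("#CHROM"):
            columns = line.rstrip("\n").split("\t")
            # Columns 0-7 are the fixed fields (#CHROM .. INFO); column 8 is
            # FORMAT and any columns after it are sample names.
            return columns[9:]
        if not line.startswith("#"):
            break  # reached the records without seeing a #CHROM line
    raise ValueError("no #CHROM header line found")


# Toy headers (made up for illustration).
sites_only = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
]
with_samples = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA00001\tNA00002\n",
]

print(vcf_sample_names(sites_only))
print(vcf_sample_names(with_samples))
```

For a compressed file, the same function should work on a handle opened with `gzip.open("clinvar.vcf.gz", "rt")`. An empty result corresponds to the `samples: 0` dimension in the dataset.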
There is now a PR to fix this behaviour in https://github.com/pystatgen/sgkit/pull/1069
For now, you should be able to work around this by using vcf_to_zarr("clinvar.vcf.gz", "clinvar.zarr", regions=["1"])
> I've managed to recreate it locally, and am looking for a workaround or fix.
Ok great!
> For now, you should be able to work around this by using vcf_to_zarr("clinvar.vcf.gz", "clinvar.zarr", regions=["1"])
This actually still seems to give me the same error.
> Ah! This VCF has no samples in! What were you planning to do with it in sgkit?
Huh - ok, I am very new to VCF files (this is the first one I've ever tried to load!) so I'm not even sure what that really means. It seems to have variants inside it.
As to my goals, I have a BAM file specifying some genome intervals, and a FASTA file for the hg38 reference. My goal is to take an interval in the BAM file, find which ClinVar SNP variants are inside that region, and then for each one substitute that SNP into the DNA sequence from the reference to generate an alternative sequence.
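The substitution step described above can be sketched in plain Python. This is a minimal illustration, assuming the reference bases for the interval are already in a string and that the variant position is a 1-based VCF `POS`; the function name `substitute_snp` and the toy sequence are hypothetical. Coordinates taken from pysam are 0-based, so they would need converting before being passed as `interval_start`.

```python
def substitute_snp(ref_seq, interval_start, pos, ref, alt):
    """Return a copy of ref_seq with one single-base substitution applied.

    ref_seq        -- reference bases covering the interval
    interval_start -- 1-based genomic coordinate of ref_seq[0]
    pos            -- 1-based variant position (VCF POS)
    ref, alt       -- reference and alternate alleles, one base each
    """
    if len(ref) != 1 or len(alt) != 1:
        raise ValueError("only single-base substitutions are handled")
    offset = pos - interval_start
    if not 0 <= offset < len(ref_seq):
        raise ValueError("variant lies outside the interval")
    # Guard against coordinate or assembly mismatches before editing.
    if ref_seq[offset].upper() != ref.upper():
        raise ValueError(
            f"reference mismatch at {pos}: sequence has {ref_seq[offset]!r}, "
            f"variant expects {ref!r}"
        )
    return ref_seq[:offset] + alt + ref_seq[offset + 1:]


# Toy interval starting at coordinate 100 (made-up sequence).
alt_seq = substitute_snp("ACGTACGT", interval_start=100, pos=103, ref="T", alt="C")
print(alt_seq)
```

Checking the reference base before substituting is a cheap way to catch off-by-one errors between coordinate systems.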
Right now I was actually able to use pysam and vcf = pysam.VariantFile(vcf_file, 'r') to read in the file and can go from there, but I think I'd prefer to use sgkit if possible, as I think I will prefer the more Pythonic API and metadata handling.
Thanks for your prompt bug-fix. I can test again in the next release. Feel free to close this issue when you want.
Ah, sorry, my bad! Try this workaround, which bypasses the parallel loading code:
from sgkit.io.vcf.vcf_reader import vcf_to_zarr_sequential
vcf_to_zarr_sequential("clinvar.vcf.gz", "clinvar.zarr")
You should then be able to use the arrays of positions and alleles to get your analysis done:
>>> ds = sgkit.load_dataset("clinvar.zarr")
>>> ds
<xarray.Dataset>
Dimensions:           (contigs: 29, filters: 1, samples: 0, variants: 2174000,
                       alleles: 4)
Dimensions without coordinates: contigs, filters, samples, variants, alleles
Data variables:
    contig_id         (contigs) <U14 dask.array<chunksize=(29,), meta=np.ndarray>
    filter_id         (filters) object dask.array<chunksize=(1,), meta=np.ndarray>
    sample_id         (samples) float64 dask.array<chunksize=(0,), meta=np.ndarray>
    variant_allele    (variants, alleles) object dask.array<chunksize=(10000, 4), meta=np.ndarray>
    variant_contig    (variants) int8 dask.array<chunksize=(10000,), meta=np.ndarray>
    variant_filter    (variants, filters) bool dask.array<chunksize=(10000, 1), meta=np.ndarray>
    variant_id        (variants) object dask.array<chunksize=(10000,), meta=np.ndarray>
    variant_id_mask   (variants) bool dask.array<chunksize=(10000,), meta=np.ndarray>
    variant_position  (variants) int32 dask.array<chunksize=(10000,), meta=np.ndarray>
    variant_quality   (variants) float32 dask.array<chunksize=(10000,), meta=np.ndarray>
Attributes:
    contigs:               ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10'...
    filters:               ['PASS']
    max_alt_alleles_seen:  1
    source:                sgkit-0.5.1.dev153+g8179307.d20230301
    vcf_header:            ##fileformat=VCFv4.1\n##FILTER=<ID=PASS,Descriptio...
    vcf_zarr_version:      0.2
>>> ds.variant_position.values
array([ 69134, 69581, 69682, ..., 274366, 275068, 83614], dtype=int32)
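Once the position and contig arrays are in memory, picking out the variants in one interval is a boolean mask. The sketch below uses small made-up NumPy arrays standing in for `ds.variant_contig.values`, `ds.variant_position.values` and `ds.variant_allele.values`, and assumes `variant_contig` holds indices into the `contigs` attribute; the helper name `variants_in_interval` is hypothetical. Positions restart on each contig, so the contig has to be part of the filter.

```python
import numpy as np


def variants_in_interval(variant_contig, variant_position, contig_index, start, end):
    """Indices of variants on one contig with start <= POS <= end (1-based, inclusive)."""
    variant_contig = np.asarray(variant_contig)
    variant_position = np.asarray(variant_position)
    mask = (
        (variant_contig == contig_index)
        & (variant_position >= start)
        & (variant_position <= end)
    )
    return np.flatnonzero(mask)


# Toy stand-ins for the dataset arrays (values are illustrative only).
contig = np.array([0, 0, 0, 1, 1], dtype=np.int8)
position = np.array([69134, 69581, 69682, 69581, 83614], dtype=np.int32)
alleles = np.array(
    [["A", "G"], ["C", "G"], ["G", "A"], ["T", "C"], ["AT", "A"]], dtype=object
)

idx = variants_in_interval(contig, position, contig_index=0, start=69500, end=69700)

# Keep only single-base substitutions (REF and ALT both one base long).
snps = [
    (int(position[i]), alleles[i, 0], alleles[i, 1])
    for i in idx
    if len(alleles[i, 0]) == 1 and len(alleles[i, 1]) == 1
]
print(idx, snps)
```

With the real dataset, the contig index could be looked up with `ds.attrs["contigs"].index("1")`, and each `(position, ref, alt)` tuple then feeds the substitution step.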
Closing, as this should be fixed by #1069. @sofroniewn feel free to open a new issue if you have more questions.
Hello
I am fairly new to working with VCF files, so this may be a basic mistake on my part, but I would have expected the following to work.
I am trying to read the ClinVar VCF file located at that page. I downloaded both clinvar.vcf.gz and clinvar.vcf.gz.tbi and then ran the following with sgkit 0.6.0 and got the following error
Do I need to specify more input arguments? Any help reading the clinvar vcf file would be much appreciated. Thanks!!