Closed: benjeffery closed this issue 6 months ago
This looks like empty partitions, similar to the issue we had on explode?
I can confirm that in IntermediateColumnarFormatField.chunks, chunk_num_records[start_chunk:] is [] and chunk_cumulative_records[start_chunk + 1:] is also [].
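(Worth noting for anyone following along: Python slicing never raises when the start index is past the end of a list, it just returns an empty sequence, so nothing fails loudly here. A minimal sketch with illustrative values, not the real bio2zarr objects:)

```python
# Illustrative values only; the real arrays live on
# IntermediateColumnarFormatField and come from the ICF chunk index.
chunk_num_records = [23135]          # a partition with a single chunk
chunk_cumulative_records = [23135]

start_chunk = 1                      # out of range: the only valid index is 0

# Slicing past the end of a list silently yields [] instead of raising.
print(chunk_num_records[start_chunk:])              # []
print(chunk_cumulative_records[start_chunk + 1 :])  # []
```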
Can you run that again with -vv please to get debug output?
Can you have a look at the ICF metadata file and see if there are any partitions with "num_records": 0?
There are no zeros in the ICF metadata.
What version of bio2zarr have you here? Line 605 is nowhere near iter_values in main.
Can you view the icf metadata please, and show the last few partitions. Must be something odd to do with tabix indexing
This was on e6b9a after #136 was merged.
I don't think it can be - this is line 605 on that commit: https://github.com/sgkit-dev/bio2zarr/blob/e6b9a1970984c073fa2249fc45b41c2bb5d8fa3b/bio2zarr/vcf.py#L605
Also, it looks like you're pulling the bio2zarr code from a local directory - it would be better to install it into the environment. There are regular releases for exactly this reason.
Very puzzling... Can you give me the output of these commands please:
$ ls [ICF_PATH]/REF/p14761
and
python3 -c 'import pickle; print(pickle.load(open("[ICF_PATH]/REF/p14761/chunk_index", "rb")))'
(Regretting not making the chunk indexes a simple JSON file now!)
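(For anyone else poking at these files, a slightly more readable variant of that one-liner, using the same placeholder path, might be:)

```python
import pickle
import pprint

# Same placeholder path as above; substitute the real ICF partition directory.
with open("[ICF_PATH]/REF/p14761/chunk_index", "rb") as f:
    pprint.pprint(pickle.load(f))
```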
It's 685, not 605. Sorry I can't cut and paste things!
Ahhhh - sorry, I should have zoomed in! :facepalm:
OK, looks like it's a problem with the indexing code. We should have start_chunk=0 (there's only one chunk) but for some reason it's coming out as 1. Would you mind dropping into the debugger please and seeing if you can track this down @benjeffery?
I'm going to see if I can replicate on the chr2 1000G data - we'll see how long it all takes when there's only 40 processes.
I can't reproduce this on 1000G - it seems to work fine with maximal partitioning. There are quite a lot fewer variants though.
Is it just this particular partition that the error occurs on or are there others?
There were a few hundred that failed with this error - digging into this today.
Trying to understand what is going on here: in iter_values we end up with a chunk_offset of 29838. This is greater than the largest value (23135) in chunk_record_index, so we end up with a start_chunk that is greater than the number of chunks, and chunks then returns an empty list.
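(A minimal sketch of that failure mode, assuming a searchsorted-style lookup is used to turn chunk_offset into start_chunk - the real lookup is inside bio2zarr's iter_values, but the values below are the ones from this partition:)

```python
import numpy as np

# Cumulative record index for the partition: a single chunk of 23135 records.
chunk_record_index = np.array([0, 23135])

chunk_offset = 29838  # requested offset, beyond the last record in the index

# A searchsorted-style lookup lands past the final chunk...
start_chunk = np.searchsorted(chunk_record_index, chunk_offset) - 1
print(start_chunk)  # 1, but the only valid chunk index is 0

# ...so every chunk slice taken from start_chunk onwards comes back empty.
```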
I've also checked and it is 550 failing partitions spread non-contiguously from 1708-5985.
Ah-ha, wonder if this has something to do with it. Running dexplode-init again, check_overlap finds an overlap. That check was added after I exploded these VCFs.
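(For reference, the check is conceptually simple: sort the partition regions and flag any region whose start falls at or before the previous region's end. check_overlap itself lives in bio2zarr; this is just a standalone sketch over (filename, start, end) tuples like the ones listed later in the thread:)

```python
def find_overlaps(regions):
    """Return pairs of adjacent regions that overlap.

    `regions` is an iterable of (filename, start, end) tuples with 1-based,
    inclusive coordinates, one per partition.
    """
    ordered = sorted(regions, key=lambda r: (r[1], r[2]))
    return [
        (prev, cur)
        for prev, cur in zip(ordered, ordered[1:])
        # A region that starts at or before the previous end overlaps it.
        if cur[1] <= prev[2]
    ]
```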
Ahhhh
We shouldn't have overlaps here though, right? Can you give some details?
The list of filename, partition.region.start and partition.region.end looks like this:
Really odd that a VCF in the middle contains 1. Checking the VCF to be sure.
A bunch of weird stuff here... are you sure the files are all part of the same set?
They are all in the same folder and have the same naming convention. The 58219159 file contains variants that start at that position.
So... for that one VCF file, the CSIIndex has the first indexed position as 1. No other VCFs in the set have this.
Checking that this is in fact the case in the CSI and not an issue in the index loader.
So this is the only index file that has a bin (bin 9) which is the first bin on its level (level 2). Tracing this all the way back to the bytes in the CSI file confirms this.
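(For context on the bin numbering: in the CSI/BAI binning scheme the first bin of level l is (8^l - 1) / 7, so level 0 starts at bin 0, level 1 at bin 1, level 2 at bin 9, level 3 at bin 73, and so on - which is why bin 9 marks the first bin on level 2. A quick sketch:)

```python
def first_bin_of_level(level):
    # First bin number at each level of the CSI/BAI binning scheme:
    # (8**level - 1) // 7 gives 0, 1, 9, 73, 585, 4681, ...
    return (8**level - 1) // 7

print([first_bin_of_level(level) for level in range(6)])
# [0, 1, 9, 73, 585, 4681]
```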
I'm re-indexing the file to see if this is the expected behavior.
What's the position of the first variant in that file?
58720256
OK, I'm rejigging a bunch of things to avoid using the record counts from the indexes. They're really not reliable, and not present in a lot of cases.
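(If a reliable per-partition record count is needed, the safe fallback is to count by scanning the region rather than trusting the optional counts stored in the tabix/CSI index. This is only a sketch of that idea using pysam, not the approach #164 actually takes:)

```python
import pysam

def count_region_records(vcf_path, contig, start, stop):
    # Count records by iterating the region directly instead of reading
    # the (optional, sometimes wrong) record counts from the index.
    with pysam.VariantFile(vcf_path) as vcf:
        return sum(1 for _ in vcf.fetch(contig, start, stop))
```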
Hopefully closed in #164. I'm going to push out a release in a few minutes so we can test @benjeffery.