sgkit-dev / bio2zarr

Convert bioinformatics file formats to Zarr
Apache License 2.0

`dencode-partition` fails with `AssertionError` #144

Closed: benjeffery closed this issue 6 months ago

benjeffery commented 7 months ago

Screenshot from 2024-04-25 12-14-47

benjeffery commented 7 months ago

This looks like empty partitions, similar to the issue we had on explode?

benjeffery commented 7 months ago

I can confirm that in `IntermediateColumnarFormatField.chunks`, `chunk_num_records[start_chunk:]` is `[]` and `chunk_cumulative_records[start_chunk + 1:]` is also `[]`.
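A minimal sketch (illustrative values, not the actual ICF state) of why both slices come back empty once `start_chunk` lands past the end of the chunk list:

```python
# Illustrative values only: one chunk in the partition, but start_chunk
# has been computed as 1 when it should be 0.
chunk_num_records = [23135]
chunk_cumulative_records = [0, 23135]
start_chunk = 1

print(chunk_num_records[start_chunk:])             # []
print(chunk_cumulative_records[start_chunk + 1:])  # []
```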

jeromekelleher commented 7 months ago

Can you run that again with `-vv` please to get debug output?

Can you have a look at the ICF metadata file and see if there is any partition with `"num_records": 0`?

benjeffery commented 7 months ago

Screenshot from 2024-04-25 16-01-01

There are no zeros in the ICF metadata.

jeromekelleher commented 7 months ago

What version of bio2zarr do you have here? Line 605 is nowhere near `iter_values` on main.

jeromekelleher commented 7 months ago

Can you view the ICF metadata please, and show the last few partitions. It must be something odd to do with the tabix indexing.

benjeffery commented 7 months ago

> What version of bio2zarr have you here? Line 605 is nowhere near iter_values in main

This was on e6b9a after #136 was merged.

benjeffery commented 7 months ago

> Can you view the icf metadata please, and show the last few partitions. Must be something odd to do with tabix indexing

Screenshot from 2024-04-26 10-01-14

jeromekelleher commented 7 months ago

> This was on e6b9a after #136 was merged.

I don't think it can be - this is line 605 on that commit: https://github.com/sgkit-dev/bio2zarr/blob/e6b9a1970984c073fa2249fc45b41c2bb5d8fa3b/bio2zarr/vcf.py#L605

jeromekelleher commented 7 months ago

Also, it looks like you're pulling the bio2zarr code from a local directory - it would be better to install it into the environment. There are regular releases for just this.

jeromekelleher commented 7 months ago

Very puzzling... Can you give me the output of these commands please:

$ ls [ICF_PATH]/REF/p14761

and

python3 -c 'import pickle; print(pickle.load(open("[ICF_PATH]/REF/p14761/chunk_index", "rb")))'

(Regretting not making the chunk indexes a simple JSON file now!)

benjeffery commented 7 months ago

> This was on e6b9a after #136 was merged.

> I don't think it can be - this is line 605 on that commit: https://github.com/sgkit-dev/bio2zarr/blob/e6b9a1970984c073fa2249fc45b41c2bb5d8fa3b/bio2zarr/vcf.py#L605

It's 685, not 605. Sorry I can't cut and paste things!

benjeffery commented 7 months ago

> python3 -c 'import pickle; print(pickle.load(open("[ICF_PATH]/REF/p14761/chunk_index", "rb")))'

Screenshot from 2024-04-26 11-47-08

jeromekelleher commented 7 months ago

> It's 685, not 605. Sorry I can't cut and paste things!

Ahhhh - sorry, I should have zoomed in! :facepalm:

jeromekelleher commented 7 months ago

OK, looks like it's a problem with the indexing code. We should have `start_chunk=0` (there's only one chunk) but for some reason it's coming out as 1. Would you mind dropping into the debugger please and seeing if you can track this down @benjeffery?

jeromekelleher commented 7 months ago

I'm going to see if I can replicate this on the chr2 1000G data - we'll see how long it all takes when there are only 40 processes.

jeromekelleher commented 7 months ago

I can't reproduce this on 1000G - it seems to work fine with maximal partitioning. There are quite a lot fewer variants, though.

Is it just this particular partition that the error occurs on or are there others?

benjeffery commented 7 months ago

There were a few hundred partitions that failed with this error - digging into this today.

benjeffery commented 7 months ago

Trying to understand what is going on here: in `iter_values` we end up with a `chunk_offset` of 29838. This is greater than the largest value (23135) in `chunk_record_index`, so we end up with a `start_chunk` that is greater than the number of chunks, and `chunks` then returns an empty list.
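An illustrative sketch (not bio2zarr's exact lookup code) of how an offset larger than anything in a cumulative record index yields an out-of-range chunk number:

```python
import bisect

# Assumed shape of the index: cumulative record counts, one chunk here.
chunk_record_index = [0, 23135]  # largest value is 23135
chunk_offset = 29838             # offset coming from the bad partition

# A binary search for the offset lands past the end of the index.
start_chunk = bisect.bisect_left(chunk_record_index, chunk_offset)
print(start_chunk)  # 2 -- but the only valid chunk index is 0
```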

benjeffery commented 7 months ago

I've also checked, and there are 550 failing partitions spread non-contiguously across 1708-5985.

benjeffery commented 7 months ago

Ah-ha, I wonder if this has something to do with it. Running `dexplode-init` again, `check_overlap` finds an overlap. That check was added after I exploded these VCFs.
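A hypothetical sketch of the kind of pairwise check an overlap validator like `check_overlap` performs (the function name is from the thread; the signature and half-open-interval semantics here are assumptions, not bio2zarr's exact code):

```python
def check_overlap(regions):
    """regions: (start, end) pairs sorted by start; raise on any overlap."""
    for (s1, e1), (s2, e2) in zip(regions, regions[1:]):
        if s2 < e1:
            raise ValueError(
                f"region starting at {s2} overlaps one ending at {e1}"
            )

check_overlap([(1, 100), (100, 200)])  # fine: half-open regions may touch
try:
    check_overlap([(1, 100), (50, 200)])
except ValueError as e:
    print(e)
```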

jeromekelleher commented 7 months ago

Ahhhh

jeromekelleher commented 7 months ago

We shouldn't have overlaps here though, right? Can you give some details?

benjeffery commented 7 months ago

The list of filename, `partition.region.start` and `partition.region.end` looks like this: Screenshot from 2024-04-30 14-59-05

Really odd that a VCF in the middle has a region start of 1. Checking the VCF to be sure.

jeromekelleher commented 7 months ago

A bunch of weird stuff here... are you sure the files are all part of the same set?

benjeffery commented 7 months ago

They are all in the same folder and have the same naming convention. The 58219159 file contains variants that start at that position.

benjeffery commented 6 months ago

So... for that one VCF file, the `CSIIndex` has the first indexed position as 1. No other VCFs in the set have this. Checking that this is in fact the case in the CSI file and not an issue in the index loader.

benjeffery commented 6 months ago

So this is the only index file that has a bin (bin 9) which is the first bin on its level (level 2). Tracing this all the way back to the bytes in the CSI file confirms it. I'm re-indexing the file to see if this is the expected behavior.
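For background, in the standard CSI binning scheme (this is general index math, not bio2zarr code), each level l starts at bin `(8**l - 1) // 7`, so the levels begin at bins 0, 1, 9, 73, ... With the usual `min_shift=14` and `depth=5`, a level-l bin covers `2**(min_shift + 3*(depth - l))` bp:

```python
# Standard CSI binning parameters (the common defaults are assumed here).
min_shift, depth = 14, 5

def first_bin(level):
    # Total bins on all shallower levels: sum of 8**k for k < level.
    return (8 ** level - 1) // 7

def bin_start(bin_id, level):
    # 0-based genomic start of a bin on the given level.
    return (bin_id - first_bin(level)) * 2 ** (min_shift + 3 * (depth - level))

print(first_bin(2))     # 9 -- bin 9 is indeed the first bin on level 2
print(bin_start(9, 2))  # 0 -- so its interval starts at position 1 (1-based)
```

This is consistent with the observation above: an entry in bin 9 implies a record whose interval begins at the very start of the contig.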

jeromekelleher commented 6 months ago

What's the position of the first variant in that file?

benjeffery commented 6 months ago

58720256
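Possibly relevant arithmetic (assuming the default CSI parameters `min_shift=14`, `depth=5`): that position is an exact multiple of the level-2 bin size, i.e. it sits on a level-2 bin boundary. Whether this alignment is connected to the stray bin-9 entry is speculation:

```python
# Level-2 CSI bins cover 2**(14 + 3*(5 - 2)) = 2**23 bp with the defaults.
level2_bin_size = 2 ** (14 + 3 * (5 - 2))  # 8_388_608
print(58720256 // level2_bin_size, 58720256 % level2_bin_size)  # 7 0
```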

jeromekelleher commented 6 months ago

OK, I'm rejigging a bunch of things to avoid using the record counts from the indexes. They're really not reliable, and are not present in a lot of cases.

jeromekelleher commented 6 months ago

Hopefully closed by #164. I'm going to push out a release in a few minutes so we can test, @benjeffery.