Cannot handle columns > 2GiB

ShyamieG commented 1 year ago

I am experiencing an error when trying to compress certain tree sequence files:

ValueError: Codec does not support buffers of > 2147483647 bytes

It seems that this error is originating from some function in the zarr package related to chunking? It occurs with both the Python and command-line versions of tszip.

I can't upload an example here because even the gzipped version of my file is too large (72.3MB).

Any insight into why this is happening and how I might resolve it?

jeromekelleher commented 1 year ago

Very interesting @ShyamieG! Can you provide us with a bit more info please:

The full stack trace from your ValueError above
The output of tskit info for one of these ts files.

ShyamieG commented 1 year ago

Hi there, sorry for the delay. Thanks for helping me with this!

Here is the result of running tszip on the command line.

And the result of tskit info on the offending ts.

jeromekelleher commented 1 year ago

Here's the tskit info from your file:

╔════════════════════════╗
║TreeSequence            ║
╠═══════════════╤════════╣
║Trees          │    7871║
╟───────────────┼────────╢
║Sequence Length│24214675║
╟───────────────┼────────╢
║Time Units     │   ticks║
╟───────────────┼────────╢
║Sample Nodes   │    5098║
╟───────────────┼────────╢
║Total Size     │ 3.1 GiB║
╚═══════════════╧════════╝
╔═══════════╤═════╤═════════╤════════════╗
║Table      │Rows │Size     │Has Metadata║
╠═══════════╪═════╪═════════╪════════════╣
║Edges      │37085│  1.1 MiB│          No║
╟───────────┼─────┼─────────┼────────────╢
║Individuals│ 4958│486.0 KiB│         Yes║
╟───────────┼─────┼─────────┼────────────╢
║Migrations │    0│  8 Bytes│          No║
╟───────────┼─────┼─────────┼────────────╢
║Mutations  │ 1093│ 65.9 KiB│         Yes║
╟───────────┼─────┼─────────┼────────────╢
║Nodes      │ 8152│303.2 KiB│         Yes║
╟───────────┼─────┼─────────┼────────────╢
║Populations│    4│  2.4 KiB│         Yes║
╟───────────┼─────┼─────────┼────────────╢
║Provenances│  232│  3.1 GiB│          No║
╟───────────┼─────┼─────────┼────────────╢
║Sites      │  680│ 16.0 KiB│          No║
╚═══════════╧═════╧═════════╧════════════╝

That's a heck of a lot of provenance data (3.1 GiB!) - I wonder what's going on there? It's very unlikely this is of any use, so it's worth figuring out where it came from and stopping it from happening.

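I think that must be the tszip issue: the codec can't handle columns larger than 2 GiB (the 2147483647 bytes in the error message is 2^31 - 1), and the provenance table here is 3.1 GiB.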
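To make the size comparison concrete, here is a minimal sketch that flags columns over the limit quoted in the error message. The helper name, the column labels and the byte counts are illustrative assumptions based loosely on the table above. They are not measured from the file and are not part of tszip.

```python
# Limit quoted in the ValueError: 2147483647 bytes, i.e. 2**31 - 1.
CODEC_MAX_BYTES = 2**31 - 1


def oversized_columns(column_sizes, limit=CODEC_MAX_BYTES):
    """Return the names of columns whose byte size exceeds the limit."""
    return [name for name, nbytes in column_sizes.items() if nbytes > limit]


# Hypothetical column labels and approximate sizes, for illustration only.
sizes = {
    "edges/left": 37085 * 8,                  # 37085 rows of 8-byte floats
    "provenances/record": int(3.1 * 2**30),   # roughly 3.1 GiB of record text
}

print(oversized_columns(sizes))
```

With a real file, the sizes could probably be read from the table columns themselves (for example `ts.tables.provenances.record.nbytes`, since tskit exposes columns as NumPy arrays), but that is not how tszip itself reports the problem.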

ShyamieG commented 1 year ago

Ah, okay, got it. Any tips on how I can go about figuring out what all of this stuff is?

jeromekelleher commented 1 year ago

I guess the first step would be to look at the provenances, e.g.

tskit provenances <file> | less -S

I'm not sure how well it'll deal with having a 2G record though.
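If the command-line output turns out to be unwieldy, a rough alternative is to summarise the records from Python instead of printing them. The sketch below works on plain record strings. The helper name is hypothetical, and it assumes each record is JSON with a top-level "software" entry, as in the tskit provenance schema.

```python
import json


def summarize_records(records):
    """Return (index, size_in_bytes, software_name) for each record string."""
    rows = []
    for index, record in enumerate(records):
        size = len(record.encode("utf-8"))
        try:
            software = json.loads(record).get("software", {}).get("name", "?")
        except (json.JSONDecodeError, AttributeError):
            # Not valid JSON, or not shaped like a provenance record.
            software = "?"
        rows.append((index, size, software))
    return rows


# Toy records for illustration; real ones would come from the tree sequence.
records = [
    '{"software": {"name": "SLiM", "version": "4.0"}, "parameters": {}}',
    "not json",
]
rows = summarize_records(records)

# Print the largest records first.
for row in sorted(rows, key=lambda r: r[1], reverse=True):
    print(row)
```

With tskit, the records could be supplied as `summarize_records(p.record for p in ts.provenances())`, which avoids printing multi-gigabyte strings to the terminal.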

ShyamieG commented 1 year ago

Awesome, thank you! I took a look at one of my smaller files and saw that there is indeed a lot of redundant information. These trees are the result of merging several other trees, so that is part of it. I set record_provenance to False in my tskit.union() call to mitigate this somewhat.

However, the other issue is that my files are being produced as a result of passing a tree sequence from one SLiM script to another dozens to hundreds of times. This also results in a lot of redundant information being stored.

Is it possible to simply delete certain kinds of provenance information entirely? For example, I don't need to store information about the SLiM model or parameters with the ts file. Are there any problems with deleting this kind of information that I should be aware of?

For context, I'm still working on the same general problem that I described in this post some months ago.

jeromekelleher commented 1 year ago

Easiest thing to do is just drop the provenance info entirely by truncating the provenance table. It's unlikely to have any effect on things working, as code shouldn't really be depending on the contents.

The docs here might help https://tskit.dev/tskit/docs/stable/provenance.html

jeromekelleher commented 1 year ago

I'm going to keep this one open @ShyamieG as it is genuinely a bug in tszip. At a minimum we should emit a better error message saying what the problem is.

tskit-dev / tszip

Cannot handle columns > 2GiB #69