Open ShyamieG opened 1 year ago
Very interesting @ShyamieG! Can you provide us with a bit more info please:
tskit info
for one of these ts files.
Hi there, sorry for the delay. Thanks for helping me with this!
Here's the tskit info from your file:
╔════════════════════════╗
║TreeSequence ║
╠═══════════════╤════════╣
║Trees │ 7871║
╟───────────────┼────────╢
║Sequence Length│24214675║
╟───────────────┼────────╢
║Time Units │ ticks║
╟───────────────┼────────╢
║Sample Nodes │ 5098║
╟───────────────┼────────╢
║Total Size │ 3.1 GiB║
╚═══════════════╧════════╝
╔═══════════╤═════╤═════════╤════════════╗
║Table │Rows │Size │Has Metadata║
╠═══════════╪═════╪═════════╪════════════╣
║Edges │37085│ 1.1 MiB│ No║
╟───────────┼─────┼─────────┼────────────╢
║Individuals│ 4958│486.0 KiB│ Yes║
╟───────────┼─────┼─────────┼────────────╢
║Migrations │ 0│ 8 Bytes│ No║
╟───────────┼─────┼─────────┼────────────╢
║Mutations │ 1093│ 65.9 KiB│ Yes║
╟───────────┼─────┼─────────┼────────────╢
║Nodes │ 8152│303.2 KiB│ Yes║
╟───────────┼─────┼─────────┼────────────╢
║Populations│ 4│ 2.4 KiB│ Yes║
╟───────────┼─────┼─────────┼────────────╢
║Provenances│ 232│ 3.1 GiB│ No║
╟───────────┼─────┼─────────┼────────────╢
║Sites │ 680│ 16.0 KiB│ No║
╚═══════════╧═════╧═════════╧════════════╝
That's a heck of a lot of provenance data (3.1 GiB!) - I wonder what's going on there? It's very unlikely this is of any use, so it's worth figuring out where it came from and stopping it from happening.
I think that must be the tszip issue: the codec can't handle columns larger than 2 GiB.
Ah, okay, got it. Any tips on how I can go about figuring out what all of this stuff is?
I guess the first step would be to look at the provenances, e.g.
tskit provenances <file> | less -S
I'm not sure how well it'll deal with having a 2G record though.
Awesome, thank you! I took a look at one of my smaller files and saw that there is indeed a lot of redundant information. These trees are the result of merging several other trees, so that is part of it. I set record_provenance to False in my tskit.union() call to mitigate this somewhat.
However, the other issue is that my files are being produced as a result of passing a tree sequence from one SLiM script to another dozens to hundreds of times. This also results in a lot of redundant information being stored.
Is it possible to simply delete certain kinds of provenance information entirely? For example, I don't need to store information about the SLiM model or parameters with the ts file. Any problems with deleting this kind of information that I should be aware of?
For context, I'm still working on the same general problem that I describe in this post some months ago.
Easiest thing to do is just drop the provenance info entirely by truncating the provenance table. It's unlikely to have any effect on things working, as code shouldn't really be depending on the contents.
The docs here might help https://tskit.dev/tskit/docs/stable/provenance.html
I'm going to keep this one open @ShyamieG as it is genuinely a bug in tszip. At a minimum we should emit a better error message saying what the problem is.
I am experiencing an error when trying to compress certain tree sequence files:
ValueError: Codec does not support buffers of > 2147483647 bytes
It seems that this error originates from some function in the zarr package related to chunking? It occurs with both the Python and command-line versions of tszip.
I can't upload an example here because even the gzipped version of my file is too large (72.3MB).
Any insight into why this is happening and how I might resolve it?