Follow-ups on compression analysis

jeromekelleher commented 5 months ago

I've had a good look at the compression ratio analysis @shz9 and it's great, thanks!

I think the main issue is that we need to abstract out the data presentation a bit, and try to make it easier for readers to see the general trends. Currently, it takes quite a lot of study to draw general conclusions, as there are a lot of very similar plots.

Here are some points, in no particular order:

There's no point in showing plots for arrays that are all of one value (call_genotype_phased or call_SB, say). The story here is clear, bigger chunk sizes give better compression, for obvious reasons. But compression is great anyway, so that wouldn't guide anyone's decision about chunk size choice. Let's just leave these out.
The general trend with chunk size seems to be the same across all the arrays, so maybe we could take a more detailed look at how varying sample chunk size and variant chunk size affects compression for a few representative arrays? Say, call_genotype, call_AB and call_GQ (covering the range of compressibility, I think?)
I think we can abstract the shuffle settings a bit as well - basically, BitShuffle is only going to help when you have very small integers, and most of the bits of an item are zero. ByteShuffle is similar, I guess, helping where most values are of a similar magnitude (call_DP is a good example). Perhaps we could do a simpler standalone figure on the effects of shuffle on the various arrays at a single chunk size?
I'm not sure it's worth getting into the dimension shuffling thing - we'll have to explain what it is, and then also explain that it's not supported. Given that it doesn't have that strong an effect, I suggest we drop it?
The comparison of PackBits and BitShuffle on boolean fields is interesting. I guess it's not surprising that BitShuffle does basically as well, because the 7 unused bits across all the items will be zero, and compress extremely well. The remaining bit that's actually used is essentially what you get when you bit pack anyway. I don't think we need a figure for this, and can just report the number in the text?
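To make the PackBits versus BitShuffle point concrete, here is a minimal standard-library sketch. It is a simplified illustration, not the Blosc or numcodecs implementation: `pack_bits` and `bit_shuffle` are hypothetical helpers, the bit-plane layout is an assumption chosen for clarity, and the boolean data is synthetic random input rather than a real VCF field.

```python
import random
import zlib


def pack_bits(values):
    """Pack 0/1 values into bytes, 8 per byte (length must be a multiple of 8)."""
    out = bytearray()
    for i in range(0, len(values), 8):
        byte = 0
        for v in values[i:i + 8]:
            byte = (byte << 1) | (v & 1)
        out.append(byte)
    return bytes(out)


def bit_shuffle(values):
    """Simplified bit shuffle of 1-byte items: emit bit plane 0, then plane 1, ..."""
    out = bytearray()
    for bit in range(8):
        plane = bytes((v >> bit) & 1 for v in values)
        out += pack_bits(plane)
    return bytes(out)


# Synthetic boolean field stored one byte per item (hypothetical data).
random.seed(42)
flags = bytes(random.getrandbits(1) for _ in range(80_000))

packed = pack_bits(flags)
shuffled = bit_shuffle(flags)

# The first bit plane is exactly the bit-packed data; the other 7 planes are all zero.
print(shuffled[:len(packed)] == packed)
print(shuffled[len(packed):] == bytes(7 * len(packed)))

# Compressed sizes: the zero planes add very little on top of the packed bits.
print(len(zlib.compress(packed)), len(zlib.compress(shuffled)))
```

In this toy layout the shuffled buffer is the packed buffer followed by a long run of zeros, which is why a general-purpose compressor ends up in roughly the same place either way.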

OK, so I suggest two figures:

Effect of bit and byte shuffle on some representative arrays at 10Kx1K chunk size (call_genotype, call_AB, call_GQ and call_DP). I guess a categorical bar plot would be good for this?
Effect of sample and variant chunk size on these arrays using their best shuffle settings, varying sample and variant chunk size in (say) 20 increments. One way to do this would be a two-panel figure in which the x axes are the variant and sample chunk sizes, holding the other chunk size fixed at (say) 1000, and we plot compression ratio as a line.

jeromekelleher commented 5 months ago

I added the first of these figures in #109 based on the existing data:

[Figure: tmp]

jeromekelleher commented 5 months ago

For the second plot, here's a rough version based on current data:

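import matplotlib.pyplot as plt

# df_sub is assumed to be the DataFrame of compression results loaded earlier.
fig, axes = plt.subplots(1, 2)

# Shuffle setting used for each array (presumably Blosc codes: 0 = none, 1 = byte, 2 = bit).
array_shuffle = {'call_GQ': 0, 'call_DP': 1, 'call_AD': 1, 'call_AB': 0, 'call_genotype': 2}

for array, shuffle in array_shuffle.items():
    # Left panel: vary the sample chunk size at a fixed variant chunk size.
    dfs = df_sub[(df_sub.ArrayName == array) & (df_sub.shuffle == shuffle) & (df_sub.variant_chunksize == 10000)]
    axes[0].plot(dfs.sample_chunksize, dfs.CompressionRatio, label=array)
    # Right panel: vary the variant chunk size at a fixed sample chunk size.
    dfs = df_sub[(df_sub.ArrayName == array) & (df_sub.shuffle == shuffle) & (df_sub.sample_chunksize == 1000)]
    axes[1].plot(dfs.variant_chunksize, dfs.CompressionRatio, label=array)
axes[0].legend()
axes[0].set_xlabel("Sample chunk size")
axes[1].set_xlabel("Variant chunk size")
axes[0].set_ylabel("Compression ratio")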

[Figure: Screenshot from 2024-05-09 13-30-55]

So, basically we just want to fill in some points on the x axes here, holding the other chunk size fixed at the vcf2zarr default. The idea is to give some intuition for the effect that varying these two chunk sizes has on the compression of various fields.

It's probably simplest if we do a separate command to collect this data and store it in its own CSV.
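As a rough idea of what such a collection step could look like, here is a standard-library sketch that sweeps one chunk size at a time while holding the other fixed, and writes the results as CSV. Everything here is an assumption for illustration: `compression_ratio` and `sweep` are hypothetical helpers, zlib stands in for the real codec, the matrix is small synthetic data rather than a Zarr array, and the chunk sizes are scaled down to match. Only the column names are taken from the plotting snippet above.

```python
import csv
import io
import random
import zlib

FIELDS = ["ArrayName", "variant_chunksize", "sample_chunksize", "CompressionRatio"]


def compression_ratio(matrix, variant_chunk, sample_chunk):
    """Compress a (variants x samples) byte matrix chunk by chunk; return raw/stored size."""
    raw_bytes = 0
    stored_bytes = 0
    for v0 in range(0, len(matrix), variant_chunk):
        rows = matrix[v0:v0 + variant_chunk]
        for s0 in range(0, len(matrix[0]), sample_chunk):
            chunk = b"".join(row[s0:s0 + sample_chunk] for row in rows)
            raw_bytes += len(chunk)
            stored_bytes += len(zlib.compress(chunk))
    return raw_bytes / stored_bytes


def sweep(name, matrix, variant_sizes, sample_sizes, fixed_variant, fixed_sample):
    """Vary sample chunk size at a fixed variant chunk size, then the reverse."""
    settings = [(fixed_variant, s) for s in sample_sizes]
    settings += [(v, fixed_sample) for v in variant_sizes]
    return [
        {
            "ArrayName": name,
            "variant_chunksize": v,
            "sample_chunksize": s,
            "CompressionRatio": compression_ratio(matrix, v, s),
        }
        for v, s in settings
    ]


# Synthetic sparse matrix: 400 variants x 200 samples, about 5% non-zero entries.
random.seed(1)
matrix = [
    bytes(1 if random.random() < 0.05 else 0 for _ in range(200))
    for _ in range(400)
]

records = sweep(
    "synthetic_sparse",
    matrix,
    variant_sizes=[50, 100, 200, 400],
    sample_sizes=[25, 50, 100, 200],
    fixed_variant=100,
    fixed_sample=100,
)

# Written to an in-memory buffer here; a real command would open a file instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()
print(csv_text)
```

Keeping the output in this long format (one row per array and chunk-size pair) means the two-panel plot can filter on the fixed chunk size in the same way the rough version does.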

jeromekelleher commented 5 months ago

@shz9 do you think you could update the code to get data for this figure? I think it's quite straightforward, just a case of simplifying the current analysis really.

shz9 commented 5 months ago

Yes, sounds good. I agree with most of your points and I will try to push a simplified analysis and figures in the next couple of days.

sgkit-dev / vcf-zarr-publication

Follow-ups on compression analysis #108