sgkit-dev / vcf-zarr-publication

Manuscript and associated scripts for vcf-zarr publication

Follow-ups on compression analysis #108

Closed by jeromekelleher 4 months ago

jeromekelleher commented 5 months ago

I've had a good look at the compression ratio analysis @shz9 and it's great, thanks!

I think the main issue is that we need to abstract out the data presentation a bit, and try to make it easier for readers to see the general trends. Currently, it takes quite a lot of study to draw general conclusions, as there are a lot of very similar plots.

Here are some points, in no particular order:

OK, so I suggest two figures:

jeromekelleher commented 5 months ago

I added the first of these figures in #109 based on the existing data:

[image: tmp]

jeromekelleher commented 5 months ago

For the second plot, here's a rough version based on current data:

import matplotlib.pyplot as plt

# df_sub is the DataFrame of compression results from the existing analysis.
fig, axes = plt.subplots(1, 2)

# Shuffle setting selected for each array.
array_shuffle = {'call_GQ': 0, 'call_DP': 1, 'call_AD': 1, 'call_AB': 0, 'call_genotype': 2}

for array, shuffle in array_shuffle.items():
    # Left panel: vary the sample chunk size, variant chunk size fixed at 10000.
    dfs = df_sub[(df_sub.ArrayName == array) & (df_sub.shuffle == shuffle) & (df_sub.variant_chunksize == 10000)]
    axes[0].plot(dfs.sample_chunksize, dfs.CompressionRatio, label=array)
    # Right panel: vary the variant chunk size, sample chunk size fixed at 1000.
    dfs = df_sub[(df_sub.ArrayName == array) & (df_sub.shuffle == shuffle) & (df_sub.sample_chunksize == 1000)]
    axes[1].plot(dfs.variant_chunksize, dfs.CompressionRatio, label=array)
axes[0].legend()
axes[0].set_xlabel("Sample chunk size")
axes[1].set_xlabel("Variant chunk size")
axes[0].set_ylabel("Compression ratio")

[image: Screenshot from 2024-05-09 13-30-55]

So, basically we just want to fill in some points on the x axes here, holding the other chunk size fixed at the vcf2zarr default. The idea is to give some intuition for the effect that varying these two chunk sizes has on the compression of various fields.

It's probably simplest if we do a separate command to collect this data and store it in its own CSV.
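For illustration only, here is a minimal sketch of what such a collection step might look like. It is not the repository's code and makes several assumptions: zlib from the standard library stands in for the actual Zarr codecs, the arrays are toy one-byte-per-cell buffers rather than real VCF fields, and the function names (`compression_ratio`, `sweep`, `write_csv`), the array names, and the scaled-down default chunk sizes are all hypothetical. The CSV column names follow those used in the plotting snippet above.

```python
import csv
import random
import sys
import zlib


def compression_ratio(data, n_variants, n_samples, variant_chunk, sample_chunk):
    """Raw/compressed byte ratio for a row-major (variants x samples) buffer
    with one byte per cell, compressed one chunk at a time."""
    raw = compressed = 0
    for v0 in range(0, n_variants, variant_chunk):
        v1 = min(v0 + variant_chunk, n_variants)
        for s0 in range(0, n_samples, sample_chunk):
            s1 = min(s0 + sample_chunk, n_samples)
            # Gather the cells belonging to this chunk, row by row.
            chunk = b"".join(
                data[v * n_samples + s0 : v * n_samples + s1] for v in range(v0, v1)
            )
            raw += len(chunk)
            # Assumption: zlib is only a stand-in for the real codec.
            compressed += len(zlib.compress(chunk))
    return raw / compressed


def sweep(arrays, shape, defaults, variant_chunks, sample_chunks):
    """Vary one chunk size at a time, holding the other at its default."""
    n_variants, n_samples = shape
    default_variant, default_sample = defaults
    # A set avoids measuring the (default, default) point twice.
    settings = {(default_variant, s) for s in sample_chunks}
    settings |= {(v, default_sample) for v in variant_chunks}
    rows = []
    for name, data in arrays.items():
        for v, s in sorted(settings):
            rows.append({
                "ArrayName": name,
                "variant_chunksize": v,
                "sample_chunksize": s,
                "CompressionRatio": compression_ratio(data, n_variants, n_samples, v, s),
            })
    return rows


def write_csv(rows, fh):
    """Write the sweep results to an open text file handle."""
    writer = csv.DictWriter(fh, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    rng = random.Random(42)
    shape = (200, 100)  # toy dimensions, far smaller than real data
    n_cells = shape[0] * shape[1]
    arrays = {  # hypothetical toy arrays, not real VCF fields
        "toy_skewed": bytes(rng.choice((0, 0, 0, 1)) for _ in range(n_cells)),
        "toy_uniform": bytes(rng.randrange(256) for _ in range(n_cells)),
    }
    rows = sweep(arrays, shape, defaults=(100, 50),
                 variant_chunks=[20, 100, 200], sample_chunks=[10, 50, 100])
    write_csv(rows, sys.stdout)
```

Passing an open file handle instead of `sys.stdout` would store the results in their own CSV, which could then be loaded and filtered in the same way as `df_sub` in the plotting snippet.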

jeromekelleher commented 5 months ago

@shz9 do you think you could update the code to get data for this figure? I think it's quite straightforward, just a case of simplifying the current analysis really.

shz9 commented 5 months ago

Yes, sounds good. I agree with most of your points, and I will try to push a simplified analysis and figures in the next couple of days.