**Closed** · jeromekelleher closed this 4 months ago
I added the first of these figures in #109 based on the existing data:
For the second plot, here's a rough version based on current data:
```python
import matplotlib.pyplot as plt

# df_sub is the benchmark-results DataFrame loaded earlier in the analysis.
fig, axes = plt.subplots(1, 2)
# Shuffle setting to use for each field (Blosc: 0 = none, 1 = byte shuffle, 2 = bit shuffle)
array_shuffle = {"call_GQ": 0, "call_DP": 1, "call_AD": 1, "call_AB": 0, "call_genotype": 2}
for array, shuffle in array_shuffle.items():
    # Vary sample chunk size, holding variant chunk size fixed
    dfs = df_sub[(df_sub.ArrayName == array) & (df_sub.shuffle == shuffle)
                 & (df_sub.variant_chunksize == 10000)]
    axes[0].plot(dfs.sample_chunksize, dfs.CompressionRatio, label=array)
    # Vary variant chunk size, holding sample chunk size fixed
    dfs = df_sub[(df_sub.ArrayName == array) & (df_sub.shuffle == shuffle)
                 & (df_sub.sample_chunksize == 1000)]
    axes[1].plot(dfs.variant_chunksize, dfs.CompressionRatio, label=array)
axes[0].legend()
axes[0].set_xlabel("Sample chunk size")
axes[1].set_xlabel("Variant chunk size")
axes[0].set_ylabel("Compression ratio");
```
So, basically we just want to fill in some points on the x-axes here, holding the other chunk size fixed at the vcf2zarr default. The idea is to give some intuition for the effect that varying these two chunk sizes has on the compression of the various fields.
It's probably simplest if we do a separate command to collect this data and store it in its own CSV.
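A minimal sketch of what such a collection command could look like. This is entirely hypothetical (the names `compression_ratio`, the field list, and the chunk-size grids are illustrative, and `zlib` on synthetic data stands in for the real Zarr compressor settings); the real command would sweep the actual vcf2zarr arrays, but the shape of the output CSV is the point:

```python
import csv
import zlib

import numpy as np


def compression_ratio(data: np.ndarray, variant_chunk: int, sample_chunk: int) -> float:
    """Compress each (variant_chunk x sample_chunk) block separately and
    return total raw bytes / total compressed bytes."""
    compressed = 0
    for i in range(0, data.shape[0], variant_chunk):
        for j in range(0, data.shape[1], sample_chunk):
            block = np.ascontiguousarray(data[i:i + variant_chunk, j:j + sample_chunk])
            compressed += len(zlib.compress(block.tobytes()))
    return data.nbytes / compressed


# Synthetic stand-in for one field; the real command would read the Zarr arrays.
rng = np.random.default_rng(42)
arrays = {"call_DP": rng.poisson(30, size=(2000, 500)).astype(np.int32)}

with open("chunksize_compression.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["ArrayName", "variant_chunksize", "sample_chunksize", "CompressionRatio"])
    for name, data in arrays.items():
        # Sweep sample chunk size with variant chunk size held at the default
        for sample_chunk in [10, 100, 500]:
            writer.writerow([name, 10000, sample_chunk,
                             compression_ratio(data, 10000, sample_chunk)])
        # Sweep variant chunk size with sample chunk size held fixed
        for variant_chunk in [100, 1000, 2000]:
            writer.writerow([name, variant_chunk, 1000,
                             compression_ratio(data, variant_chunk, 1000)])
```

The CSV then has exactly the columns the plotting snippet above filters on (`ArrayName`, `variant_chunksize`, `sample_chunksize`, `CompressionRatio`), so it can be loaded directly with pandas.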
@shz9 do you think you could update the code to get data for this figure? I think it's quite straightforward, just a case of simplifying the current analysis really.
Yes, sounds good. I agree with most of your points and I will try to push a simplified analysis and figures in the next couple of days.
I've had a good look at the compression ratio analysis @shz9 and it's great, thanks!
I think the main issue is that we need to abstract out the data presentation a bit, to make it easier for readers to see the general trends. Currently it takes quite a lot of study to extract general conclusions, as there are a lot of very similar plots.
Here are some points, in no particular order:
OK, so I suggest two figures: