Closed: omerb01 closed this issue 4 years ago.
@omerb01 That value for `total_n_mz` is close to the value I'd expect. The .ibd file holds 16 bytes per entry (a 64-bit mz and a 64-bit int), so I'd expect `total_n_mz` to be approximately 50GB / 16B = 3,125,000,000.

`segm_n` is also close to the value I'd expect - the dataset should become ~50% bigger after being unpacked, and 869 * 100MB = 86.9GB.
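For reference, a quick back-of-the-envelope check of these numbers (a sketch only, using decimal units and the 50GB file size, ~50% unpacking overhead and 100MB segment size quoted above):

```python
# Rough sanity check of the expected values discussed above (decimal units).
ibd_size_b = 50e9                   # ~50 GB .ibd file
bytes_per_entry = 16                # 64-bit mz + 64-bit intensity per entry
total_n_mz = ibd_size_b / bytes_per_entry
print(f'expected total_n_mz ~ {total_n_mz:,.0f}')    # 3,125,000,000

ds_segm_size_mb = 100                                # target segment size
unpacked_size_mb = (ibd_size_b / 1e6) * 1.5          # dataset grows ~50% when unpacked
print(f'expected segm_n ~ {unpacked_size_mb / ds_segm_size_mb:.0f}')  # ~750; 869 was observed (86.9 GB)
```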
@LachlanStuart I found the bug ;) will post a PR soon
Regarding https://gist.github.com/LachlanStuart/8f79fd87f0783d55e32fd3dde52fa318 - I suspect that `define_ds_segments()` doesn't work as it should on the `huge4` dataset.

https://github.com/metaspace2020/pywren-annotation-pipeline/blob/4afab539a71ae1fa83026cb6d270ab4a33b93fcb/annotation_pipeline/segment.py#L84

At this stage we approximate the number of mz values by sampling the dataset. Specifically, for the `huge4` dataset, `define_ds_segments()` calculates:

I think that `total_n_mz` should be greater, and then each segment would be smaller, as it should be - with all other datasets, all sub-segments (those generated by the "first" segmentation mechanism) sum up to 100MB (as defined in `ds_segm_size_mb`), while here we observe that all sub-segments sum up to ~2.5GB.

@LachlanStuart can you confirm that `total_n_mz` should be greater?
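To illustrate the suspected failure mode, here is a minimal sketch (not the actual `define_ds_segments()` implementation; all names and numbers below are hypothetical) of how an underestimated `total_n_mz` from sampling leads to oversized segments: the segment count is derived from the estimate, but the real data still has to fit into those segments.

```python
import numpy as np

def estimate_total_n_mz(sampled_n_mz_per_spectrum, total_n_spectra):
    # Extrapolate from the sampled spectra to the whole dataset.
    return int(np.mean(sampled_n_mz_per_spectrum) * total_n_spectra)

def plan_segm_n(total_n_mz, ds_segm_size_mb=100, bytes_per_mz=16):
    # Number of segments needed so that each holds ~ds_segm_size_mb of data.
    return max(1, round(total_n_mz * bytes_per_mz / (ds_segm_size_mb * 2**20)))

true_total_n_mz = 3_000_000_000    # ~3e9 mz values actually in the dataset
total_n_spectra = 1_000_000        # hypothetical spectrum count
sampled = [120] * 200              # sample misses the dense spectra (~25x too "light")

est_total_n_mz = estimate_total_n_mz(sampled, total_n_spectra)  # 120,000,000 - far too small
segm_n = plan_segm_n(est_total_n_mz)                            # only 18 segments planned
mb_per_segm = true_total_n_mz / segm_n * 16 / 2**20             # ~2,500 MB per segment instead of 100
print(est_total_n_mz, segm_n, round(mb_per_segm))
```

If the sample were representative, the planned segment count would scale with the true `total_n_mz` and each segment would land near `ds_segm_size_mb`, which matches the behaviour seen on the other datasets.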