metaspace2020 / Lithops-METASPACE

Lithops-based Serverless implementation of the METASPACE spatial metabolomics annotation pipeline

define_ds_segments() on huge4 (issue #58)

Closed · omerb01 closed this issue 4 years ago

omerb01 commented 4 years ago

Regarding https://gist.github.com/LachlanStuart/8f79fd87f0783d55e32fd3dde52fa318 - I suspect that define_ds_segments() doesn't work as it should on the huge4 dataset. https://github.com/metaspace2020/pywren-annotation-pipeline/blob/4afab539a71ae1fa83026cb6d270ab4a33b93fcb/annotation_pipeline/segment.py#L84

At this stage we approximate the total number of mz values by sampling the dataset. Specifically, on the huge4 dataset, define_ds_segments() calculates:

```
ds_segm_size_mb = 100
total_n_mz = 3799052080
segm_n = 869
```

I think that total_n_mz should be greater, so that each segment comes out smaller, as it should. With all other datasets, the sub-segments (generated by the "first" segmentation mechanism) sum up to 100MB (as defined by ds_segm_size_mb), but here we observe that they sum up to ~2.5GB.
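For context, here is a minimal sketch of the sampling-based estimate described above. The function name, sample size, and parser calls are illustrative assumptions, not the pipeline's actual code (which lives in annotation_pipeline/segment.py):

```python
import numpy as np
from pyimzml.ImzMLParser import ImzMLParser

def estimate_total_n_mz(imzml_parser: ImzMLParser, sample_n: int = 1000) -> float:
    """Extrapolate the dataset-wide mz count from a random sample of spectra
    (illustrative sketch). If the sample under-represents dense spectra,
    total_n_mz is underestimated and every segment comes out oversized."""
    sp_n = len(imzml_parser.coordinates)
    sample_inds = np.random.choice(sp_n, min(sample_n, sp_n), replace=False)
    sample_n_mz = sum(len(imzml_parser.getspectrum(int(i))[0]) for i in sample_inds)
    return sample_n_mz * sp_n / len(sample_inds)
```

segm_n is then derived from that estimate (roughly total_n_mz * row size / ds_segm_size_mb), so any underestimate of total_n_mz directly inflates the real size of every segment.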

@LachlanStuart can you confirm that total_n_mz should be greater?

LachlanStuart commented 4 years ago

@omerb01 That value for total_n_mz is close to the value I'd expect. The .ibd file holds 16 bytes per entry (a 64-bit mz and a 64-bit int), so I'd expect total_n_mz to be approx 50GB / 16B = 3,125,000,000.

segm_n is also close to the value I'd expect - the dataset should become ~50% bigger after being unpacked, and 869 * 100MB = 86.9GB.
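Spelling both checks out (the 24-byte unpacked row - sp_idx, mz, intensity as float64 - is an assumption, consistent with the ~50% growth over the 16-byte .ibd entries):

```python
# 16 bytes per .ibd entry: a 64-bit mz plus a 64-bit intensity
ibd_bytes = 50 * 10**9
print(ibd_bytes // (8 + 8))          # 3125000000 expected mz values

# assumed 24-byte unpacked row: sp_idx, mz, intensity as float64
total_n_mz = 3_799_052_080
unpacked_mib = total_n_mz * 24 / 2**20
print(int(unpacked_mib // 100))      # 869 segments of ~100MB each
```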

omerb01 commented 4 years ago

@LachlanStuart I found the bug ;) I'll post a PR soon.