Alternative approach for dataset segmentation

metaspace2020 / Lithops-METASPACE

Lithops-based Serverless implementation of the METASPACE spatial metabolomics annotation pipeline

12 stars 4 forks source link

Alternative approach for dataset segmentation #68

Closed omerb01 closed 4 years ago

omerb01 commented 4 years ago

I added a script that can run locally and separately from the project. the new approach is intended to replace the current segmentation approach with Cloud Functions. I would like to test this script on a virtual machine, thus I need a review that everything is correct and up to date.

the script is hard coded configured to segment the "small" dataset into segments of 5MB each, and it applies the following:

downloading imzML and ibd files
defining segments bounds
segmenting each dataset chunk
uploading final segments

gilv commented 4 years ago

@LachlanStuart this suppose to be your original sort + new improvements you added recently. We will use this code to deploy in VM to perform sort. Can you please review this code is correct one and we didn't missed anything? We need to be sure it's correct one for comparison

omerb01 commented 4 years ago

@LachlanStuart Thanks

LachlanStuart commented 4 years ago

I just wanted to add: regarding "up to date", it matches the serverless implementation. However, there's an optimization in the Serverful implementation to reduce the size of segments by ~33% by using a DataFrame instead of ndarray so that each column can have a different data format: https://github.com/metaspace2020/metaspace/blob/master/metaspace/engine/sm/engine/msm_basic/segmenter.py#L107-L116

I wasn't able to add this to the serverless implementation as there were too many places that needed to be changed to get it to work due to the many phases of dataset sorting.

This will be tested with the serverless annotate functions, right? If so, I think we should wait for the full pipeline to be ready before trying to implement it. It's a relatively small optimization (reduces cost by ~4%), and it would require the segment sizes to be re-tuned because having the rows more tightly packed in each segment would mean fewer segments & fewer parallel invocations.

omerb01 commented 4 years ago

@LachlanStuart yes, it is intended to be tested with the other serverless functions. just for clarification - does this optimisation require to change the logic of other parts? if so, I agree with you that we should wait with this optimisation.

LachlanStuart commented 4 years ago

@omerb01 Yes. It requires changes everywhere in code that uses ds_segms_cobjects, as the syntax for accessing DataFrames is different to ndarrays.

gilv commented 4 years ago

@LachlanStuart so we compare serverfull which is about 30% more optimized with the new code to serverlss code which is not optimized?

LachlanStuart commented 4 years ago

@gilv If we want to benchmark this vs Serverful and Serverless, we can use the following hack to make the size of the workloads match. However, this is not a full implementation - the benchmark timings will be valid for comparison, but the pipeline's output data will be invalid because the results will be much less precise.

After this line, add another line:

DECOY_ADDUCTS = DECOY_ADDUCTS[::4]

A full implementation requires that for each formula in the database, 20 out of the 80 DECOY_ADDUCTS are randomly selected. Because each formula gets a different random selection of adducts, it's necessary to pre-calculate them while building the databases, and keep that data until the FDR stage so that the FDR knows which adducts were selected for each formula. This change would be much easier if the DB creation & segmentation wasn't split over many functions.

gilv commented 4 years ago

@LachlanStuart @omerb01 but now, when we leverage also VM, then we don't split db creation and segmentation across many functions, right? Or i missing something obvious?

LachlanStuart commented 4 years ago

@gilv Correct, and it will be much easier for me to implement that change once the molecular DB segmentation code is converted. However so this PR only converts the DS code, not the DB code, to use VMs.

LachlanStuart commented 4 years ago

@gilv Oh, did you mean the 33%-smaller DataFrame change? Sorry, I thought you were talking about the DECOY_ADDUCTS change which makes a much bigger difference in the benchmark. The 33%-smaller DataFrame change makes a very small difference in overall speed & cost - approximately 4%.

Yes, it's much easier to make that change with this code, but it also requires changes to the other pipeline stages in a way that would be incompatible with the existing code. We could do that, but IMO it's not worth it until we've finished the VM-based pipeline.