metaspace2020 / Lithops-METASPACE

Lithops-based Serverless implementation of the METASPACE spatial metabolomics annotation pipeline

Reduce memory usage in process_centr_segment #49

Closed: LachlanStuart closed this pull request 4 years ago

LachlanStuart commented 4 years ago

See individual commit messages.

I still need to verify that the results are valid. They're pretty close, but I have to investigate why a few discrepancies are reported.

gilv commented 4 years ago

@omerb01 do we still need this one?

omerb01 commented 4 years ago

@gilv yes, to get more precise control over the annotate method's memory usage

LachlanStuart commented 4 years ago

@omerb01 I believe the changes in this PR aren't compatible with the changes from your PR. As I described in Slack, this PR requires centr_df to be iterated in mz order so that dataset segments can be lazily loaded, whereas your PR changed it to be iterated in formula_i order so that images are produced in complete sets and don't need to be kept in memory for long.
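To make the conflict concrete, here's a minimal sketch of the mz-order strategy this PR relies on. load_ds_segment, process_peak, and the column names are illustrative stand-ins for the real pipeline code, not its actual API:

```python
import numpy as np
import pandas as pd

def process_in_mz_order(centr_df: pd.DataFrame, segment_upper_mzs: np.ndarray,
                        load_ds_segment, process_peak):
    """Iterate centroids sorted by mz so that each dataset segment is loaded
    lazily, used while the centroids that need it stream past, then freed."""
    current_seg_i, current_seg = None, None
    for row in centr_df.sort_values('mz').itertuples():
        # Find which segment this peak's mz falls into (upper bounds are sorted)
        seg_i = int(np.searchsorted(segment_upper_mzs, row.mz))
        if seg_i != current_seg_i:
            # Only one segment is held at a time; the previous one is freed
            current_seg_i, current_seg = seg_i, load_ds_segment(seg_i)
        process_peak(row, current_seg)
```

Iterating in formula_i order instead breaks the invariant that segment loads are monotonic in mz, which is exactly the incompatibility described above.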

If you feel there's value in keeping this implementation, I'll make an alternate function for it and add a keyword param for switching between the two strategies. However, I believe that the worst-case memory usage of your implementation should still fit in 2GB of memory for the 60GB dataset, so I'm not sure if it's necessary...

omerb01 commented 4 years ago

@LachlanStuart I would like to ensure that no dataset, of any size, can cause an OOM in the future. Do you think it's possible to iterate over dataset segments and still collect all of the images related to a specific formula each time? I sorted the database by formula_i only to group each formula's images together; the order itself doesn't matter. A leading question: can images related to the same formula come from different dataset segments?

LachlanStuart commented 4 years ago

@omerb01 If we look at the huge4 dataset, the dataset segments are split like this:

| segment # | lowest mz | highest mz |
|---|---|---|
| 0 | 198.010676 | 200.958094 |
| 1 | 200.958094 | 201.091571 |
| 2 | 201.091571 | 202.094092 |
| 3 | 202.094092 | 202.990456 |
| 4 | 202.990456 | 203.087088 |
| 5 | 203.087088 | 203.980149 |
| 6 | 203.980149 | 204.434332 |
| 7 | 204.434332 | 204.989594 |
| 8 | 204.989594 | 205.083574 |
| 9 | 205.083574 | 205.880832 |
| 10 | 205.880832 | 206.001647 |
| 11 | 206.001647 | 206.982079 |

A typical formula has 4 peaks, which are distributed something like this when sorted by the mz of peak 0:

| formula_i | peak 0 mz | peak 1 mz | peak 2 mz | peak 3 mz |
|---|---|---|---|---|
| 237106 | 200.006899 | 201.010312 | 202.004920 | 203.008290 |
| 145669 | 200.006987 | 201.006541 | 201.010394 | 202.003257 |
| 176945 | 200.007449 | 202.005614 | 204.004776 | 206.004942 |
| 620657 | 200.008169 | 202.003564 | 203.006799 | 204.007728 |

Or, mapping them to segment numbers:

| formula_i | peak 0 segment | peak 1 segment | peak 2 segment | peak 3 segment |
|---|---|---|---|---|
| 237106 | 0 | 1 | 2 | 4 |
| 145669 | 0 | 1 | 2 | 2 |
| 176945 | 0 | 2 | 6 | 11 |
| 620657 | 0 | 2 | 4 | 6 |
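As a small illustration (not project code), this mapping from peak mz to segment number can be reproduced from the upper bounds in the first table:

```python
import numpy as np

# Upper mz bound of each huge4 dataset segment, from the table above
upper_bounds = np.array([
    200.958094, 201.091571, 202.094092, 202.990456, 203.087088, 203.980149,
    204.434332, 204.989594, 205.083574, 205.880832, 206.001647, 206.982079,
])

# The peaks of formula 237106, from the peaks table above
peak_mzs = np.array([200.006899, 201.010312, 202.004920, 203.008290])
print(np.searchsorted(upper_bounds, peak_mzs))  # -> [0 1 2 4]
```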

To evaluate all peaks for one formula, it's necessary to load up to 4 separate dataset segments (technically up to 8, because sometimes the mz +/- 3ppm range will sit on the border between two segments). We could spend some time optimizing the order in which formulas are iterated so that, e.g., all formulas needing segments 0,1,2,4 go in one batch and all formulas needing 0,2,6,11 in another, but this would increase the I/O significantly, because segments would have to be unloaded before they've finished being used. With the above formulas, for example, segments 0 and 2 are needed by every batch, so they would be loaded and dropped repeatedly.
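A rough, illustrative way to see that cost (the numbers come straight from the tables above; nothing here is pipeline code):

```python
# Group formulas by the exact set of segments their peaks need, and count
# how many segment loads that batching strategy would cause.
formula_segments = {
    237106: {0, 1, 2, 4},
    145669: {0, 1, 2},
    176945: {0, 2, 6, 11},
    620657: {0, 2, 4, 6},
}
batches = {}
for formula_i, segs in formula_segments.items():
    batches.setdefault(frozenset(segs), []).append(formula_i)

total_loads = sum(len(segs) for segs in batches)
print(f'{len(batches)} batches, {total_loads} segment loads '
      f'for only {len(set().union(*formula_segments.values()))} distinct segments')
# -> 4 batches, 15 segment loads for only 6 distinct segments
# Segments 0 and 2 are re-read by every batch, which is the excess I/O.
```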

I feel a lot of time could be spent trying to optimize the processing order, but it would still be slow due to the excess I/O.

There's also a third option: capture images in mz order and save them to COS in chunks based on formula_i, then calculate the metrics in a separate step that doesn't load the dataset segments at all. Although this would scale to much bigger datasets than either approach discussed so far, it would probably also be slower due to the extra I/O.
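A rough sketch of that two-phase idea, with hypothetical names throughout (cos_client stands in for whatever COS wrapper the pipeline uses, and chunking by formula_i // 1000 is an arbitrary choice):

```python
import pickle
from collections import defaultdict

def capture_images_mz_order(centr_df, make_image, cos_client, bucket):
    """Phase 1: generate images in mz order (cheap segment loading) and save
    them to COS in chunks keyed by formula_i range. This toy version buffers
    every chunk in memory; a real one would flush partial chunks as memory
    fills and merge them later, which is the extra I/O mentioned above."""
    chunks = defaultdict(dict)  # chunk id -> {(formula_i, peak_i): image}
    for row in centr_df.sort_values('mz').itertuples():
        chunks[row.formula_i // 1000][(row.formula_i, row.peak_i)] = make_image(row)
    for chunk_id, images in chunks.items():
        cos_client.put_object(Bucket=bucket, Key=f'image_chunks/{chunk_id}',
                              Body=pickle.dumps(images))

def compute_metrics_for_chunk(cos_client, bucket, chunk_id, metrics_fn):
    """Phase 2: read one chunk back (all peaks of each formula sit together)
    and compute metrics without loading any dataset segment."""
    raw = cos_client.get_object(Bucket=bucket,
                                Key=f'image_chunks/{chunk_id}')['Body'].read()
    images = pickle.loads(raw)
    formulas = {formula_i for formula_i, _ in images}
    return {f: metrics_fn([img for (fi, _), img in images.items() if fi == f])
            for f in formulas}
```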

omerb01 commented 4 years ago

Due to the complexity of reaching this target, I think we can rely on #54 and close this one.