@omerb01 we still need this one?
@gilv yes, to get more precise control over the `annotate` method's memory usage
@omerb01 I believe the changes in this PR aren't compatible with the changes from your PR. As I described in Slack, this PR requires `centr_df` to be iterated in `mz` order so that dataset segments can be lazily loaded, whereas your PR changed it to be iterated in `formula_i` order so that images are produced in full sets and don't need to be kept in memory for very long.
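For context, here's a minimal sketch of the two iteration strategies being compared. It assumes `centr_df` is a pandas DataFrame with `formula_i`, `peak_i` and `mz` columns; the sample rows are illustrative, not the real centroid data:

```python
import pandas as pd

# Hypothetical centroids table; the real centr_df has more columns.
centr_df = pd.DataFrame({
    "formula_i": [237106, 237106, 145669, 145669],
    "peak_i":    [0, 1, 0, 1],
    "mz":        [200.006899, 201.010312, 200.006987, 201.006541],
})

# Strategy in this PR: iterate centroids in mz order, so each dataset segment
# (a contiguous mz window) can be loaded lazily and released once iteration
# moves past its upper mz bound.
for row in centr_df.sort_values("mz").itertuples(index=False):
    pass  # match row.mz against the currently loaded dataset segment

# Strategy in the other PR: iterate in formula_i order, so all of a formula's
# peaks arrive together and its images can be finalised immediately, at the
# cost of needing several dataset segments resident at once.
for formula_i, peaks in centr_df.sort_values("formula_i").groupby("formula_i", sort=False):
    pass  # all of this formula's peaks are in `peaks`
```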
If you feel there's value in keeping this implementation, I'll make an alternate function for it and add a keyword param for switching between the two strategies. However, I believe that the worst-case memory usage of your implementation should still fit in 2GB of memory for the 60GB dataset, so I'm not sure if it's necessary...
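If we do end up keeping both, the switch could be as simple as something like this (the function and parameter names are hypothetical, not the actual PR API):

```python
def iter_centroids(centr_df, order="by_mz"):
    """Yield centroid rows in the chosen processing order (illustrative only)."""
    if order == "by_mz":         # lazy dataset-segment loading (this PR)
        key = "mz"
    elif order == "by_formula":  # full image sets per formula (the other PR)
        key = "formula_i"
    else:
        raise ValueError(f"unknown order: {order!r}")
    return centr_df.sort_values(key).itertuples(index=False)
```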
@LachlanStuart I would like any dataset of any size not to cause an OOM in the future.
Do you think it isn't possible to iterate over dataset segments and also get all the images related to a specific formula each time?
I sorted the database by `formula_i` only to collect each formula's images together; the order itself doesn't matter.
A leading question: can we find images related to the same formula in different dataset segments?
@omerb01 If we look at the `huge4` dataset, the dataset segments are split like this:
segment # | lowest mz | highest mz |
---|---|---|
0 | 198.010676 | 200.958094 |
1 | 200.958094 | 201.091571 |
2 | 201.091571 | 202.094092 |
3 | 202.094092 | 202.990456 |
4 | 202.990456 | 203.087088 |
5 | 203.087088 | 203.980149 |
6 | 203.980149 | 204.434332 |
7 | 204.434332 | 204.989594 |
8 | 204.989594 | 205.083574 |
9 | 205.083574 | 205.880832 |
10 | 205.880832 | 206.001647 |
11 | 206.001647 | 206.982079 |
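Since each segment covers a contiguous mz window, finding the segment for a given mz is just a binary search over the lower bounds. A small sketch using the bounds from the table above (`segment_for_mz` is a hypothetical helper, not code from either PR):

```python
import bisect

# Lower mz bounds of the huge4 dataset segments from the table above.
segment_lower_bounds = [
    198.010676, 200.958094, 201.091571, 202.094092, 202.990456, 203.087088,
    203.980149, 204.434332, 204.989594, 205.083574, 205.880832, 206.001647,
]

def segment_for_mz(mz):
    """Return the index of the dataset segment whose mz window contains `mz`."""
    return bisect.bisect_right(segment_lower_bounds, mz) - 1

print(segment_for_mz(201.010312))  # -> 1
```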
A typical formula will have 4 peaks, which are distributed something like this when sorted by the mz of peak 0:
formula_i | peak 0 mz | peak 1 mz | peak 2 mz | peak 3 mz |
---|---|---|---|---|
237106 | 200.006899 | 201.010312 | 202.004920 | 203.008290 |
145669 | 200.006987 | 201.006541 | 201.010394 | 202.003257 |
176945 | 200.007449 | 202.005614 | 204.004776 | 206.004942 |
620657 | 200.008169 | 202.003564 | 203.006799 | 204.007728 |
Or if you map them to segment numbers:
formula_i | peak 0 segment | peak 1 segment | peak 2 segment | peak 3 segment |
---|---|---|---|---|
237106 | 0 | 1 | 2 | 4 |
145669 | 0 | 1 | 2 | 2 |
176945 | 0 | 2 | 6 | 11 |
620657 | 0 | 2 | 4 | 6 |
To evaluate all peaks for one formula, it's necessary to load up to 4 separate dataset segments (technically up to 8, because sometimes the mz ±3ppm range will sit on the border between two segments; see the sketch after this list). Although we could spend some time optimizing the order in which formulas are iterated so that we batch together all formulas needing e.g. segments 0,1,2,4, then all needing 0,2,6,11 in another batch, it would increase the I/O significantly, because segments would have to be unloaded before they've finished being used. E.g. with the above formulas:

* load segments 0,1,2,4 to process the 0,1,2,4 and 0,1,2,2 batches of formulas
* unload segments 1 and 4 and load segments 6,11 to process the 0,2,6,11 batch
* unload segment 11 and wastefully reload segment 4 to process the 0,2,4,6 batch

I feel a lot of time could be spent trying to optimize the processing order, but it would still be slow due to excess I/O.
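Here's a rough sketch of how the per-formula segment sets above could be derived, including the ±3ppm window that can push a single peak across a segment boundary (the helper is illustrative, not code from either PR):

```python
import bisect

# Lower mz bounds of the huge4 dataset segments (from the table above).
bounds = [198.010676, 200.958094, 201.091571, 202.094092, 202.990456, 203.087088,
          203.980149, 204.434332, 204.989594, 205.083574, 205.880832, 206.001647]

def segments_for_formula(peak_mzs, ppm=3):
    """Set of dataset segments needed to cover every peak's mz ± ppm window."""
    needed = set()
    for mz in peak_mzs:
        for bounded_mz in (mz * (1 - ppm * 1e-6), mz * (1 + ppm * 1e-6)):
            needed.add(bisect.bisect_right(bounds, bounded_mz) - 1)
    return sorted(needed)

# Formula 176945 from the tables above needs segments [0, 2, 6, 11]; a 4-peak
# formula can touch up to 8 segments when its windows straddle boundaries.
print(segments_for_formula([200.007449, 202.005614, 204.004776, 206.004942]))
```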
There's also a third option: capture images in `mz` order and save them to COS in chunks based on `formula_i`, then calculate the metrics in a separate step that doesn't load the dataset segments at all. Although this would scale to much bigger datasets than either approach so far, it would probably also be slower due to the extra I/O.
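To make that third option concrete, here's a rough sketch of the two-phase flow. It assumes a boto3-style COS client (`put_object`/`get_object`); the `compute_metrics` callback, chunk size and key layout are all hypothetical:

```python
import pickle
from collections import defaultdict

FORMULAS_PER_CHUNK = 10000  # illustrative value

def save_segment_images(cos, bucket, segment_i, segment_images):
    """Phase 1: after processing one dataset segment, persist the images it
    produced, partitioned by formula_i chunk, then free them from memory."""
    by_chunk = defaultdict(list)
    for formula_i, peak_i, image in segment_images:
        by_chunk[formula_i // FORMULAS_PER_CHUNK].append((formula_i, peak_i, image))
    for chunk_i, items in by_chunk.items():
        cos.put_object(Bucket=bucket,
                       Key=f'images/chunk_{chunk_i}/segment_{segment_i}.pickle',
                       Body=pickle.dumps(items))

def score_formula_chunk(cos, bucket, chunk_i, segment_is, compute_metrics):
    """Phase 2: reassemble one formula chunk from all segments and compute
    metrics without re-reading any dataset segment. (A real implementation
    would skip keys that a segment never wrote.)"""
    by_formula = defaultdict(dict)
    for segment_i in segment_is:
        key = f'images/chunk_{chunk_i}/segment_{segment_i}.pickle'
        body = cos.get_object(Bucket=bucket, Key=key)['Body'].read()
        for formula_i, peak_i, image in pickle.loads(body):
            by_formula[formula_i][peak_i] = image
    return {f: compute_metrics(imgs) for f, imgs in by_formula.items()}
```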
Due to the complexity of reaching this target, I think we can rely on #54 and close this one.
See individual commit messages.
I still need to check that the results are valid. They're pretty close, but I have to check why a few discrepancies are reported.