@omerb01 we still need this one?
@gilv yes, to get more precise control over the `annotate` method's memory usage
@omerb01 I believe the changes in this PR aren't compatible with the changes from your PR. As I described in Slack, this PR requires `centr_df` to be iterated in `mz` order so that dataset segments can be lazily loaded, whereas your PR changed it to be iterated in `formula_i` order so that images are produced in full sets and don't need to be kept in memory for very long.
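For context, here's a minimal sketch of the two iteration strategies being compared. It assumes `centr_df` is a pandas DataFrame with `formula_i`, `peak_i` and `mz` columns; the sample rows are illustrative, not the real centroid data:

```python
import pandas as pd

# Hypothetical centroids table; the real centr_df has more columns.
centr_df = pd.DataFrame({
    "formula_i": [237106, 237106, 145669, 145669],
    "peak_i":    [0, 1, 0, 1],
    "mz":        [200.006899, 201.010312, 200.006987, 201.006541],
})

# Strategy in this PR: iterate centroids in mz order, so each dataset segment
# (a contiguous mz window) can be loaded lazily and released once iteration
# moves past its upper mz bound.
for row in centr_df.sort_values("mz").itertuples(index=False):
    pass  # match row.mz against the currently loaded dataset segment

# Strategy in the other PR: iterate in formula_i order, so all of a formula's
# peaks arrive together and its images can be finalised immediately, at the
# cost of needing several dataset segments resident at once.
for formula_i, peaks in centr_df.sort_values("formula_i").groupby("formula_i", sort=False):
    pass  # all of this formula's peaks are in `peaks`
```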
If you feel there's value in keeping this implementation, I'll make an alternate function for it and add a keyword param for switching between the two strategies. However, I believe that the worst-case memory usage of your implementation should still fit in 2GB of memory for the 60GB dataset, so I'm not sure if it's necessary...
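If we do end up keeping both, the switch could be as simple as something like this (the function and parameter names are hypothetical, not the actual PR API):

```python
def iter_centroids(centr_df, order="by_mz"):
    """Yield centroid rows in the chosen processing order (illustrative only)."""
    if order == "by_mz":         # lazy dataset-segment loading (this PR)
        key = "mz"
    elif order == "by_formula":  # full image sets per formula (the other PR)
        key = "formula_i"
    else:
        raise ValueError(f"unknown order: {order!r}")
    return centr_df.sort_values(key).itertuples(index=False)
```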
@LachlanStuart I would like any dataset of any size not to cause an OOM in the future.
Do you think it isn't possible to iterate over dataset segments and also get all the images related to a specific formula each time?
I sorted the database by `formula_i` only to collect each formula's images together; the order itself doesn't matter.
A leading question: can we find images related to the same formula in different dataset segments?
@omerb01 If we look at the `huge4` dataset, the dataset segments are split like this:
segment # | lowest mz | highest mz |
---|---|---|
0 | 198.010676 | 200.958094 |
1 | 200.958094 | 201.091571 |
2 | 201.091571 | 202.094092 |
3 | 202.094092 | 202.990456 |
4 | 202.990456 | 203.087088 |
5 | 203.087088 | 203.980149 |
6 | 203.980149 | 204.434332 |
7 | 204.434332 | 204.989594 |
8 | 204.989594 | 205.083574 |
9 | 205.083574 | 205.880832 |
10 | 205.880832 | 206.001647 |
11 | 206.001647 | 206.982079 |
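Since each segment covers a contiguous mz window, finding the segment for a given mz is just a binary search over the lower bounds. A small sketch using the bounds from the table above (`segment_for_mz` is a hypothetical helper, not code from either PR):

```python
import bisect

# Lower mz bounds of the huge4 dataset segments from the table above.
segment_lower_bounds = [
    198.010676, 200.958094, 201.091571, 202.094092, 202.990456, 203.087088,
    203.980149, 204.434332, 204.989594, 205.083574, 205.880832, 206.001647,
]

def segment_for_mz(mz):
    """Return the index of the dataset segment whose mz window contains `mz`."""
    return bisect.bisect_right(segment_lower_bounds, mz) - 1

print(segment_for_mz(201.010312))  # -> 1
```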
A typical formula will have 4 peaks, which are distributed something like this when sorted by the mz of peak 0:
formula_i | peak 0 mz | peak 1 mz | peak 2 mz | peak 3 mz |
---|---|---|---|---|
237106 | 200.006899 | 201.010312 | 202.004920 | 203.008290 |
145669 | 200.006987 | 201.006541 | 201.010394 | 202.003257 |
176945 | 200.007449 | 202.005614 | 204.004776 | 206.004942 |
620657 | 200.008169 | 202.003564 | 203.006799 | 204.007728 |
Or if you map them to segment numbers:
formula_i | peak 0 segment | peak 1 segment | peak 2 segment | peak 3 segment |
---|---|---|---|---|
237106 | 0 | 1 | 2 | 4 |
145669 | 0 | 1 | 2 | 2 |
176945 | 0 | 2 | 6 | 11 |
620657 | 0 | 2 | 4 | 6 |
To evaluate all peaks for one formula, it's necessary to load up to 4 separate dataset segments (technically up to 8, because sometimes the mz ±3ppm range will sit on the border between two segments; see the sketch after this list). Although we could spend some time optimizing the order in which formulas are iterated so that we batch together all formulas needing e.g. segments 0,1,2,4, then all needing 0,2,6,11 in another batch, it would increase the I/O significantly, because segments would have to be unloaded before they've finished being used. E.g. with the above formulas:

* load segments 0,1,2,4 to process the 0,1,2,4 and 0,1,2,2 batches of formulas
* unload segments 1 and 4 and load segments 6,11 to process the 0,2,6,11 batch
* unload segment 11 and wastefully reload segment 4 to process the 0,2,4,6 batch

I feel a lot of time could be spent trying to optimize the processing order, but it would still be slow due to excess I/O.
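Here's a rough sketch of how the per-formula segment sets above could be derived, including the ±3ppm window that can push a single peak across a segment boundary (the helper is illustrative, not code from either PR):

```python
import bisect

# Lower mz bounds of the huge4 dataset segments (from the table above).
bounds = [198.010676, 200.958094, 201.091571, 202.094092, 202.990456, 203.087088,
          203.980149, 204.434332, 204.989594, 205.083574, 205.880832, 206.001647]

def segments_for_formula(peak_mzs, ppm=3):
    """Set of dataset segments needed to cover every peak's mz ± ppm window."""
    needed = set()
    for mz in peak_mzs:
        for bounded_mz in (mz * (1 - ppm * 1e-6), mz * (1 + ppm * 1e-6)):
            needed.add(bisect.bisect_right(bounds, bounded_mz) - 1)
    return sorted(needed)

# Formula 176945 from the tables above needs segments [0, 2, 6, 11]; a 4-peak
# formula can touch up to 8 segments when its windows straddle boundaries.
print(segments_for_formula([200.007449, 202.005614, 204.004776, 206.004942]))
```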
There's also a third option: capture images in `mz` order and save them to COS in chunks based on `formula_i`, then calculate the metrics in a separate step that doesn't load the dataset segments at all. Although this would scale to much bigger datasets than either approach so far, it would probably also be slower due to the extra I/O.
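To make that third option concrete, here's a rough sketch of the two-phase flow. It assumes a boto3-style COS client (`put_object`/`get_object`); the `compute_metrics` callback, chunk size and key layout are all hypothetical:

```python
import pickle
from collections import defaultdict

FORMULAS_PER_CHUNK = 10000  # illustrative value

def save_segment_images(cos, bucket, segment_i, segment_images):
    """Phase 1: after processing one dataset segment, persist the images it
    produced, partitioned by formula_i chunk, then free them from memory."""
    by_chunk = defaultdict(list)
    for formula_i, peak_i, image in segment_images:
        by_chunk[formula_i // FORMULAS_PER_CHUNK].append((formula_i, peak_i, image))
    for chunk_i, items in by_chunk.items():
        cos.put_object(Bucket=bucket,
                       Key=f'images/chunk_{chunk_i}/segment_{segment_i}.pickle',
                       Body=pickle.dumps(items))

def score_formula_chunk(cos, bucket, chunk_i, segment_is, compute_metrics):
    """Phase 2: reassemble one formula chunk from all segments and compute
    metrics without re-reading any dataset segment. (A real implementation
    would skip keys that a segment never wrote.)"""
    by_formula = defaultdict(dict)
    for segment_i in segment_is:
        key = f'images/chunk_{chunk_i}/segment_{segment_i}.pickle'
        body = cos.get_object(Bucket=bucket, Key=key)['Body'].read()
        for formula_i, peak_i, image in pickle.loads(body):
            by_formula[formula_i][peak_i] = image
    return {f: compute_metrics(imgs) for f, imgs in by_formula.items()}
```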
Due to the complexity of reaching this target, I think we can rely on #54 and close this one.
See individual commit messages.
I still need to check that the results are valid. They're pretty close, but I have to check why a few discrepancies are reported.