Open ErikPGJ opened 5 months ago
Problem should exist in below loop:
for ind_sweep = min_sweep:max_sweep
v1_ = zeros(128, 1);
p_punt = find(sweep_num == ind_sweep);
if ~isempty(p_punt)
for indband = 1:numel(p_punt)
idx_l = floor(32 * bande_e(p_punt(indband))) + 1;
idx_r = floor(32 * (bande_e(p_punt(indband)) + 1));
v1_(idx_l:idx_r) = auto_calib(p_punt(indband), :);
end
end
if sum(v1_) > 0.0 % Might be a way of detecting CDF fill values (large, negative).
punt0_ = find(v1_ == 0.0);
if ~isempty(punt0_)
v1_(punt0_) = NaN;
end
v_ = [v_, v1_];
sweepn_tnr = [sweep_tnr, sweep_num(p_punt(1))];
end
if ~isempty(p_punt)
time_ = [time_, time_rr(min(p_punt))];
end
end
In particular, the incrementally growing arrays sweepn_tnr
and time_
look suspicious.
Previous informal test of above loop:
1 day 13 s
2 days 45 s
3 days 121 s
Or maybe it does not scale badly after all?
Note: Underlying dataset used for testing have very similar sizes.
/data/solo/remote/data/L2/thr> ll -h 2022/12/solo_L2_rpw-tnr-surv-cdag_202212[23]* 2023/01/solo_L2_rpw-tnr-surv-cdag_2023010* | cut -b31-
326M 2023-11-13 14.57:52 2022/12/solo_L2_rpw-tnr-surv-cdag_20221220_V07.cdf
305M 2023-11-13 15.02:40 2022/12/solo_L2_rpw-tnr-surv-cdag_20221221_V10.cdf
328M 2023-11-13 15.07:52 2022/12/solo_L2_rpw-tnr-surv-cdag_20221222_V06.cdf
288M 2023-11-13 15.12:23 2022/12/solo_L2_rpw-tnr-surv-cdag_20221223_V09.cdf
328M 2023-11-13 15.17:32 2022/12/solo_L2_rpw-tnr-surv-cdag_20221224_V07.cdf
328M 2023-11-13 15.22:42 2022/12/solo_L2_rpw-tnr-surv-cdag_20221225_V06.cdf
327M 2023-11-13 15.27:54 2022/12/solo_L2_rpw-tnr-surv-cdag_20221226_V08.cdf
326M 2023-11-13 15.32:59 2022/12/solo_L2_rpw-tnr-surv-cdag_20221227_V07.cdf
327M 2023-11-13 15.38:15 2022/12/solo_L2_rpw-tnr-surv-cdag_20221228_V08.cdf
328M 2023-11-13 15.43:34 2022/12/solo_L2_rpw-tnr-surv-cdag_20221229_V07.cdf
328M 2023-11-13 15.48:49 2022/12/solo_L2_rpw-tnr-surv-cdag_20221230_V09.cdf
328M 2023-11-13 15.53:58 2022/12/solo_L2_rpw-tnr-surv-cdag_20221231_V09.cdf
328M 2023-11-13 11.44:33 2023/01/solo_L2_rpw-tnr-surv-cdag_20230101_V08.cdf
327M 2023-11-13 11.49:48 2023/01/solo_L2_rpw-tnr-surv-cdag_20230102_V08.cdf
326M 2023-11-13 11.54:59 2023/01/solo_L2_rpw-tnr-surv-cdag_20230103_V08.cdf
220M 2023-11-13 11.58:33 2023/01/solo_L2_rpw-tnr-surv-cdag_20230104_V07.cdf
328M 2023-11-13 12.03:49 2023/01/solo_L2_rpw-tnr-surv-cdag_20230105_V06.cdf
328M 2023-11-13 12.09:02 2023/01/solo_L2_rpw-tnr-surv-cdag_20230106_V06.cdf
328M 2023-11-13 12.14:22 2023/01/solo_L2_rpw-tnr-surv-cdag_20230107_V05.cdf
328M 2023-11-13 12.19:36 2023/01/solo_L2_rpw-tnr-surv-cdag_20230108_V05.cdf
277M 2023-11-13 12.24:08 2023/01/solo_L2_rpw-tnr-surv-cdag_20230109_V06.cdf
Profiling implies that the problem is something completely different:
profile on; solo.read_TNR(irf.tint('2022-12-28T00:00:00Z/2022-12-29T00:00:00Z')); profile off; profile viewer
Indeed, the command
p_punt = find(sweep_num == ind_sweep);
does seem to take most of the time, even if commenting out everything else in the loop.
From analyzing the code, it seems possible to speed up the code somewhat (factor of ~2?) by making below replacement:
%for ind_sweep = min_sweep:max_sweep
for ind_sweep = unique(sweep_num)' % NOTE: "sweep_num = sweep_num(sens_)" above.
(Not tested if the resulting data is the same though.)
Speculative idea: Could maybe parallelize the loop (parfor)?
Seems one could simplify (not speed up)
punt0_ = find(v1_ == 0.0);
if ~isempty(punt0_)
v1_(punt0_) = NaN;
end
to
v_1(v1_==0) = NaN;
.
Seems one could almost simplify iteration (not speed up) to use only one if statement since
isempty(p_punt)
==> ~(sum(v1_) > 0.0)
.
Footnote: Not implementing these ideas myself since (1) I do not want to interfere with Jordi's work on read_TNR()
, and (2) because I do not have time to actually check if it works well myself right now. Just collecting ideas.
Changing
punt0 = find(v1 == 0.0); if ~isempty(punt0) v1(punt0_) = NaN; end
to
v1(v1==0) = NaN;
should work. The code was translated 'literally' from IDL so there are some implementations that are not optimized for matlab.
Yes, now when I looked more carefully for it, that sounds it could be a "bad translation".
I did not really intend to analyze read_TNR()
for bugs (since that is more for https://github.com/irfu/irfu-matlab/issues/138), just tried to understand what the loop does in my head. I also note that sweep_tnr
and sweepn_tnr
are not used for anything. Is it both a typo (should be one and same variable?) and variable(s) that should have been removed?
Step 1: Latest code?
git pull
this is still a problem). Should be present in latest commit on devel branch:b679b71a Erik P G Johansson (2024-04-08 12:46:41 +0200)
Step 2: Describe your environment
master
,devel
): develMatlab R2020b
): MATLAB R2023bWindows 10
,Mac OS X 10.15.6
,Linux Ubuntu 20.04
): Ubuntu 22.04.4 LTSStep 3: Describe the problem
solo.read_TNR()
appears to have performance issues. Wall time spent in function increases faster than linearly with length of time interval of data used for it (the argument).Steps to reproduce:
Simple code for visualizing behaviour.