impute only as many values as needed, not 201

l-k- commented 10 months ago

We were always imputing 201 values instead of whatever number was needed.

We have a couple of test files where this had a particularly bad effect -- there was an odd chunk with abnormally high frequency (twice the expected frequency), and when we imputed it with 200 values instead of ~50 that we should have, we went completely past the next chunk, so that next chunk then ended up with a negative frequency and got imputed as well, and that domino effect continued for over 1000 chunks.

vincentvanhees commented 10 months ago

Thanks Lena.

In line 573 to 580 we update object last to account for the scenario where a block reflects a larger time window than we expect, by incrementing last. I see your update relates to accounting for the opposite scenario where a block lasts less than the expected time window (more values in a block than we expect).

Your fix focuses specifically on the scenario impute == TRUE. I am now wondering whether this also should apply to impute == FALSE. Would it not be better to address this at around line 581? For example with:

if (rawTime[rawLast] < timeRes[last]) {
      # block is shorter than expected
      timecut = timeRes[last] - rawTime[rawLast]
      positions2subtract = floor(cut * prevRaw$frequency)
      last = last - positions2subtract
      if (last <= pos) last = pos + 1 #  to catch extreme scenario of time standing still or reversing
}

Following the above, I am wondering whether this means that frequency_observed as calculated higher up in the function is still correct? I think it is correct, but could you double check?
Which of the test files did the problem occur in? It may be helpful if I double checked that new output looks plausible.

l-k- commented 10 months ago

I think this is not an issue when impute == FALSE because the tmp object returned by resample() will not have more samples in it than needed.

Here resample.cpp determines how many samples / time points it needs to return.

In ReadAxivity, we start with 201 time points, and then use lines 573-580 to make sure that we give resample.cpp enough time points to work with, but it's ok to give resample.cpp more time points than needed, it'll only use what's needed.
I think frequency_observed calculation is correct. Before the fix in this PR, frequency_observed was ending up incorrect for a block after an imputed block, because rawTime[1] as assigned here (right after an imputation) was incorrect. But the formula for frequency_observed itself is fine I think, and as long as rawTime[1] is correct, the frequency is calculated correctly.
The test file I used to debug this is one of the 6 files you shared with me, an Axivity cwa file with a gap in it, and it's the file with "RT" in the name (let me know if I should email you the full file name or if it's ok to post it here). With this fix it takes about a minute to process that file, but without the fix it takes over an hour and ends up with > 240K of unnecessarily imputed blocks.

(I thought there was one more file with this same issue, but I can't find it, I must have tested with this one file twice in a row without realizing it.)

codecov-commenter commented 10 months ago

Codecov Report

Merging #49 (31e0238) into main (9f786f6) will not change coverage. The diff coverage is n/a.

:exclamation: Current head 31e0238 differs from pull request most recent head 43cd47a. Consider uploading reports for the commit 43cd47a to get more accurate results

:exclamation: Your organization is not using the GitHub App Integration. As a result you may experience degraded service beginning May 15th. Please install the Github App Integration for your organization. Read more.

@@           Coverage Diff           @@
##             main      #49   +/-   ##
=======================================
  Coverage   86.79%   86.79%           
=======================================
  Files           9        9           
  Lines         901      901           
=======================================
  Hits          782      782           
  Misses        119      119

wadpac / GGIRread

impute only as many values as needed, not 201 #49

Codecov Report