ropensci / daiquiri

Data quality reporting for temporal datasets.
https://ropensci.github.io/daiquiri/
GNU General Public License v3.0

CRAN submission NOTE: Examples with CPU time > 2.5 times elapsed time #15

Closed by phuongquan 1 year ago

phuongquan commented 1 year ago

It looks like the CRAN (Debian) check settings no longer limit the number of threads data.table uses during CRAN checks. Email exchange with the CRAN team below:

On 13.07.2023 13:32, Phuong Quan wrote:

Hello,

I have this new NOTE in the Debian pretest (but not on Windows), which I think might be a false positive:

  • checking examples ... [7s/2s] NOTE
    Examples with CPU time > 2.5 times elapsed time
                    user system elapsed ratio
    aggregate_data  2.54  0.045   0.509 5.079

My understanding is that the NOTE is caused by the use of multiple threads/processes, but I do not employ any parallelism in the package. I can only assume therefore that it is the data.table package (which the aggregate_data() function uses) that is doing the parallelism.

I found a data.table thread from 2019 (https://github.com/Rdatatable/data.table/issues/3300) where the eplusr package got the same CRAN pretest NOTE only on Debian, and where the data.table maintainer Matt Dowle says: "Around that time, that CRAN machine used a value of 4 for OMP_THREAD_LIMIT. I discovered that and agreed with CRAN maintainers that it should be 2. It is now 2. That one machine (linux-debian) handles 4 lines of the CRAN checks matrix: devel-gcc, devel-clang, patched-linux and release-linux, which is why those 4 were affected. You should be able to reproduce the note with export OMP_THREAD_LIMIT=4, but not with 2. There was a problem in data.table not respecting OMP_THREAD_LIMIT but that was fixed in v1.12.2 (7 Apr 2019); news item 3. Then when data.table started to correctly respect OMP_THREAD_LIMIT it took a while to discover that one CRAN machine used a value of 4."

Could it be that the Debian CRAN machine has a value of 4 for OMP_THREAD_LIMIT again? The daiquiri package does not alter the number of threads or thread limit at any point.

We do not set the flag anymore: Users may be unaware that parallelism is used and that they have to set such an env var to avoid it. The package should make sure that not more than 2 cores are used unless explicitly requested by the user.

Best, Uwe Ligges

We don't get the NOTE in v1.0.3 CRAN checks, but this may be because v1.1.0 has a larger example dataset.

Until/unless data.table implements a fix for this, we should probably call setDTthreads() in all relevant examples and vignettes.
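
For reference, a minimal sketch of capping data.table threads around an example, assuming data.table is installed. The toy table and the choice of a single thread are illustrative assumptions, not taken from the daiquiri source:

```r
# Sketch only: cap data.table's thread count for the duration of an example.
library(data.table)

# setDTthreads() invisibly returns the previous setting, so keep it for later
old_threads <- setDTthreads(1)
threads_during <- getDTthreads()

# Placeholder for the real example code (e.g. a call that uses data.table
# internally); this toy table is hypothetical
dt <- data.table(x = 1:10)
total <- dt[, sum(x)]

# Restore the previous setting so the user's session is left unchanged
setDTthreads(old_threads)
```

Capping at 1 or 2 threads stays within the two-core limit mentioned in the CRAN reply, and restoring the old value afterwards avoids silently changing the user's settings. In Rd examples the two setDTthreads() calls could probably be wrapped in \dontshow{} so they do not clutter the rendered help page.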