r-lib / covr

Test coverage reports for R
https://covr.r-lib.org
Other
334 stars 117 forks source link

segfault recently in codecov() #567

Open MichaelChirico opened 6 months ago

MichaelChirico commented 6 months ago

We are observing a segfault in codecov recently, e.g.

https://github.com/Rdatatable/data.table/actions/runs/8885891976/job/24398250857?pr=6107

Unfortunately I haven't been able to reproduce it outside CI, but it's pretty persistent across all commits in the past 2 days or so:

https://github.com/Rdatatable/data.table/actions/workflows/test-coverage.yaml

The stack trace is into try(), but I am not sure how to proceed from here, any ideas?

mcol commented 2 months ago

Your last green run on master was on 24 April, which is when R 4.4.0 was released. Something must have changed in that release that is negatively affecting covr. In https://github.com/R-Lum/Luminescence/issues/144 we've also attempted to debug our incredibly long coverage times, and we reached the conclusion that they started with R 4.4.0: we didn't see any segfault, but coverage times have increased by at least 10x. We're now running coverage on R 4.3.3 without issues.

struckma commented 2 months ago

Your last green run on master was on 24 April, which is when R 4.4.0 was released. Something must have changed in that release that is negatively affecting covr. In R-Lum/Luminescence#144 we've also attempted to debug our incredibly long coverage times, and we reached the conclusion that they started with R 4.4.0: we didn't see any segfault, but coverage times have increased by at least 10x. We're now running coverage on R 4.3.3 without issues.

Indeed, I have observed also some connection with the R version 4.4.0, but I could not really identify the reason. I did also dig in R's C-sources, but there is nothing really obvious. The NEWS.md mentions some changes around I/O, and there was a change in the serialization code.

I could not reliably reproduce the issue in a toy example, but it seems to be related to containerization, parallel test execution and maybe to the way, R tests are started (using the code = {} argument of covr::package_coverage(), e.g.

mcol commented 2 months ago

I've bisected the R changes between 4.3.3 and 4.4.0 (using wch's R-source mirror), and I identified this commit as the one where the problems started. This refers to Bugzilla PR 16756, which indeed mentions covr as being related.

Commenting out these two lines (added in that commit) on the current trunk fixes the issue we've been seeing in https://github.com/R-Lum/Luminescence/issues/144#issuecomment-2291237372: https://github.com/wch/r-source/blob/e391fe166e7af8991b36c4d4f124bf091aa3c702/src/library/utils/R/sourceutils.R#L153-L154

Unfortunately, I'm not sure how to take this forward as I don't understand the original problem at PR 16756 nor its solution. :neutral_face:

Perhaps @MichaelChirico could you see if commenting out those lines fixes the segfault you've been seeing?

MichaelChirico commented 1 month ago

~Hmm, I tried pinning our GHA to R4.3 and still got a segfault:~

~https://github.com/Rdatatable/data.table/pull/6540/commits/ab676f475f6fa32f8b91ab74e64469b25f98db02~

Scratch that, I guess it was a caching issue fixed by switching to the "latest" codecov GHA cribbed from dplyr:

https://github.com/Rdatatable/data.table/pull/6540

Once I did that, I observed the same thing as @mcol -- the codecov GHA being extremely slow, but tolerable when switching to a pinned version of R 4.3.3.