therneau / survival

Survival package for R

Segfault with survreg() #263

Closed: admash closed this issue 2 months ago

admash commented 4 months ago

Hello!

I have run into a problem where repeated calls to try(survreg(...)) that do not converge eventually cause R to segfault. The number of calls needed to trigger the segfault depends on the size of the dataset.

I have attached a .zip file with two minimal code examples that reproduce the crash. One uses a single-row data frame; the other uses an included data frame of ~11k rows. The single-row data frame segfaults after ~250 calls, while the large one segfaults after ~10 calls.

survreg-try-reprex.zip

The output logs for the two examples are attached here:

crash-01.log crash-02.log

I am running Arch Linux, with the following output from R.version:

> R.version

platform       x86_64-pc-linux-gnu         
arch           x86_64                      
os             linux-gnu                   
system         x86_64, linux-gnu           
status                                     
major          4                           
minor          4.1                         
year           2024                        
month          06                          
day            14                          
svn rev        86737                       
language       R                           
version.string R version 4.4.1 (2024-06-14)
nickname       Race for Your Life 

Please let me know if you need further information.

-admash

admash commented 4 months ago

This segfault has also been independently reproduced on macOS running on ARM. The output is attached here:

crash-03.log

therneau commented 4 months ago

The issue appears to be with the return value when the iteration does not converge. I'll look deeper. Data sets where survreg does not converge are very rare.

therneau commented 4 months ago

The survreg code first fits a model with only an intercept and scale, to use as starting estimates. That iteration is failing, which then passes invalid arguments to the C routine, and that routine fails. This first stage has never failed before, and I have no checks for it. That will be easy to fix. Failure was guaranteed with your small data set (only one observation and two parameters); the bigger set is an interesting puzzle to understand.
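One way to see why a single observation guarantees failure is the small Python sketch below (Python is used purely for illustration). It assumes one uncensored observation and the extreme-value distribution on the log-time scale, which corresponds to survreg's default Weibull model; the observed time of 5 and the function name `ev_loglik` are hypothetical. With the intercept set to log(y), z is 0 and the log-likelihood reduces to -1 - log(scale), which grows without bound as log(scale) decreases. There is no finite maximum for an intercept-plus-scale fit to converge to.

```python
import math

def ev_loglik(log_y, intercept, log_scale):
    """Log-likelihood of one uncensored observation under an extreme-value
    (log-Weibull) model: z = (log_y - intercept) / scale and
    log f = z - exp(z) - log(scale)."""
    scale = math.exp(log_scale)
    z = (log_y - intercept) / scale
    return z - math.exp(z) - log_scale

log_y = math.log(5.0)  # hypothetical single observed time
for log_scale in [0.0, -5.0, -50.0, -500.0]:
    # With intercept == log_y, z is 0 and the value is -1 - log_scale.
    print(log_scale, ev_loglik(log_y, log_y, log_scale))
```

The printed log-likelihoods are -1.0, 4.0, 49.0 and 499.0: each step toward a smaller scale "improves" the fit, so a Newton-Raphson iteration keeps driving log(scale) toward minus infinity.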

therneau commented 2 months ago

Now fixed. Because of a bug, step halving was not invoked when the trial loglik was infinite. Your data set leads to a particularly bad first Newton-Raphson step.
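The idea behind the fix (treat a non-finite trial loglik as a failed step and halve it) can be sketched as follows. This is not survreg's C code: it is a minimal Python illustration assuming a log-normal model (Gaussian on the log scale) in (intercept, log(scale)), made-up log-times 1, 2 and 3, and a deliberately poor starting log(scale) of 4. All function names are hypothetical, and the helper `_exp` returns infinity on overflow to mimic IEEE arithmetic in C, so the full first Newton step yields a loglik of -inf.

```python
import math

def _exp(x):
    """exp() that returns inf on overflow instead of raising, as C would."""
    try:
        return math.exp(x)
    except OverflowError:
        return math.inf

Y = [1.0, 2.0, 3.0]  # hypothetical uncensored log survival times
N = len(Y)

def loglik(beta):
    """Gaussian log-likelihood (constants dropped) in (intercept, log scale)."""
    mu, log_s = beta
    ss = sum((y - mu) ** 2 for y in Y)
    return -N * log_s - 0.5 * ss * _exp(-2.0 * log_s)

def grad_hess(beta):
    """Analytic gradient and Hessian of loglik."""
    mu, log_s = beta
    inv_var = _exp(-2.0 * log_s)
    s1 = sum(y - mu for y in Y)
    ss = sum((y - mu) ** 2 for y in Y)
    g = [s1 * inv_var, -N + ss * inv_var]
    h = [[-N * inv_var, -2.0 * s1 * inv_var],
         [-2.0 * s1 * inv_var, -2.0 * ss * inv_var]]
    return g, h

def newton_direction(g, h):
    """Solve h d = -g for a 2x2 Hessian."""
    det = h[0][0] * h[1][1] - h[0][1] * h[1][0]
    return [(-h[1][1] * g[0] + h[0][1] * g[1]) / det,
            (h[1][0] * g[0] - h[0][0] * g[1]) / det]

def fit_step_halving(beta, max_iter=50, max_halve=30, tol=1e-9):
    current = loglik(beta)
    for _ in range(max_iter):
        g, h = grad_hess(beta)
        d = newton_direction(g, h)
        step = 1.0
        for _attempt in range(max_halve):
            trial = [beta[0] + step * d[0], beta[1] + step * d[1]]
            new = loglik(trial)
            # Test for acceptance rather than failure: comparisons with NaN
            # are always False, so "if new < current: halve" would accept a
            # NaN trial point. isfinite() also rejects +/-inf.
            # The small slack tolerates rounding near the optimum.
            if math.isfinite(new) and new >= current - 1e-12:
                break
            step /= 2.0
        else:
            raise RuntimeError("step halving could not find an acceptable step")
        done = abs(new - current) < tol
        beta, current = trial, new
        if done:
            return beta, current
    raise RuntimeError("Newton-Raphson did not converge")

beta_hat, ll_hat = fit_step_halving([2.0, 4.0])
print(beta_hat)  # intercept 2, log(scale) near 0.5 * log(2/3)
```

From this start the full Newton step moves log(scale) by roughly -2235, and the loglik there is -inf. The halving loop shrinks the step until a finite, non-decreasing value is found, after which ordinary Newton steps converge.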

admash commented 2 months ago

Thanks Terry. You remain firmly ensconced in my statistical pantheon.

admash commented 2 months ago

Hi Terry,

Unfortunately, we are still experiencing segfaults. I have attached a .zip file with code and data to reproduce the crash.

survreg-crash2-reprex.zip

Let me know if you would prefer that I make a new issue.

therneau commented 2 months ago

Two issues here. First, you have found a data set for which the initial (intercept, scale) model fit fails. The true solution for (intercept, log(scale)) is approximately (4.5, -0.39). From a starting estimate of (3.04, 0.0162), the first Newton-Raphson step goes to (17.5, -130.4), and the iteration never recovers. Second, I don't have a check for whether that first step fails, and the infinite value that arises causes the subsequent code to fall apart. I'll fix the second issue now. The data you sent will be added to my fail directory, and I'll have to work out a solution to the iteration problem, likely a trust region method.
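A trust region is only named here as a likely direction. As a rough illustration of the idea, here is a Python sketch of a simplified trust-region-style safeguard: the Newton step is scaled down whenever it is longer than a radius, and the radius shrinks or grows according to how well the local quadratic model predicted the actual change in loglik. It is not a full trust-region subproblem solver (no dogleg or exact subproblem step), it assumes the Hessian is negative definite at the iterates, and the log-normal model, data (log-times 1, 2, 3), starting point, thresholds 0.25 and 0.75, and all function names are hypothetical choices.

```python
import math

def _exp(x):
    """exp() that returns inf on overflow instead of raising, as C would."""
    try:
        return math.exp(x)
    except OverflowError:
        return math.inf

Y = [1.0, 2.0, 3.0]  # hypothetical uncensored log survival times
N = len(Y)

def loglik(beta):
    """Gaussian log-likelihood (constants dropped) in (intercept, log scale)."""
    mu, log_s = beta
    ss = sum((y - mu) ** 2 for y in Y)
    return -N * log_s - 0.5 * ss * _exp(-2.0 * log_s)

def grad_hess(beta):
    """Analytic gradient and Hessian of loglik."""
    mu, log_s = beta
    inv_var = _exp(-2.0 * log_s)
    s1 = sum(y - mu for y in Y)
    ss = sum((y - mu) ** 2 for y in Y)
    g = [s1 * inv_var, -N + ss * inv_var]
    h = [[-N * inv_var, -2.0 * s1 * inv_var],
         [-2.0 * s1 * inv_var, -2.0 * ss * inv_var]]
    return g, h

def fit_capped_newton(beta, radius=1.0, max_iter=100, tol=1e-8):
    current = loglik(beta)
    for _ in range(max_iter):
        g, h = grad_hess(beta)
        if max(abs(gi) for gi in g) < tol:
            return beta, current
        # Newton direction: solve h d = -g (h assumed negative definite).
        det = h[0][0] * h[1][1] - h[0][1] * h[1][0]
        d = [(-h[1][1] * g[0] + h[0][1] * g[1]) / det,
             (h[1][0] * g[0] - h[0][0] * g[1]) / det]
        # Keep the step inside the trust radius (max-norm).
        length = max(abs(di) for di in d)
        on_boundary = length > radius
        if on_boundary:
            d = [di * radius / length for di in d]
        # Increase in loglik predicted by the local quadratic model.
        hd = [h[0][0] * d[0] + h[0][1] * d[1], h[1][0] * d[0] + h[1][1] * d[1]]
        predicted = g[0] * d[0] + g[1] * d[1] + 0.5 * (d[0] * hd[0] + d[1] * hd[1])
        trial = [beta[0] + d[0], beta[1] + d[1]]
        new = loglik(trial)
        accept = math.isfinite(new) and new >= current - 1e-12
        rho = (new - current) / predicted if accept and predicted > 0 else 0.0
        if not accept or rho < 0.25:
            radius *= 0.25   # model was poor (or value non-finite): shrink
        elif rho > 0.75 and on_boundary:
            radius *= 2.0    # model was good and the cap was binding: grow
        if accept:
            beta, current = trial, new
    raise RuntimeError("did not converge")

beta_hat, ll_hat = fit_capped_newton([2.0, 4.0])
print(beta_hat)  # intercept 2, log(scale) near 0.5 * log(2/3)
```

Unlike step halving, the huge first Newton step is never evaluated at all: log(scale) moves by at most the current radius, and the radius only grows while the quadratic model keeps predicting the improvement well.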