therneau / survival

Survival package for R

Segfault with survreg() #263

Closed (admash closed this issue 3 days ago)

admash commented 2 months ago

Hello!

I have run into a problem where repeated calls to try(survreg(...)) that do not converge cause R to segfault. The number of calls needed to produce a segfault depends on the size of the dataset.

I have attached a .zip file with two minimal code examples that produce the crash. One uses a single-row data frame, while the other uses an included ~11k-row data frame. The single-row data frame results in a segfault after ~250 calls, while the large data frame produces a segfault after ~10 calls.

survreg-try-reprex.zip

The output logs for the two examples are attached here:

crash-01.log crash-02.log

I am running Arch Linux, with the following output from R.version:

> R.version

platform       x86_64-pc-linux-gnu         
arch           x86_64                      
os             linux-gnu                   
system         x86_64, linux-gnu           
status                                     
major          4                           
minor          4.1                         
year           2024                        
month          06                          
day            14                          
svn rev        86737                       
language       R                           
version.string R version 4.4.1 (2024-06-14)
nickname       Race for Your Life 

Please let me know if you need further information.

-admash

admash commented 2 months ago

This segfault has also been externally verified on macOS running on ARM. The provided output is attached here:

crash-03.log

therneau commented 2 months ago

The issue appears to be with the return value when the iteration does not converge. I'll look deeper. Data sets where survreg does not converge are very rare.

therneau commented 2 months ago

The survreg code first fits a model with only an intercept and scale, to use as starting estimates. That iteration is failing, which leads to invalid arguments for the C routine that fails. This first fit has never failed before, and I have no checks for that. That will be easy to fix. Failure was guaranteed in your small data set (only one obs and 2 parameters); the bigger set is an interesting puzzle to understand.
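
To make the "one obs and 2 parameters" point concrete, here is a minimal sketch of why no finite maximum exists in that case. It is not the survreg code: it assumes a single uncensored observation and uses a Gaussian location-scale likelihood as a stand-in for survreg's distributions, and the names `loglik_one_obs`, `y`, and `sigmas` are hypothetical.

```r
# Gaussian log-likelihood for one uncensored observation y, with
# location mu and scale sigma (an illustrative stand-in, not survreg's model).
loglik_one_obs <- function(y, mu, sigma) {
  dnorm(y, mean = mu, sd = sigma, log = TRUE)
}

y <- 2.5               # hypothetical single observation
sigmas <- 10^(0:-6)    # scale shrinking toward zero: 1, 0.1, ..., 1e-6

# Put the location exactly on the observation and let the scale shrink.
ll <- loglik_one_obs(y, mu = y, sigma = sigmas)

# The log-likelihood keeps increasing as sigma -> 0, so an iterative
# fit of (location, scale) has no finite point to converge to.
print(data.frame(sigma = sigmas, loglik = ll))
```

With the location fixed at the observation, the log-likelihood is -log(sigma) plus a constant, so it grows without bound as the scale goes to zero.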

therneau commented 3 days ago

Now fixed. There was an error such that step halving was not properly invoked if the trial loglik was infinite. Your data set leads to a particularly bad first Newton-Raphson step.
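
For readers unfamiliar with step halving, the following is a minimal sketch of the general technique in plain R, not the package's C implementation. The function `newton_halving`, its arguments, and the toy objective are all hypothetical. The point it illustrates is that a non-finite trial log-likelihood has to be treated as a failed step explicitly, rather than relying only on a numeric comparison.

```r
# Newton-Raphson maximisation with step halving (illustrative sketch only).
newton_halving <- function(loglik, grad, hess, start,
                           max_iter = 50, max_halve = 30, tol = 1e-9) {
  theta <- start
  ll <- loglik(theta)
  total_halvings <- 0
  for (iter in seq_len(max_iter)) {
    step <- -solve(hess(theta), grad(theta))   # full Newton step
    trial <- theta + step
    ll_trial <- loglik(trial)
    halvings <- 0
    # A trial value that is infinite or NaN counts as a failed step,
    # exactly like a trial value that lowers the log-likelihood.
    while ((!is.finite(ll_trial) || ll_trial < ll) && halvings < max_halve) {
      step <- step / 2
      trial <- theta + step
      ll_trial <- loglik(trial)
      halvings <- halvings + 1
    }
    total_halvings <- total_halvings + halvings
    if (!is.finite(ll_trial) || ll_trial < ll) {
      # Halving did not rescue the step: report non-convergence.
      return(list(par = theta, loglik = ll, converged = FALSE,
                  iter = iter, halvings = total_halvings))
    }
    done <- abs(ll_trial - ll) < tol * (abs(ll) + tol)
    theta <- trial
    ll <- ll_trial
    if (done) {
      return(list(par = theta, loglik = ll, converged = TRUE,
                  iter = iter, halvings = total_halvings))
    }
  }
  list(par = theta, loglik = ll, converged = FALSE,
       iter = max_iter, halvings = total_halvings)
}

# Toy objective log(x) - x, maximised at x = 1 and defined as -Inf for x <= 0.
# From start = 3 the first full Newton step lands at x = -3, outside the domain.
toy_loglik <- function(x) if (x <= 0) -Inf else log(x) - x
toy_grad   <- function(x) 1 / x - 1
toy_hess   <- function(x) -1 / x^2

res <- newton_halving(toy_loglik, toy_grad, toy_hess, start = 3)
print(res)
```

In this toy case the first Newton step overshoots into a region where the log-likelihood is -Inf, and the halving loop pulls the trial point back into the valid region before the iteration continues.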

admash commented 2 days ago

Thanks Terry. You remain firmly ensconced in my statistical pantheon.