square / pysurvival

Open source package for Survival Analysis modeling
https://www.pysurvival.io/
Apache License 2.0

RSF implementation seems to hang when predicting #3

Closed: simonthorogood closed this 5 years ago

simonthorogood commented 5 years ago

I've installed pysurvival on my MacBook Pro (macOS 10.13.6), using brew for gcc and pip for the package, and have been able to train an RSF model in a Jupyter Notebook (though this took several minutes of high CPU activity). The training data has around 70 factors and 5000 rows.

I'm now trying to work with the model but when I call e.g.

risks = rsf.predict_risk(X_test)

the notebook just hangs indefinitely with no sign of CPU activity.

steph-likes-git commented 5 years ago

Hey @simonthorogood,

Unfortunately, without being able to see the notebook or the data, it's going to be difficult to provide any insights.

simonthorogood commented 5 years ago

Hi, thanks for getting back. It turned out my problem was that I was using a time scale with 1000s of distinct values (number of days, up to 15 years), resulting in 1000s of time 'buckets'. When I pre-bucketed the time variable into tenths of a year, everything worked (albeit slowly).

micmart commented 5 years ago

I'm experiencing a similar issue. I have a df with 40k rows and 21 variables, and I am following the Churn prediction tutorial. csf_fit() works fine and takes 45 min to run, but when I then run concordance_index() my session crashes and I lose my csf object. I am a Python novice, so I can't really say what the issue is, only that it works with a smaller sample (i.e. 5k rows) and seems to break as soon as the data becomes large.

@simonthorogood, I also have a 'days since 1970-01-01' variable. But shouldn't that be fine even if the variable has 1000s of distinct values? Otherwise any other float variable should cause the same issue.

Btw, I have tried to store the csf object using pickle but wasn't able to. Any other way to store the csf object to disk? Thanks!

simonthorogood commented 5 years ago

@micmart I'm afraid I don't know the underlying causes, but my experience was very similar to yours (model gets fitted but c-index calculation crashes). My 'fix' was to bucket the time values so that there were far fewer distinct values present (e.g. 1.1, 1.2, 1.3 years etc.). Note that pysurvival models seem to create time buckets (see time_buckets attribute) automatically during fitting and the number of buckets created seems to reflect the number of values present in the input.
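
A minimal sketch of that bucketing, assuming the durations are stored in days and using a 365.25-day year. The function name and the synthetic data are hypothetical and only for illustration; nothing here is part of pysurvival itself, and only NumPy is used.

```python
import numpy as np

DAYS_PER_YEAR = 365.25  # assumption: average year length including leap years


def bucket_days_to_tenths_of_year(days):
    """Convert durations in days to years, rounded up to the next 0.1 year."""
    years = np.asarray(days, dtype=float) / DAYS_PER_YEAR
    # Rounding up keeps every positive duration strictly above zero.
    return np.ceil(years * 10.0) / 10.0


# Synthetic durations for illustration only: 5000 rows, up to 15 years in days.
rng = np.random.default_rng(0)
T_days = rng.integers(1, int(15 * DAYS_PER_YEAR) + 1, size=5000)
T_bucketed = bucket_days_to_tenths_of_year(T_days)

# Compare the number of distinct time values before and after bucketing.
n_before = np.unique(T_days).size
n_after = np.unique(T_bucketed).size
print(n_before, "distinct times before,", n_after, "after")
```

The bucketed array would then be used as the time variable when fitting, in place of the raw day counts. With 15 years of data in tenths of a year, there can be at most 150 distinct values.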