How can I get the KKBox split as described in the paper?

havakv / pycox

Survival analysis with PyTorch

BSD 2-Clause "Simplified" License


Issue #61

Closed LeeJunHyun closed 3 years ago

LeeJunHyun commented 3 years ago

I really appreciate your commitment to this field.

I loaded the KKBox dataset with kkbox.read_df().

The kkbox dataset consists of 2,814,735 instances, so how can I get the same split as described in the paper?

The paper reports 1,786,333 training samples, 661,748 test samples, and 198,665 validation samples (2,646,746 instances in total).

Thanks.

havakv commented 3 years ago

Thank you for the kind words!

You're probably looking for kkbox_v1.read_df(), which returns the dataset used in "Time-to-event prediction with neural networks and Cox regression".

The dataset in kkbox.read_df() is used in the paper "The Brier Score under Administrative Censoring: Problems and Solutions" and has some improvements compared to kkbox_v1 in addition to administrative censoring times.

havakv commented 3 years ago

As the read_df docs show, kkbox_v1.read_df takes an argument that selects the training, validation, or test set.
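For anyone landing here later, a minimal sketch of loading the three subsets and checking them against the sizes quoted in the question. It assumes pycox is installed, that the KKBox data has already been downloaded, and that the selector is a keyword named subset taking 'train', 'val', and 'test'; check the read_df docstring in your installed version before relying on those names. The helpers load_kkbox_v1_splits and compare_with_paper are hypothetical, written for this example and not part of pycox.

```python
# Split sizes quoted in the question above (as reported for the paper).
PAPER_SIZES = {"train": 1_786_333, "val": 198_665, "test": 661_748}


def load_kkbox_v1_splits(read_df=None, subsets=("train", "val", "test")):
    """Return {subset name: data frame} for the requested kkbox_v1 subsets.

    `read_df` can be injected so the helper can be tried without the data;
    when omitted it falls back to pycox's kkbox_v1.read_df.
    """
    if read_df is None:
        # Assumption: pycox is installed and the KKBox data is already on disk.
        from pycox.datasets import kkbox_v1
        read_df = kkbox_v1.read_df
    # Assumption: the keyword is `subset` with values 'train', 'val', 'test'.
    return {name: read_df(subset=name) for name in subsets}


def compare_with_paper(splits, expected=PAPER_SIZES):
    """Return {subset: (actual rows, expected rows)} for mismatching subsets."""
    return {
        name: (len(df), expected[name])
        for name, df in splits.items()
        if len(df) != expected[name]
    }


# Typical use (needs pycox and the downloaded data):
#   splits = load_kkbox_v1_splits()
#   print(compare_with_paper(splits))  # empty dict if every size matches
```

Loading each subset separately, instead of re-splitting the output of kkbox.read_df(), keeps the partition identical to the one shipped with kkbox_v1.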

LeeJunHyun commented 3 years ago

@havakv Thank you so much! Your reply is a great help to me :)

havakv commented 3 years ago

Happy to help! I'll close this issue then, and you can reopen it if you don't consider it solved.