Is wsknn suitable for large-scale datasets? #17

ralgond commented 1 year ago

I have a dataset with 10M sessions and 1M items. :-)

Thank you.

SimonMolinsky commented 1 year ago

Hi, it depends. The model is memory-hungry, and sometimes it is impossible to process big datasets, but we've used similar ones. Our current data is close to yours: about 10M sessions, but fewer items, approx. 200k. (We had a case where we used more items, but the model performed better when we divided the items into their closest categories.)
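To illustrate the idea of dividing items into categories, here is a minimal sketch in plain Python. It assumes sessions are stored as a mapping from a session ID to a list of item IDs, and that an item-to-category lookup is available. The function name `split_sessions_by_category` and both data layouts are hypothetical, chosen only for this example; this is not wsknn's API or necessarily how the split was done in our case.

```python
from collections import defaultdict


def split_sessions_by_category(sessions, item_category):
    """Build one smaller session map per item category.

    sessions: dict of session_id -> list of item ids (assumed layout)
    item_category: dict of item id -> category name (assumed lookup)
    """
    per_category = defaultdict(dict)
    for session_id, items in sessions.items():
        for item in items:
            category = item_category.get(item)
            if category is None:
                continue  # skip items without a known category
            # Keep the item order within each per-category session
            per_category[category].setdefault(session_id, []).append(item)
    return dict(per_category)


# Toy data, made up for illustration only
sessions = {
    "s1": ["phone_a", "phone_b", "laptop_a"],
    "s2": ["laptop_a", "laptop_b"],
    "s3": ["phone_b"],
}
item_category = {
    "phone_a": "phones",
    "phone_b": "phones",
    "laptop_a": "laptops",
    "laptop_b": "laptops",
}

parts = split_sessions_by_category(sessions, item_category)
```

Each entry of `parts` is a smaller session map that could be processed separately, which keeps the number of items per model lower than in the full dataset.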

But you made me think. I will check the limits of the model with simulations and update the docs accordingly.

Sorry for the late answer, I was on the move between countries during the winter holidays and didn't notice your question.

SimonMolinsky commented 1 year ago

Hi @ralgond,

I've done some benchmarking tests on my "normal" machine here: https://github.com/nokaut/wsknn/tree/dev#benchmarking

What can I say... your model is limited by the RAM that you have. You should assume that 2x more RAM is needed for operations to make sure everything runs fine in production. We will optimize it further in the future. This may also be useful to you: with issue #18 we will introduce data preprocessing pipelines that make it easier to treat this package as a single module that you can put in one container and ML pipeline. Thanks for your interest!
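As a rough way to apply the 2x rule before loading a full dataset, here is a minimal sketch in plain Python. It measures a small sample of sessions, extrapolates to the full session count, and multiplies by a headroom factor. The helpers `deep_size` and `fits_in_ram`, the dict-of-lists layout, and reading "2x" as twice the in-memory size of the data are all assumptions made for this example; none of this is part of wsknn.

```python
import sys


def deep_size(obj, seen=None):
    """Roughly estimate the size in bytes of nested built-in containers.

    Shared objects (for example, repeated identical strings) are counted once.
    """
    if seen is None:
        seen = set()
    if id(obj) in seen:
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_size(k, seen) + deep_size(v, seen) for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_size(x, seen) for x in obj)
    return size


def fits_in_ram(sample, total_sessions, available_bytes, headroom=2.0):
    """Extrapolate from a sample of sessions and apply a headroom factor.

    headroom=2.0 encodes the assumed "2x more RAM" rule of thumb.
    Returns (fits, required_bytes).
    """
    per_session = deep_size(sample) / len(sample)
    required = per_session * total_sessions * headroom
    return required <= available_bytes, required


# Toy sample, made up for illustration only
sample = {
    "s1": ["item_1", "item_2", "item_3"],
    "s2": ["item_2", "item_4"],
    "s3": ["item_5"],
}

# Assumed numbers: 10M sessions, 16 GiB of RAM available
fits, required = fits_in_ram(sample, 10_000_000, 16 * 1024**3)
print(f"estimated requirement: {required / 1024**3:.2f} GiB, fits: {fits}")
```

The estimate is only as good as the sample: a real dataset shares item IDs across many sessions, and the model's own internal structures may take more or less space than raw Python containers, so treat the result as an order-of-magnitude check.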
