ray-project / tune-sklearn

A drop-in replacement for Scikit-Learn’s GridSearchCV / RandomizedSearchCV -- but with cutting edge hyperparameter tuning techniques.
https://docs.ray.io/en/master/tune/api_docs/sklearn.html
Apache License 2.0

GridSearchCV, RandomizedSearchCV don't support training on big datasets? #232

Open ICESDHR opened 2 years ago

ICESDHR commented 2 years ago

GridSearchCV and RandomizedSearchCV work well when the dataset is small, but they break down when the dataset is large.

I read the GridSearchCV implementation. In the _fit() function, you put the dataset into the Ray object store, but ray.put() uses gRPC to transfer the data, and gRPC protobuf doesn't support messages larger than 2 GB. Is that right?

X_id = ray.put(X)
y_id = ray.put(y)
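
For context, roughly the pattern this corresponds to (a simplified sketch of sharing object-store data across trials, not the actual tune-sklearn code):

    import numpy as np
    import ray

    ray.init()

    X = np.random.rand(100_000, 20)
    y = np.random.randint(2, size=100_000)

    # Store the training data once and hand lightweight references to each trial.
    X_id = ray.put(X)
    y_id = ray.put(y)

    @ray.remote
    def run_trial(X, y, params):
        # ObjectRefs passed as top-level arguments are resolved by Ray before the
        # task runs, so X and y arrive here as plain numpy arrays.
        return {"params": params, "n_rows": len(X)}

    results = ray.get([run_trial.remote(X_id, y_id, {"alpha": a}) for a in (0.1, 1.0)])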

I'm asking because I'd like to know whether you're aware of this situation and whether you have a plan to optimize it.

Thanks for your reply!

Yard1 commented 2 years ago

ray.put should support objects bigger than 2 GB. Are you sure you are not simply running out of memory? What sort of errors are you getting?
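
As a quick sanity check, something like the following should succeed when the driver runs directly on a cluster node (not through Ray Client), since the object store itself has no 2 GB limit. This is a rough sketch, assuming the node has enough object store memory for the array:

    import numpy as np
    import ray

    ray.init()  # driver on the cluster itself, not a ray:// client connection

    big = np.zeros(3 * 1024**3 // 8, dtype=np.float64)  # ~3 GB array
    ref = ray.put(big)   # goes into the shared-memory object store; no gRPC message is involved
    print(ray.get(ref).nbytes)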

ICESDHR commented 2 years ago

My Ray cluster worker nodes have 20 GB of memory. I asked this question in the Ray project too, and they advised me not to use ray.put() to transfer big data :(

 ERROR - Exception serializing message!
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/grpc/_common.py", line 86, in _transform
    return transformer(message)
ValueError: Message ray.rpc.DataRequest exceeds maximum protobuf size of 2GB: 5979803541
ERROR dataclient.py:150 -- Unrecoverable error in data channel.
richardliaw commented 2 years ago

Are you using Kubernetes / Ray Client, @ICESDHR?

ICESDHR commented 2 years ago

Yeah, I use the Kubernetes Ray operator here. I follow these steps:

  1. deploy the RayCluster CRD,
  2. deploy the operator,
  3. create a RayCluster containing one head node and three worker nodes,
  4. create a Kubernetes job; in this job I connect to the RayCluster successfully, but if I use ray.put() on a dataset larger than 2 GB I always get this error, while smaller datasets work fine (see the snippet after this list).
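
A minimal reproduction of the failing step, roughly what my job does (the head service address is a placeholder for my setup):

    import numpy as np
    import ray

    # Connect through Ray Client from the Kubernetes job pod.
    ray.init("ray://raycluster-head-svc:10001")

    small = np.zeros(100 * 1024**2 // 8)   # ~100 MB: ray.put() succeeds
    ray.put(small)

    big = np.zeros(3 * 1024**3 // 8)       # ~3 GB: with the Ray version I'm running,
    ray.put(big)                           # this fails with the protobuf 2 GB error above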

So I think there is a bug here, wouldn't you agree? If so, I wonder if we could discuss a solution? If not, please show me how to use this function correctly, thanks :)

Yard1 commented 2 years ago

Ok, I can see this being a problem. Would it be possible for you to load the data from S3/NFS/disk on the nodes? If yes, we could add support for that. How does that sound?
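
Something along these lines (a rough sketch of the idea with placeholder paths and column names, not an existing tune-sklearn API) -- each node would read the data itself instead of shipping it through the client:

    import pandas as pd
    import ray

    # Only the small path string crosses the Ray Client channel; the workers do the heavy reading.
    ray.init("ray://raycluster-head-svc:10001")

    @ray.remote
    def load_and_fit(data_path, params):
        # The worker reads the data from shared storage (NFS / S3 / local disk);
        # assumes a parquet reader such as pyarrow is installed on the workers.
        df = pd.read_parquet(data_path)
        X, y = df.drop(columns=["label"]), df["label"]  # "label" is an illustrative column name
        return {"params": params, "n_rows": len(X)}

    futures = [load_and_fit.remote("/mnt/nfs/train.parquet", {"alpha": a}) for a in (0.1, 1.0)]
    print(ray.get(futures))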

ICESDHR commented 2 years ago

Thanks for your reply! It would be nice if official support could be provided. How soon could such a patch be released?

ICESDHR commented 2 years ago

After loading the big dataset, I ran into two more problems:

  1. with this method, if one node runs many trials, it loads multiple copies of the data into memory repeatedly, which wastes memory;
  2. in my practice, even with sufficient resources, after loading the big dataset from disk the training process triggers the problems shown in the figure. I changed DEFAULT_GET_TIMEOUT in ray/tune/ray_trial_executor.py from 60 to 300 (see the snippet after this list), and it works.
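
For reference, the change I made (the constant lives in ray/tune/ray_trial_executor.py in the Ray release I'm using; it may differ in other versions):

    # ray/tune/ray_trial_executor.py
    DEFAULT_GET_TIMEOUT = 300.0  # raised from the default 60 seconds so fetching large objects doesn't time out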

Is there a better solution?

[attached screenshots of the errors]

Yard1 commented 2 years ago

It's not possible right now, but with the proposed changes, you should be able to use Ray Datasets, which should solve both the 2 GB issue and keep the amount of copying to a minimum. Will keep you updated.

ICESDHR commented 2 years ago

I'm trying to use the Ray Data functions. Operating as described above, ray.data.from_pandas() still has this problem, but ray.data.read_csv() works :( Usually the data is preprocessed with pandas first and then trained with Ray, so it would be great if this could be optimized.
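
A rough illustration of what I'm seeing (the client address and file paths are placeholders):

    import pandas as pd
    import ray

    ray.init("ray://raycluster-head-svc:10001")

    # read_csv: the workers read the file themselves, so only the path string
    # travels over the Ray Client channel -- this works even for large data.
    ds_ok = ray.data.read_csv("s3://my-bucket/train.csv")

    # from_pandas: the DataFrame lives in this local client process and has to be
    # shipped to the cluster, which hits the same 2 GB gRPC message limit for big frames.
    df = pd.read_csv("train.csv")
    ds_fail = ray.data.from_pandas(df)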