Can you elaborate a bit on the kind of support that you're looking for? You can pass pandas DataFrames into (and out of) remote functions. Soon pandas DataFrames will make more efficient use of shared memory. As for scikit-learn, you can use scikit-learn to define tasks. You should be able to pass scikit-learn objects in and out of remote functions.
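For example, something along these lines should work (a minimal sketch, not taken from the docs; the toy data and the choice of LogisticRegression are just for illustration):

import ray
import pandas as pd
from sklearn.linear_model import LogisticRegression

ray.init()

@ray.remote
def fit_model(df):
    # the DataFrame is shipped to the worker; the fitted estimator comes back
    model = LogisticRegression()
    model.fit(df[["x1", "x2"]], df["y"])
    return model

# toy DataFrame, just for illustration
df = pd.DataFrame({"x1": [0, 1, 2, 3], "x2": [1, 0, 1, 0], "y": [0, 0, 1, 1]})
model = ray.get(fit_model.remote(df))
print(model.predict(df[["x1", "x2"]]))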
Hello @robertnishihara, thanks for the answer.
I was checking the docs and the examples, but I didn't find a proper solution.
If I understood correctly, you can do something like this (pseudo-code):
import ray
import pandas as pd

# download some data from an s3 bucket
@ray.remote
def create_df(s3):
    return pd.read_csv(s3)

# put the data together
@ray.remote
def put_together(df_1, df_2):
    df = pd.concat([df_1, df_2], ignore_index=True)
    return df

# compute something
@ray.remote
def compute_something(df):
    return df.describe()

data = [create_df.remote(s3) for s3 in list_s3]

# pairwise reduce: keep merging the two oldest results until one remains
while len(data) > 1:
    data.append(put_together.remote(data.pop(0), data.pop(0)))

result = compute_something.remote(*data)
ray.get(result)
Is this correct?
With scikit-learn I would like to fit models (an ensemble in this case) and then combine the models by averaging or by majority vote.
Yes, the above code looks like it should work. Let me know if you run into any problems. For averaging models, the structure could be the same.
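For the majority-vote case, a rough sketch could look like this (the estimator choices and the vote helper are just illustrative, not a recommended setup):

import numpy as np
import ray
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# assumes ray.init() has already been called

@ray.remote
def fit_model(model, X, y):
    # fit a scikit-learn estimator inside a Ray task and return the fitted object
    model.fit(X, y)
    return model

@ray.remote
def majority_vote(models, X):
    # stack per-model predictions and take the most common label for each sample
    preds = np.stack([m.predict(X) for m in models])
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, preds)

# random toy data, just for illustration
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, 100)

models = ray.get([fit_model.remote(m, X, y)
                  for m in [RandomForestClassifier(), DecisionTreeClassifier()]])
predictions = ray.get(majority_vote.remote(models, X))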
Thanks, I will try and let you know.
Tried... I ran into a few issues:
import ray
import pandas as pd

ray.init(redis_address="ip")

# download some data from an s3 bucket
@ray.remote
def create_df(s3):
    return pd.read_csv(s3, encoding="utf-8", error_bad_lines=False)

# put the data together
@ray.remote
def put_together(df_1, df_2):
    df = pd.concat([df_1, df_2], ignore_index=True)
    return df

# compute something
@ray.remote
def compute_something(df):
    return df.describe()

list_s3 = [
    's3://gdelt-open-data/events/1979.csv',
    's3://gdelt-open-data/events/1980.csv',
    's3://gdelt-open-data/events/1981.csv',
    's3://gdelt-open-data/events/1982.csv',
]

data = [create_df.remote(s3) for s3 in list_s3]

while len(data) > 1:
    data.append(put_together.remote(data.pop(0), data.pop(0)))

result = compute_something.remote(*data)
ray.get(result)
File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status (/ray/src/thirdparty/arrow/python/build/temp.linux-x86_64-3.5/lib.cxx:8092)
pyarrow.lib.ArrowMemoryError: malloc of size 67108864 failed
How much memory is on the machine? Also, what about the size of the object store (this should be printed when you run ray start)?
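(For reference: depending on the Ray version, the object store size can also be capped explicitly at startup. The parameter below is an assumption based on newer releases and may not exist in the version from this thread.)

import ray

# assumption: newer Ray versions accept an explicit object-store size in bytes;
# the parameter name may differ in older releases
ray.init(object_store_memory=500 * 1024 * 1024)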
3 machines with 1GB each (just for a basic test).
Do you mean this output?
Waiting for redis server at 127.0.0.1:6379 to respond...
Waiting for redis server at 127.0.0.1:59943 to respond...
Starting local scheduler with 1 CPUs, 0 GPUs
Failed to start the UI, you may need to run 'pip install jupyter'.
{'webui_url': None, 'redis_address': 'ip:6379', 'local_scheduler_socket_names': ['/tmp/scheduler37142258'], 'object_store_addresses': [ObjectStoreAddress(name='/tmp/plasma_store75305730', manager_name='/tmp/plasma_manager71723891', manager_port=19765)], 'node_ip_address': 'ip'}
# download some data from an s3 bucket
@ray.remote
def create_df(s3):
    return pd.read_csv(s3, encoding="utf-8", error_bad_lines=False)

# put the data together
@ray.remote
def put_together(df_1, df_2):
    df = pd.concat([df_1, df_2], ignore_index=True)
    return df

# compute something
@ray.remote
def compute_something(df):
    return df.describe()

list_s3 = [
    'http://hci.stanford.edu/courses/cs448b/data/ipasn/cs448b_ipasn.csv' for _ in range(45)
]

data = [create_df.remote(s3) for s3 in list_s3]

while len(data) > 1:
    data.append(put_together.remote(data.pop(0), data.pop(0)))

result = compute_something.remote(*data)
ray.get(result)
With 45 files I usually get a similar error; with 40 it works fine.
I guess Arrow should handle this more gracefully, no?
Error:
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/ray/worker.py", line 786, in _process_task
self._store_outputs_in_objstore(return_object_ids, outputs)
File "/usr/local/lib/python3.5/dist-packages/ray/worker.py", line 713, in _store_outputs_in_objstore
self.put_object(objectids[i], outputs[i])
File "/usr/local/lib/python3.5/dist-packages/ray/worker.py", line 355, in put_object
self.store_and_register(object_id, value)
File "/usr/local/lib/python3.5/dist-packages/ray/worker.py", line 290, in store_and_register
object_id.id()), self.serialization_context)
File "pyarrow/plasma.pyx", line 394, in pyarrow.plasma.PlasmaClient.put (/ray/src/thirdparty/arrow/python/build/temp.linux-x86_64-3.5/plasma.cxx:5069)
File "pyarrow/serialization.pxi", line 226, in pyarrow.lib.serialize (/ray/src/thirdparty/arrow/python/build/temp.linux-x86_64-3.5/lib.cxx:68724)
File "pyarrow/serialization.pxi", line 99, in pyarrow.lib.SerializationContext._serialize_callback (/ray/src/thirdparty/arrow/python/build/temp.linux-x86_64-3.5/lib.cxx:66772)
File "/usr/local/lib/python3.5/dist-packages/ray/pyarrow_files/pyarrow/serialization.py", line 113, in _serialize_pandas_dataframe
return serialize_pandas(obj).to_pybytes()
File "/usr/local/lib/python3.5/dist-packages/ray/pyarrow_files/pyarrow/ipc.py", line 166, in serialize_pandas
writer.write_batch(batch)
File "pyarrow/ipc.pxi", line 202, in pyarrow.lib._RecordBatchWriter.write_batch (/ray/src/thirdparty/arrow/python/build/temp.linux-x86_64-3.5/lib.cxx:60518)
File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status (/ray/src/thirdparty/arrow/python/build/temp.linux-x86_64-3.5/lib.cxx:8092)
pyarrow.lib.ArrowMemoryError: malloc of size 67108864 failed
What would you prefer the behavior to be in this case? We should be able to do more to reduce the memory footprint of pandas DataFrames down the road.
I think the operation should be fragmented in a way that doesn't saturate the memory of the individual machines.
Let's say the S3 files sum to 20 GB while the memory of the cluster sums to 10 GB.
You can raise a memory error, but that's not what you expect from a distributed library, I think.
So the preferred behaviour would be to chunk the pandas operations.
In pseudo-code:

s3 = files  # tuple or list
s3_size = sum([file.size for file in s3])
if s3_size > ram:
    # read in chunks
else:
    # read the whole file in memory
I would be happy to work on this if help is needed.
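A rough, runnable sketch of what I mean, assuming per-chunk summaries are enough (the chunk size and the describe() aggregation are placeholders):

import pandas as pd
import ray

@ray.remote
def describe_chunk(url, chunksize=100000):
    # read the CSV in fixed-size chunks instead of all at once,
    # so each task only holds one chunk (plus its small summaries) in memory
    summaries = [chunk.describe() for chunk in pd.read_csv(url, chunksize=chunksize)]
    return pd.concat(summaries)

results = ray.get([describe_chunk.remote(url) for url in list_s3])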
I see, some of this could be addressed by spilling to disk (#1157).
Another approach would be to improve the scheduling algorithm to schedule tasks in a way that is likely to use less memory. That's probably out of scope for now.
Another approach would be to have a dataframe library that implements operations lazily and submits tasks in a way that requires less memory.
Thanks for the feedback. Closing here, moving to #1157.
Hi, is there already support for pandas and scikit-learn?