ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[core][compiled graphs] Support all torch.dtypes for tensors sent through shared memory channels #48141

Open ruisearch42 opened 1 month ago

ruisearch42 commented 1 month ago

What happened + What you expected to happen

Got the following error when experimenting with OpenRLHF:

TypeError: Got unsupported ScalarType BFloat16 :

(CriticModelRayActor pid=33730)   cpu_tensor = torch.from_numpy(np_array) [repeated 2x across cluster]
(RewardModelRayActor pid=33731) Traceback (most recent call last): [repeated 4x across cluster]
(RewardModelRayActor pid=33731) 2024-10-21 09:10:49 ERROR    Compiled DAG task exited with exception
(RewardModelRayActor pid=33731)   File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/experimental/channel/shared_memory_channel.py", line 469, in write [repeated 5x across cluster]
(RewardModelRayActor pid=33731)     serialized_value = self._worker.get_serialization_context().serialize(
(RewardModelRayActor pid=33731)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RewardModelRayActor pid=33731)   File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/experimental/channel/torch_tensor_type.py", line 108, in serialize [repeated 2x across cluster]
(RewardModelRayActor pid=33731)     return self._serialize_to_msgpack(value)
(RewardModelRayActor pid=33731)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RewardModelRayActor pid=33731)   File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/serialization.py", line 497, in _serialize_to_msgpack
(RewardModelRayActor pid=33731)     pickle5_serialized_object = self._serialize_to_pickle5(
(RewardModelRayActor pid=33731)                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RewardModelRayActor pid=33731)   File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/serialization.py", line 439, in _serialize_to_pickle5 [repeated 2x across cluster]
(RewardModelRayActor pid=33731)     raise e
(RewardModelRayActor pid=33731)     inband = pickle.dumps(
(RewardModelRayActor pid=33731)            ^^^^^^^^^^^^^ [repeated 2x across cluster]
(RewardModelRayActor pid=33731)   File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/cloudpickle/cloudpickle.py", line 1479, in dumps
(RewardModelRayActor pid=33731)     cp.dump(obj)
(RewardModelRayActor pid=33731)   File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/cloudpickle/cloudpickle.py", line 1245, in dump
(RewardModelRayActor pid=33731)     return super().dump(obj)
(RewardModelRayActor pid=33731)            ^^^^^^^^^^^^^^^^^
(RewardModelRayActor pid=33731)   File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/serialization.py", line 175, in _CloudPicklerReducer
(RewardModelRayActor pid=33731)     return custom_deserializer, (custom_serializer(obj),)
(RewardModelRayActor pid=33731)                                  ^^^^^^^^^^^^^^^^^^^^^^
(RewardModelRayActor pid=33731)     return ctx.serialization_context.serialize_tensor(t)
(RewardModelRayActor pid=33731)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RewardModelRayActor pid=33731)   File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/experimental/channel/serialization_context.py", line 77, in serialize_tensor
(RewardModelRayActor pid=33731)     return self.serialize_to_numpy(tensor)
(RewardModelRayActor pid=33731)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RewardModelRayActor pid=33731)   File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/experimental/channel/serialization_context.py", line 88, in serialize_to_numpy
(RewardModelRayActor pid=33731)     return tensor.numpy()
(RewardModelRayActor pid=33731)            ^^^^^^^^^^^^^^
(RewardModelRayActor pid=33731) TypeError: Got unsupported ScalarType BFloat16
(RewardModelRayActor pid=33731)   File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/dag/compiled_dag_node.py", line 118, in do_exec_tasks
(RewardModelRayActor pid=33731)     done = tasks[operation.exec_task_idx].exec_operation(
(RewardModelRayActor pid=33731)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RewardModelRayActor pid=33731)   File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/dag/compiled_dag_node.py", line 547, in exec_operation
(RewardModelRayActor pid=33731)     return self._write()
(RewardModelRayActor pid=33731)   File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/dag/compiled_dag_node.py", line 517, in _write
(RewardModelRayActor pid=33731)     self.output_writer.write(output_val)
(RewardModelRayActor pid=33731)     channel.write(val_i, timeout)
(RewardModelRayActor pid=33731)     channel.write(value, timeout)
(RewardModelRayActor pid=33731)     self._buffers[self._next_write_index].write(value, timeout)

Versions / Dependencies

head

Reproduction script

will add later

Issue Severity

None

SumanthRH commented 1 month ago

Another data point: PyTorch has a number of other data types that aren't natively supported in NumPy (float8, etc.). I'm not sure exactly which tensor was converted to a NumPy array here, but it would be best not to assume any interoperability with NumPy and to keep tensors as tensors during serialization.
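For illustration, the incompatibility is easy to trigger outside of Ray (a minimal standalone check):

import torch

t = torch.zeros(4, dtype=torch.bfloat16)
# NumPy has no bfloat16, so the conversion raises the same error as the
# traceback above: TypeError: Got unsupported ScalarType BFloat16
t.numpy()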

stephanie-wang commented 1 month ago

Hmm, really what we need is the ability to zero-copy deserialize torch.Tensors. Using numpy was the easiest option at the time, but this is indeed an issue. Maybe Arrow is an option?
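For context, the appeal of the numpy path is that the reverse direction is already zero-copy; a standalone illustration:

import numpy as np
import torch

arr = np.zeros(4, dtype=np.float32)
t = torch.from_numpy(arr)   # wraps the same buffer; no bytes are copied
arr[0] = 1.0
assert t[0].item() == 1.0   # the tensor observes the write to the array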

stephanie-wang commented 1 month ago

Hmm actually seems like this could be supported by using torch.Tensor.view with a uint8 dtype:

In [99]: t
Out[99]:
tensor([ 1.8750, -0.6875, -1.1250, -1.3750,  1.3750, -1.1250,  0.4688, -0.4062,
         0.8750, -1.7500], dtype=torch.float8_e4m3fn)

In [104]: torch.as_tensor(t.view(torch.uint8).numpy()).view(torch.float8_e4m3fn)
Out[104]:
tensor([ 1.8750, -0.6875, -1.1250, -1.3750,  1.3750, -1.1250,  0.4688, -0.4062,
         0.8750, -1.7500], dtype=torch.float8_e4m3fn)
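
The same round trip covers the BFloat16 case from the original traceback; a minimal standalone check (not Ray code):

import torch

t = torch.randn(4, dtype=torch.bfloat16)
# Reinterpret the same bytes as uint8 so .numpy() succeeds, then reverse it:
restored = torch.as_tensor(t.view(torch.uint8).numpy()).view(torch.bfloat16)
assert torch.equal(t, restored)  # bit-for-bit identical, no precision loss
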
sahilgupta2105 commented 4 weeks ago

Hi @stephanie-wang,

I am interested in taking up this issue.

stephanie-wang commented 4 weeks ago

> Hi @stephanie-wang,
>
> I am interested in taking up this issue.

Thanks, feel free to open a PR. Will assign it to you for now and revisit in a week or so.

sahilgupta2105 commented 4 weeks ago

Thanks, do you have suggestions for reproducing the issue locally?

stephanie-wang commented 4 weeks ago

You want a DAG that looks something like:

t = A.return_tensor(...).with_type_hint(TorchTensorType())
dag = B.read_tensor(t)

And A should return a tensor with a dtype unsupported by numpy.

sahilgupta2105 commented 4 weeks ago

I am new to the ecosystem, so I'm sorry if this is an obvious question: what do A and B represent in your previous comment? I tried creating an example using the "Getting Started" docs, but I am struggling to reproduce the issue.

import ray
from ray import workflow
from ray.experimental.channel.torch_tensor_type import TorchTensorType
import torch

@ray.remote
def return_tensor():
    return torch.Tensor([1.0, 2.0])

@ray.remote
def read_tensor(a):
    return a

dag = read_tensor.bind(return_tensor.bind().with_type_hint(TorchTensorType()))

print(workflow.run(dag))

Am I on the right track? Also, if I try to specify the dtype param in the torch.Tensor constructor, the program fails with a weird torch data type error. I am not sure if that's related.

stephanie-wang commented 4 weeks ago

Hi @sahilgupta2105, yes we don't have great docs for compiled graphs yet. Please work through the developer guide first and comment here if you still have questions.

sahilgupta2105 commented 4 weeks ago

Thanks, the developer guide is helpful. I was able to reproduce the issue.

import ray
import ray.dag
from ray.experimental.channel.torch_tensor_type import TorchTensorType
import torch

@ray.remote
class Actor:
  def process(self, tensor: torch.Tensor):
    return tensor.shape

actor = Actor.remote()

with ray.dag.InputNode() as inp:
  inp = inp.with_type_hint(TorchTensorType())
  dag = actor.process.bind(inp)

dag = dag.experimental_compile()
print(ray.get(dag.execute(torch.zeros(10, dtype=torch.float8_e4m3fn))))

sahilgupta2105 commented 4 weeks ago

> Hmm actually seems like this could be supported by using torch.Tensor.view with a uint8 dtype:
>
> In [99]: t
> Out[99]:
> tensor([ 1.8750, -0.6875, -1.1250, -1.3750,  1.3750, -1.1250,  0.4688, -0.4062,
>          0.8750, -1.7500], dtype=torch.float8_e4m3fn)
>
> In [104]: torch.as_tensor(t.view(torch.uint8).numpy()).view(torch.float8_e4m3fn)
> Out[104]:
> tensor([ 1.8750, -0.6875, -1.1250, -1.3750,  1.3750, -1.1250,  0.4688, -0.4062,
>          0.8750, -1.7500], dtype=torch.float8_e4m3fn)

@stephanie-wang to be sure, you meant type-casting the torch tensor to the unsigned int type with the appropriate precision, right? E.g. float8 -> uint8, float16 -> uint16, ...

stephanie-wang commented 4 weeks ago

No, cast everything to uint8 (because it's a supported np dtype and the lowest possible precision) and then cast back on the receiving side.

sahilgupta2105 commented 4 weeks ago

Gotcha. For my understanding, wouldn't casting everything to uint8 cause a loss of information if the input tensor is of higher precision?

sahilgupta2105 commented 3 weeks ago

@stephanie-wang It seems like the view operation only reinterprets the tensor's metadata (dtype and shape) without touching the underlying bytes, so no information is lost when floatXX is viewed as uint8. However, to restore the data correctly on the receiving side, we need the dtype of the original tensor. NumPy arrays support custom metadata; shall we use that to store the dtype at serialization time so deserialization can restore the correct format?
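A quick standalone check of that claim:

import torch

t = torch.randn(3, dtype=torch.float16)
b = t.view(torch.uint8)                # reinterpret: shape (3,) becomes (6,)
assert b.shape == (6,)                 # 2 bytes per float16 element, nothing dropped
assert b.data_ptr() == t.data_ptr()    # no copy: both views share one buffer
assert torch.equal(b.view(torch.float16), t)  # the round trip is exact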

stephanie-wang commented 3 weeks ago

It would be better to not rely on a third-party API for passing the dtype. You can pass it through Ray instead. Check out how TorchTensorType is used.

sahilgupta2105 commented 3 weeks ago

I dug more into the code. Can you confirm my understanding before I send out a PR?

If you agree so far, do I need special handling for the "AUTO" dtype?

stephanie-wang commented 2 weeks ago

I don't think you need to touch TorchTensorType at all. Should be enough to just modify the existing custom serialization function.
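
As a hedged sketch of that change, keyed to the serialize_to_numpy frame in the traceback above (the class shape and the deserialization-side method name here are assumptions, not Ray's actual code):

import numpy as np
import torch

class SerializationContextSketch:
    # Stand-in for SerializationContext in
    # ray/experimental/channel/serialization_context.py; sketch only.

    def serialize_to_numpy(self, tensor: torch.Tensor):
        tensor = tensor.to("cpu").contiguous()
        # Instead of the current bare `return tensor.numpy()` (which rejects
        # bfloat16, float8, ...), reinterpret as uint8 and pass the dtype along.
        return tensor.view(torch.uint8).numpy(), tensor.dtype

    def deserialize_from_numpy(self, np_array: np.ndarray, dtype: torch.dtype):
        # Wrap the buffer without copying, then restore the original dtype/shape.
        return torch.as_tensor(np_array).view(dtype)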