ludwig-ai / ludwig

Low-code framework for building custom LLMs, neural networks, and other AI models
http://ludwig.ai
Apache License 2.0

Ray - protobuf issue #3963

Open robhheise opened 6 months ago

robhheise commented 6 months ago

Describe the bug

Using one of the examples, I am getting a protobuf error trace.

To Reproduce

    (train_stats, preprocessed_data, output_directory) = model.train(
        dataset,
        model_name="rotten_tomatoes",
        output_directory="results_rotten_tomatoes",
    )

config_yaml = """ input_features:

Screenshots

    Traceback (most recent call last):
      File "/Users/robertheise/Documents/SD/kh-accel/model.py", line 52, in <module>
        (train_stats, preprocessed_data, output_directory) = model.train(
      File "/Users/robertheise/Documents/SD/kh-accel/venv/lib/python3.10/site-packages/ludwig/api.py", line 654, in train
        self._tune_batch_size(trainer, training_set, random_seed=random_seed)
      File "/Users/robertheise/Documents/SD/kh-accel/venv/lib/python3.10/site-packages/ludwig/api.py", line 882, in _tune_batch_size
        tuned_batch_size = trainer.tune_batch_size(
      File "/Users/robertheise/Documents/SD/kh-accel/venv/lib/python3.10/site-packages/ludwig/backend/ray.py", line 551, in tune_batch_size
        result = runner.run(
      File "/Users/robertheise/Documents/SD/kh-accel/venv/lib/python3.10/site-packages/ludwig/backend/ray.py", line 441, in run
        return fit_no_exception(trainer)
      File "/Users/robertheise/Documents/SD/kh-accel/venv/lib/python3.10/site-packages/ludwig/backend/ray.py", line 335, in fit_no_exception
        result_grid = tuner.fit()
      File "/Users/robertheise/Documents/SD/kh-accel/venv/lib/python3.10/site-packages/ray/tune/tuner.py", line 292, in fit
        return self._local_tuner.fit()
      File "/Users/robertheise/Documents/SD/kh-accel/venv/lib/python3.10/site-packages/ray/tune/impl/tuner_internal.py", line 455, in fit
        analysis = self._fit_internal(trainable, param_space)
      File "/Users/robertheise/Documents/SD/kh-accel/venv/lib/python3.10/site-packages/ray/tune/impl/tuner_internal.py", line 572, in _fit_internal
        analysis = run(
      File "/Users/robertheise/Documents/SD/kh-accel/venv/lib/python3.10/site-packages/ray/tune/tune.py", line 678, in run
        callbacks = _create_default_callbacks(
      File "/Users/robertheise/Documents/SD/kh-accel/venv/lib/python3.10/site-packages/ray/tune/utils/callback.py", line 105, in _create_default_callbacks
        callbacks.append(TBXLoggerCallback())
      File "/Users/robertheise/Documents/SD/kh-accel/venv/lib/python3.10/site-packages/ray/tune/logger/tensorboardx.py", line 165, in __init__
        from tensorboardX import SummaryWriter
      File "/Users/robertheise/Documents/SD/kh-accel/venv/lib/python3.10/site-packages/tensorboardX/__init__.py", line 5, in <module>
        from .torchvis import TorchVis
      File "/Users/robertheise/Documents/SD/kh-accel/venv/lib/python3.10/site-packages/tensorboardX/torchvis.py", line 11, in <module>
        from .writer import SummaryWriter
      File "/Users/robertheise/Documents/SD/kh-accel/venv/lib/python3.10/site-packages/tensorboardX/writer.py", line 18, in <module>
        from .event_file_writer import EventFileWriter
      File "/Users/robertheise/Documents/SD/kh-accel/venv/lib/python3.10/site-packages/tensorboardX/event_file_writer.py", line 28, in <module>
        from .proto import event_pb2
      File "/Users/robertheise/Documents/SD/kh-accel/venv/lib/python3.10/site-packages/tensorboardX/proto/event_pb2.py", line 16, in <module>
        from tensorboardX.proto import summary_pb2 as tensorboardX_dot_proto_dot_summary__pb2
      File "/Users/robertheise/Documents/SD/kh-accel/venv/lib/python3.10/site-packages/tensorboardX/proto/summary_pb2.py", line 16, in <module>
        from tensorboardX.proto import tensor_pb2 as tensorboardX_dot_proto_dot_tensor__pb2
      File "/Users/robertheise/Documents/SD/kh-accel/venv/lib/python3.10/site-packages/tensorboardX/proto/tensor_pb2.py", line 16, in <module>
        from tensorboardX.proto import resource_handle_pb2 as tensorboardX_dot_proto_dot_resource__handle__pb2
      File "/Users/robertheise/Documents/SD/kh-accel/venv/lib/python3.10/site-packages/tensorboardX/proto/resource_handle_pb2.py", line 36, in <module>
        _descriptor.FieldDescriptor(
      File "/Users/robertheise/Documents/SD/kh-accel/venv/lib/python3.10/site-packages/google/protobuf/descriptor.py", line 553, in __new__
        _message.Message._CheckCalledFromGeneratedFile()
    TypeError: Descriptors cannot be created directly. If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:

  1. Downgrade the protobuf package to 3.20.x or lower.
  2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).

More information: https://developers.google.com/protocol-buffers/docs/news/2022-05-06#python-updates
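Workaround 2 from the error message only helps if the variable is set before `google.protobuf` is first imported in the process. A minimal sketch of doing that at the very top of a training script follows; the helper name `force_pure_python_protobuf` is hypothetical and not part of Ludwig, Ray, or protobuf.

```python
import os
import sys


def force_pure_python_protobuf() -> bool:
    """Select protobuf's pure-Python backend (workaround 2 in the error message).

    Returns True if the variable was set before google.protobuf was imported
    in this process, False if protobuf had already been loaded.
    """
    already_imported = "google.protobuf" in sys.modules
    os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"
    return not already_imported


if __name__ == "__main__":
    if not force_pure_python_protobuf():
        print("protobuf was already imported; set the variable earlier")
    # Import ludwig / ray only after this point.
```

As the error message notes, the pure-Python parser is much slower, so this is a stopgap rather than a fix. Workaround 1 (pinning protobuf to 3.20.x or lower) avoids that cost.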


arnavgarg1 commented 6 months ago

@robhheise May I know what ray version you're using?
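One way to collect the versions relevant to this question is to read them from the installed package metadata. This is a small sketch using only the standard library; the package list is an assumption based on the modules that appear in the traceback.

```python
from importlib import metadata

# Distributions that appear in the traceback (assumed to be the relevant ones).
PACKAGES = ("ludwig", "ray", "protobuf", "tensorboardX")


def installed_versions(names=PACKAGES):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```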

robhheise commented 6 months ago

Using the default local Ray version from this code:

    backend_config = {
        "type": "ray",
        "processor": {
            "parallelism": 6,
            "type": "dask",
        },
        "trainer": {
            "use_gpu": False,
            "num_workers": 3,
            "resources_per_worker": {
                "CPU": 2,
                "GPU": 0,
            },
        },
    }
    backend = initialize_backend(backend_config)

robhheise commented 6 months ago

Maybe a clarifying question: do I need to set up a running Ray cluster before initializing, and if so, do I have to use the containers available in the Ludwig GitHub repo? The documentation https://ludwig.ai/latest/user_guide/distributed_training/ suggests that I need to install Ray. FYI - the Ray Launcher link in the documentation returns a 404: https://docs.ray.io/en/latest/cluster/launcher.html

arnavgarg1 commented 6 months ago

@robhheise You don't need to! Ludwig can connect to an existing Ray cluster if one is already initialized in your environment; otherwise it will initialize a local Ray cluster for you.

Ray should have been installed as part of ludwig[distributed].
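To confirm which of these cases applies, one can check whether Ray is importable and whether it has already been initialized in the current process. This is only an illustrative sketch, not Ludwig's own logic; the function name `ray_status` is hypothetical, and `ray.is_initialized()` reports on the current process only, not on clusters running elsewhere.

```python
def ray_status() -> str:
    """Report whether Ray is importable and initialized in this process."""
    try:
        import ray  # expected to come with the ludwig[distributed] extra
    except ImportError:
        return "not installed"
    return "initialized" if ray.is_initialized() else "not initialized"


if __name__ == "__main__":
    print(f"Ray: {ray_status()}")
```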

robhheise commented 6 months ago

Thanks, so it does look like the local Ray cluster is being initialized, but I am still getting the following trace:

    Traceback (most recent call last):
      File "/Users/robertheise/Documents/SD/kh-accel/model.py", line 52, in <module>
        (train_stats, preprocessed_data, output_directory) = model.train(
      File "/Users/robertheise/Documents/SD/kh-accel/venv/lib/python3.10/site-packages/ludwig/api.py", line 654, in train
        self._tune_batch_size(trainer, training_set, random_seed=random_seed)
      File "/Users/robertheise/Documents/SD/kh-accel/venv/lib/python3.10/site-packages/ludwig/api.py", line 882, in _tune_batch_size
        tuned_batch_size = trainer.tune_batch_size(
      File "/Users/robertheise/Documents/SD/kh-accel/venv/lib/python3.10/site-packages/ludwig/backend/ray.py", line 551, in tune_batch_size
        result = runner.run(
      File "/Users/robertheise/Documents/SD/kh-accel/venv/lib/python3.10/site-packages/ludwig/backend/ray.py", line 441, in run
        return fit_no_exception(trainer)
      File "/Users/robertheise/Documents/SD/kh-accel/venv/lib/python3.10/site-packages/ludwig/backend/ray.py", line 335, in fit_no_exception
        result_grid = tuner.fit()
      File "/Users/robertheise/Documents/SD/kh-accel/venv/lib/python3.10/site-packages/ray/tune/tuner.py", line 292, in fit
        return self._local_tuner.fit()
      File "/Users/robertheise/Documents/SD/kh-accel/venv/lib/python3.10/site-packages/ray/tune/impl/tuner_internal.py", line 455, in fit
        analysis = self._fit_internal(trainable, param_space)
      File "/Users/robertheise/Documents/SD/kh-accel/venv/lib/python3.10/site-packages/ray/tune/impl/tuner_internal.py", line 572, in _fit_internal
        analysis = run(
      File "/Users/robertheise/Documents/SD/kh-accel/venv/lib/python3.10/site-packages/ray/tune/tune.py", line 678, in run
        callbacks = _create_default_callbacks(
      File "/Users/robertheise/Documents/SD/kh-accel/venv/lib/python3.10/site-packages/ray/tune/utils/callback.py", line 105, in _create_default_callbacks
        callbacks.append(TBXLoggerCallback())
      File "/Users/robertheise/Documents/SD/kh-accel/venv/lib/python3.10/site-packages/ray/tune/logger/tensorboardx.py", line 165, in __init__
        from tensorboardX import SummaryWriter
      File "/Users/robertheise/Documents/SD/kh-accel/venv/lib/python3.10/site-packages/tensorboardX/__init__.py", line 5, in <module>
        from .torchvis import TorchVis
      File "/Users/robertheise/Documents/SD/kh-accel/venv/lib/python3.10/site-packages/tensorboardX/torchvis.py", line 11, in <module>
        from .writer import SummaryWriter
      File "/Users/robertheise/Documents/SD/kh-accel/venv/lib/python3.10/site-packages/tensorboardX/writer.py", line 18, in <module>
        from .event_file_writer import EventFileWriter
      File "/Users/robertheise/Documents/SD/kh-accel/venv/lib/python3.10/site-packages/tensorboardX/event_file_writer.py", line 28, in <module>
        from .proto import event_pb2
      File "/Users/robertheise/Documents/SD/kh-accel/venv/lib/python3.10/site-packages/tensorboardX/proto/event_pb2.py", line 16, in <module>
        from tensorboardX.proto import summary_pb2 as tensorboardX_dot_proto_dot_summary__pb2
      File "/Users/robertheise/Documents/SD/kh-accel/venv/lib/python3.10/site-packages/tensorboardX/proto/summary_pb2.py", line 16, in <module>
        from tensorboardX.proto import tensor_pb2 as tensorboardX_dot_proto_dot_tensor__pb2
      File "/Users/robertheise/Documents/SD/kh-accel/venv/lib/python3.10/site-packages/tensorboardX/proto/tensor_pb2.py", line 16, in <module>
        from tensorboardX.proto import resource_handle_pb2 as tensorboardX_dot_proto_dot_resource__handle__pb2
      File "/Users/robertheise/Documents/SD/kh-accel/venv/lib/python3.10/site-packages/tensorboardX/proto/resource_handle_pb2.py", line 36, in <module>
        _descriptor.FieldDescriptor(
      File "/Users/robertheise/Documents/SD/kh-accel/venv/lib/python3.10/site-packages/google/protobuf/descriptor.py", line 553, in __new__
        _message.Message._CheckCalledFromGeneratedFile()
    TypeError: Descriptors cannot be created directly. If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:

  1. Downgrade the protobuf package to 3.20.x or lower.
  2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).

More information: https://developers.google.com/protocol-buffers/docs/news/2022-05-06#python-updates (dask:('getitem-bf2c0ba7d80a99c4d574d2369c944368', 0) pid=12348)