google-research / circuit_training

CPU docker build e2e Smoke test failing #54

Closed luarss closed 9 months ago

luarss commented 1 year ago

System Information

I followed the instructions and also tried checking out the r0.0.3 version. This is the main error message in the collect_X.log files:

Traceback (most recent call last):
  File "/usr/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/workspace/circuit_training/learning/train_ppo.py", line 154, in <module>
    multiprocessing.handle_main(functools.partial(app.run, main))
  File "/usr/local/lib/python3.9/dist-packages/tf_agents/system/default/multiprocessing_core.py", line 77, in handle_main
    return app.run(parent_main_fn, *args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.9/dist-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/usr/local/lib/python3.9/dist-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.9/dist-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/workspace/circuit_training/learning/train_ppo.py", line 134, in main
    train_ppo_lib.train(
  File "/usr/local/lib/python3.9/dist-packages/gin/config.py", line 1605, in gin_wrapper
    utils.augment_exception_message_and_reraise(e, err_str)
  File "/usr/local/lib/python3.9/dist-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
    raise proxy.with_traceback(exception.__traceback__) from None
  File "/usr/local/lib/python3.9/dist-packages/gin/config.py", line 1582, in gin_wrapper
    return fn(*new_args, **new_kwargs)
  File "/workspace/circuit_training/learning/train_ppo_lib.py", line 258, in train
    save_model_trigger = triggers.PolicySavedModelTrigger(
  File "/usr/local/lib/python3.9/dist-packages/tf_agents/train/triggers.py", line 127, in __init__
    self._raw_policy_saver = self._build_saver(raw_policy, batch_size,
  File "/usr/local/lib/python3.9/dist-packages/tf_agents/train/triggers.py", line 168, in _build_saver
    saver = policy_saver.PolicySaver(
  File "/usr/local/lib/python3.9/dist-packages/tf_agents/policies/policy_saver.py", line 383, in __init__
    polymorphic_action_fn.get_concrete_function(
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/eager/polymorphic_function/polymorphic_function.py", line 1258, in get_concrete_function
    concrete = self._get_concrete_function_garbage_collected(args, kwargs)
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/eager/polymorphic_function/polymorphic_function.py", line 1238, in _get_concrete_function_garbage_collected
    self._initialize(args, kwargs, add_initializers_to=initializers)
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/eager/polymorphic_function/polymorphic_function.py", line 763, in _initialize
    self._variable_creation_fn # pylint: disable=protected-access
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/eager/polymorphic_function/tracing_compiler.py", line 171, in _get_concrete_function_internal_garbage_collected
    concrete_function, _ = self._maybe_define_concrete_function(args, kwargs)
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/eager/polymorphic_function/tracing_compiler.py", line 166, in _maybe_define_concrete_function
    return self._maybe_define_function(args, kwargs)
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/eager/polymorphic_function/tracing_compiler.py", line 356, in _maybe_define_function
    self._function_spec.make_canonicalized_monomorphic_type(
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/eager/polymorphic_function/function_spec.py", line 345, in make_canonicalized_monomorphic_type
    function_type_lib.canonicalize_to_monomorphic(
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/core/function/polymorphism/function_type.py", line 419, in canonicalize_to_monomorphic
    _make_validated_mono_param(name, arg, poly_parameter.kind,
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/core/function/polymorphism/function_type.py", line 359, in _make_validated_mono_param
    mono_type = trace_type.from_value(value, type_context)
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/core/function/trace_type/trace_type_builder.py", line 194, in from_value
    named_tuple_type, tuple(from_value(c, context) for c in value))
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/core/function/trace_type/trace_type_builder.py", line 194, in <genexpr>
    named_tuple_type, tuple(from_value(c, context) for c in value))
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/core/function/trace_type/trace_type_builder.py", line 176, in from_value
    elif isinstance(value, trace.SupportsTracingProtocol):
  File "/usr/local/lib/python3.9/dist-packages/typing_extensions.py", line 604, in __instancecheck__
    val = inspect.getattr_static(instance, attr)
  File "/usr/lib/python3.9/inspect.py", line 1624, in getattr_static
    instance_result = _check_instance(obj, attr)
  File "/usr/lib/python3.9/inspect.py", line 1571, in _check_instance
    instance_dict = object.__getattribute__(obj, "__dict__")
TypeError: this __dict__ descriptor does not support '_DictWrapper' objects
  In call to configurable 'train' (<function train at 0x7f431bc254c0>)

The full log file is attached: collect_1.log
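
Since the traceback above passes through tensorflow, tf_agents, gin, absl, and typing_extensions, the installed versions of those packages are probably useful context for this report. Below is a minimal sketch for collecting them inside the container using only the standard library. The package list is my own guess based on the module paths in the traceback, not something taken from the project's instructions.

```python
from importlib import metadata

# Hypothetical list, guessed from the module paths in the traceback;
# adjust it to match whatever the Docker image actually installs.
PACKAGES = ["tensorflow", "tf-agents", "gin-config", "absl-py", "typing_extensions"]


def installed_versions(names):
    """Return a dict mapping each distribution name to its version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name}: {version or 'not installed'}")
```

Running this in the same container that produced collect_X.log should give a version list that can be pasted under System Information.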

luarss commented 1 year ago

I found that the error is caused by my GPU having insufficient memory.

Could you clarify how much GPU VRAM is needed for the e2e smoke test?
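
For anyone who wants to compare numbers, here is a rough sketch of how total and used GPU memory can be read before launching the test. It assumes an NVIDIA GPU with `nvidia-smi` on the PATH; the helper names `parse_gpu_rows` and `query_gpus` are hypothetical and not part of circuit_training.

```python
import shutil
import subprocess

# nvidia-smi query that prints one CSV row per GPU, with memory values in MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=name,memory.total,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_rows(text):
    """Parse 'name, total, used' CSV rows into a list of dicts (values in MiB)."""
    rows = []
    for line in text.strip().splitlines():
        # Split from the right so a comma inside the GPU name cannot break parsing.
        name, total, used = [field.strip() for field in line.rsplit(",", 2)]
        rows.append({"name": name, "total_mib": int(total), "used_mib": int(used)})
    return rows


def query_gpus():
    """Return per-GPU memory info, or an empty list if nvidia-smi is unavailable or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    except subprocess.CalledProcessError:
        return []
    return parse_gpu_rows(result.stdout)


if __name__ == "__main__":
    for gpu in query_gpus():
        free = gpu["total_mib"] - gpu["used_mib"]
        print(f"{gpu['name']}: {free} MiB free of {gpu['total_mib']} MiB")
```

This only reports what the driver sees. It does not say how much memory the e2e test itself requires, which is the open question here.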

esonghori commented 9 months ago

This error should be resolved with the latest version.