huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

can't submit NoneType or dict arguments to Tensorboard #3063

Open bghira opened 2 weeks ago

bghira commented 2 weeks ago

System Info

latest accelerate version

Reproduction

    # NOTE: this snippet assumes module-level imports of copy, hashlib, json, and torch.
    def init_trackers(self):
        # We need to initialize the trackers we use, and also store our configuration.
        # The trackers initialize automatically on the main process.
        self.guidance_values_table = None
        if self.accelerator.is_main_process:
            # Copy args into public_args:
            public_args = copy.deepcopy(self.config)
            # Remove attributes that the trackers cannot serialise:
            delattr(public_args, "accelerator_project_config")
            delattr(public_args, "process_group_kwargs")
            delattr(public_args, "weight_dtype")
            delattr(public_args, "base_weight_dtype")
            # Convert to a plain dict up front so the filtered and unfiltered
            # paths below can both be hashed and logged the same way:
            public_args = vars(public_args)
            if "tensorboard" in self.config.report_to:
                # Filter out incompatible types for TensorBoard:
                public_args = {
                    key: value
                    for key, value in public_args.items()
                    if isinstance(value, (int, float, str, bool, torch.Tensor))
                }

            # Hash the contents of public_args to derive a deterministic ID
            # for a single set of params:
            public_args_hash = hashlib.md5(
                json.dumps(public_args, sort_keys=True).encode("utf-8")
            ).hexdigest()
            project_name = self.config.tracker_project_name or "simpletuner-training"
            tracker_run_name = (
                self.config.tracker_run_name or "simpletuner-training-run"
            )
            self.accelerator.init_trackers(
                project_name,
                config=public_args,
                init_kwargs={
                    "wandb": {
                        "name": tracker_run_name,
                        "id": f"{public_args_hash}",
                        "resume": "allow",
                        "allow_val_change": True,
                    }
                },
            )

Expected behavior

This isn't a minimal reproducer, but it outlines what we're doing to trigger the problem and to partially work around it.

The Accelerator init receives the --report_to value, which can be a comma-separated list like wandb,tensorboard, or just all, which presumably forwards the logged configuration to every tracker.
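For context, here is roughly how that value maps onto the tracking API (the project name and config values are placeholders):

    from accelerate import Accelerator

    # A comma-separated --report_to value ends up as log_with
    # ("all" enables every available tracker):
    accelerator = Accelerator(log_with=["wandb", "tensorboard"], project_dir="output")

    # init_trackers then forwards the same config dict to every tracker above:
    accelerator.init_trackers("my-project", config={"learning_rate": 1e-4})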

However, this fan-out fails to account for the type limitations of each receiving backend.

wandb cannot handle torch dtypes or the accelerator's own config / kwargs objects, as they do not serialise.

Similarly, TensorBoard only handles int, float, str, bool, and torch.Tensor values, but Accelerate passes everything through unfiltered.
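For illustration, this is the constraint torch's TensorBoard writer enforces (the dtype value is just an example of an unsupported type):

    import torch
    from torch.utils.tensorboard import SummaryWriter

    writer = SummaryWriter("output/logs")
    # Raises a ValueError along the lines of
    # "value should be one of int, float, str, bool, or torch.Tensor":
    writer.add_hparams({"weight_dtype": torch.bfloat16}, {"metric": 0.0})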

I'm not sure what the best way to handle this is; maybe an ignore_unsupported_values option that we can set to True, which would then not pass unsupported types into a given backend.
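As a purely hypothetical sketch of that option (neither drop_unsupported_values nor the flag exists in Accelerate today):

    import torch

    # Types that torch's TensorBoard hparams plumbing accepts:
    TENSORBOARD_SUPPORTED_TYPES = (int, float, str, bool, torch.Tensor)

    def drop_unsupported_values(config: dict, supported_types: tuple) -> dict:
        """Return a copy of config without values the backend cannot log."""
        return {
            key: value
            for key, value in config.items()
            if isinstance(value, supported_types)
        }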

The reason I'm requesting this be supported directly in Accelerate is that we cannot manually initialise each tracker individually. If we could do that (or maybe I'm just missing how to do so), that would negate this request too.

muellerzr commented 2 weeks ago

You can manually initialize each tracker and pass the instance in, similar to the custom trackers: https://huggingface.co/docs/accelerate/usage_guides/tracking#implementing-custom-trackers

Just pass the instance to log_with (looking at it, we can/should expand the docs on this).
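Roughly like this (run names and paths here are illustrative):

    from accelerate import Accelerator
    from accelerate.tracking import TensorBoardTracker, WandBTracker

    # Build each tracker yourself instead of passing strings, so you control
    # exactly what each backend receives:
    tb_tracker = TensorBoardTracker(run_name="my-run", logging_dir="output/logs")
    wandb_tracker = WandBTracker(run_name="my-project")
    accelerator = Accelerator(log_with=[tb_tracker, wandb_tracker])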

You can then use the get_tracker API to run things yourself: https://huggingface.co/docs/accelerate/usage_guides/tracking#accessing-the-internal-tracker
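For example (assuming a wandb tracker was enabled):

    # Fetch a single backend after init_trackers and log to it directly,
    # bypassing the shared fan-out:
    wandb_tracker = accelerator.get_tracker("wandb")
    wandb_tracker.log({"train/loss": 0.1}, step=10)

    # unwrap=True returns the underlying library object (here, the wandb run):
    raw_run = accelerator.get_tracker("wandb", unwrap=True)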