sattarov / FedTabDiff

Implementation of the paper: "FedTabDiff: Federated Learning of Diffusion Models for Synthetic Mixed-Type Tabular Data Generation"
https://arxiv.org/abs/2401.06263
MIT License

Simulation crashed. I can't find the reason. Maybe: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! #1

fwtsinghua opened this issue 2 months ago

fwtsinghua commented 2 months ago

Running main on Google Colab:

INFO :      Starting Flower simulation, config: num_rounds=100, no round_timeout
Initializing FedTabDiff model
/usr/lib/python3.10/subprocess.py:1796: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
  self.pid = _posixsubprocess.fork_exec(
2024-07-15 09:26:42,335 INFO worker.py:1752 -- Started a local Ray instance.
INFO :      Flower VCE: Ray initialized with resources: {'memory': 8018571264.0, 'node:172.28.0.12': 1.0, 'GPU': 1.0, 'node:__internal_head__': 1.0, 'object_store_memory': 4009285632.0, 'CPU': 2.0, 'accelerator_type:T4': 1.0}
INFO :      Optimize your simulation with Flower VCE: https://flower.ai/docs/framework/how-to-run-simulations.html
INFO :      No `client_resources` specified. Using minimal resources for clients.
INFO :      Flower VCE: Resources for each Virtual Client: {'num_cpus': 1, 'num_gpus': 0.0}
INFO :      Flower VCE: Creating VirtualClientEngineActorPool with 2 actors
INFO :      [INIT]
INFO :      Using initial global parameters provided by strategy
INFO :      Evaluating initial global parameters
INFO :      
INFO :      [ROUND 1]
INFO :      configure_fit: strategy sampled 5 clients (out of 5)
INFO :      aggregate_fit: received 0 results and 5 failures
INFO :      configure_evaluate: strategy sampled 5 clients (out of 5)
INFO :      aggregate_evaluate: received 0 results and 5 failures
INFO :      
[... rounds 2–9 omitted: each round shows the same output, configure_fit/configure_evaluate sample 5 clients and aggregate_fit/aggregate_evaluate receive 0 results and 5 failures ...]
INFO :      
INFO :      [ROUND 10]
INFO :      configure_fit: strategy sampled 5 clients (out of 5)
INFO :      aggregate_fit: received 0 results and 5 failures
Initializing FedTabDiff model
[Server evaluation, server round: 10
Loading eval set
SAMPLING STEP:  499: : 0it [00:00, ?it/s]
ERROR :     Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat1 in method wrapper_CUDA_addmm)
ERROR :     Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/flwr/simulation/app.py", line 323, in start_simulation
    hist = run_fl(
  File "/usr/local/lib/python3.10/dist-packages/flwr/server/server.py", line 490, in run_fl
    hist, elapsed_time = server.fit(
  File "/usr/local/lib/python3.10/dist-packages/flwr/server/server.py", line 126, in fit
    res_cen = self.strategy.evaluate(current_round, parameters=self.parameters)
  File "/usr/local/lib/python3.10/dist-packages/flwr/server/strategy/fedavg.py", line 167, in evaluate
    eval_res = self.evaluate_fn(server_round, parameters_ndarrays, {})
  File "/content/cloned-repo/FlowerServer.py", line 49, in evaluate_server
    generated_samples = generate_samples(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/content/cloned-repo/fedtabdiff_modules.py", line 171, in generate_samples
    model_out = synthesizer(z_norm.float(), t, label)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/content/cloned-repo/MLPSynthesizer.py", line 148, in forward
    emb = self.time_embed(timestep_embedding(timesteps, self.dim_t))
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/container.py", line 217, in forward
    input = module(input)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py", line 116, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat1 in method wrapper_CUDA_addmm)

ERROR :     Your simulation crashed :(. This could be because of several reasons. The most common are: 
     > Sometimes, issues in the simulation code itself can cause crashes. It's always a good idea to double-check your code for any potential bugs or inconsistencies that might be contributing to the problem. For example: 
         - You might be using a class attribute in your clients that hasn't been defined.
         - There could be an incorrect method call to a 3rd party library (e.g., PyTorch).
         - The return types of methods in your clients/strategies might be incorrect.
     > Your system couldn't fit a single VirtualClient: try lowering `client_resources`.
     > All the actors in your pool crashed. This could be because: 
         - You clients hit an out-of-memory (OOM) error and actors couldn't recover from it. Try launching your simulation with more generous `client_resources` setting (i.e. it seems {'num_cpus': 1, 'num_gpus': 0.0} is not enough for your run). Use fewer concurrent actors. 
         - You were running a multi-node simulation and all worker nodes disconnected. The head node might still be alive but cannot accommodate any actor with resources: {'num_cpus': 1, 'num_gpus': 0.0}.
Take a look at the Flower simulation examples for guidance <https://flower.ai/docs/framework/how-to-run-simulations.html>.
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/flwr/simulation/app.py in start_simulation(client_fn, num_clients, clients_ids, client_resources, server, config, strategy, client_manager, ray_init_args, keep_initialised, actor_type, actor_kwargs, actor_scheduling)
    322         # Start training
--> 323         hist = run_fl(
    324             server=initialized_server,

... 16 frames omitted ...
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat1 in method wrapper_CUDA_addmm)

The above exception was the direct cause of the following exception:

RuntimeError                              Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/flwr/simulation/app.py in start_simulation(client_fn, num_clients, clients_ids, client_resources, server, config, strategy, client_manager, ray_init_args, keep_initialised, actor_type, actor_kwargs, actor_scheduling)
    357             client_resources,
    358         )
--> 359         raise RuntimeError("Simulation crashed.") from ex
    360 
    361     finally:

RuntimeError: Simulation crashed.
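
From the traceback, the failing op is F.linear inside self.time_embed(timestep_embedding(timesteps, self.dim_t)): the synthesizer's weights sit on cuda:0 during server-side evaluation while the timestep tensor built in generate_samples is still on the CPU. A minimal sketch of a possible fix (the helper name and the surrounding generate_samples code are my assumption; only the call names come from the traceback) is to move every input onto the model's device before the forward pass:

import torch
from typing import Optional, Tuple

def to_model_device(synthesizer: torch.nn.Module,
                    z_norm: torch.Tensor,
                    t: torch.Tensor,
                    label: Optional[torch.Tensor]
                    ) -> Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]]:
    # Place the sampling inputs on the same device as the synthesizer's weights.
    device = next(synthesizer.parameters()).device
    return (z_norm.to(device),
            t.to(device),
            label.to(device) if label is not None else None)

# Inside generate_samples (fedtabdiff_modules.py, around line 171), before the
# forward pass; the variable names are taken from the traceback above:
#   z_norm, t, label = to_model_device(synthesizer, z_norm, t, label)
#   model_out = synthesizer(z_norm.float(), t, label)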
fwtsinghua commented 2 months ago

It appears there's still a bug in the code.

The clients are not executing the model training process via the fit function. I've checked the implementation and configuration, but the issue persists.

client_fn = get_client_fn(  # Return a function to construct a client.
    train_loaders=train_loaders_client,  # indexed by int(cid) = 0, 1, 2, 3, 4
    test_loaders=test_loaders_client,
    exp_params=exp_params
)

Maybe the code above isn't being called, so no client is constructed during the simulation.
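
Each round in the log also reports "received 0 results and 5 failures", and the virtual clients were created with {'num_cpus': 1, 'num_gpus': 0.0}, so the client-side fit is probably hitting the same cuda/cpu mismatch (the Ray actors see no GPU while the model is moved to cuda). One thing worth trying, as a sketch only and assuming main launches the simulation roughly like this, is to pass an explicit client_resources with a GPU fraction:

import flwr as fl

# Sketch under assumptions: client_fn (defined above) and strategy are the
# repo's own objects; the keyword arguments are the public
# flwr.simulation.start_simulation parameters visible in the traceback signature.
history = fl.simulation.start_simulation(
    client_fn=client_fn,
    num_clients=5,
    config=fl.server.ServerConfig(num_rounds=100),
    strategy=strategy,
    # Give every virtual client a slice of the Colab T4; with the default
    # {'num_gpus': 0.0} the actors have no visible GPU, so any .to('cuda')
    # inside the client fails.
    client_resources={"num_cpus": 1, "num_gpus": 0.2},
)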

2045ga commented 2 months ago

Same question. I can't run it on Colab either.