yankeesong opened this issue 1 year ago

I am doing some ablation study, trying to use VSD guidance in dreamfusion-sd. However, I get a bunch of "not on the same device" errors. These errors do not occur if I use prolificdreamer instead.

Hi, @yankeesong. Could you please provide further information about this error, such as where it occurs, and details about your runtime environment?
Hi, @DSaurus,
I am running on a Linux ppc64le cluster, in my own conda environment with python=3.10, cuda=11.4, pytorch=1.12.1, with the following (modified) config:
name: "dreamfusion-sd-test"
tag: "${rmspace:${system.prompt_processor.prompt},_}"
exp_root_dir: "outputs"
seed: 0
data_type: "random-camera-datamodule"
data:
batch_size: 1
width: 64
height: 64
camera_distance_range: [1.5, 2.0]
fovy_range: [40, 70]
elevation_range: [-10, 90]
#light_sample_strategy: "dreamfusion"
eval_camera_distance: 2.0
eval_fovy_deg: 70.
system_type: "dreamfusion-system"
system:
geometry_type: "implicit-volume"
geometry:
radius: 2.0
normal_type: "analytic"
# the density initialization proposed in the DreamFusion paper
# does not work very well
# density_bias: "blob_dreamfusion"
# density_activation: exp
# density_blob_scale: 5.
# density_blob_std: 0.2
# use Magic3D density initialization instead
density_bias: "blob_magic3d"
density_activation: softplus
density_blob_scale: 10.
density_blob_std: 0.5
pos_encoding_config:
otype: HashGrid
n_levels: 16
n_features_per_level: 2
log2_hashmap_size: 19
base_resolution: 16
per_level_scale: 1.447269237440378 # max resolution 4096
material_type: "no-material"
material:
n_output_dims: 3
color_activation: sigmoid
background_type: "neural-environment-map-background"
background:
color_activation: sigmoid
renderer_type: "nerf-volume-renderer"
renderer:
radius: ${system.geometry.radius}
num_samples_per_ray: 512
prompt_processor_type: "stable-diffusion-prompt-processor"
prompt_processor:
pretrained_model_name_or_path: "stabilityai/stable-diffusion-2-1-base"
prompt: ???
guidance_type: "stable-diffusion-vsd-guidance"
guidance:
pretrained_model_name_or_path: "stabilityai/stable-diffusion-2-1-base"
pretrained_model_name_or_path_lora: "stabilityai/stable-diffusion-2-1"
guidance_scale: 7.5
min_step_percent: 0.02
max_step_percent: 0.98
loggers:
wandb:
enable: false
project: 'threestudio'
name: None
loss:
lambda_vsd: 1.
lambda_lora: 1.
lambda_orient: [0, 10., 1000., 5000]
lambda_sparsity: 1.
lambda_opaque: 0.
optimizer:
name: Adam
args:
lr: 0.01
betas: [0.9, 0.99]
eps: 1.e-15
params:
geometry.encoding:
lr: 0.01
geometry.density_network:
lr: 0.001
geometry.feature_network:
lr: 0.001
trainer:
max_steps: 10000
log_every_n_steps: 1
num_sanity_val_steps: 0
val_check_interval: 200
enable_progress_bar: true
precision: 16-mixed
checkpoint:
save_last: true # save at each validation time
save_top_k: -1
every_n_train_steps: ${trainer.max_steps}
The first error happens here:
self.camera_embedding = ToWeightsDType(
    TimestepEmbedding(16, 1280).to(self.device), self.weights_dtype
)
but it can be resolved by moving the module to self.device (which I have already done, as shown above).
The next error happens here:
noise_pred_est = self.forward_unet(
    self.unet_lora,
    latent_model_input,
    torch.cat([t] * 2),
    encoder_hidden_states=text_embeddings,
    class_labels=torch.cat(
        [
            camera_condition.view(B, -1),
            torch.zeros_like(camera_condition.view(B, -1)),
        ],
        dim=0,
    ),
    cross_attention_kwargs={"scale": 1.0},
)
with the following (truncated) error message:
│ /nobackup/users/yankeson/miniconda3/envs/DL/lib/python3.10/site-packages/torch/nn/modules/linear │
│ .py:114 in forward │
│ │
│ 111 │ │ │ init.uniform_(self.bias, -bound, bound) │
│ 112 │ │
│ 113 │ def forward(self, input: Tensor) -> Tensor: │
│ ❱ 114 │ │ return F.linear(input, self.weight, self.bias) │
│ 115 │ │
│ 116 │ def extra_repr(self) -> str: │
│ 117 │ │ return 'in_features={}, out_features={}, bias={}'.format( │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method
wrapper_mm)
I did check that all the arguments passed to forward_unet are indeed on cuda, so I couldn't figure out why the error still occurs.
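In case it helps, the check I did was roughly this (illustrative only; variable names are taken from the snippet above):

```python
# Illustrative sanity check, placed right before the forward_unet call above:
# print the device of every tensor that goes into the UNet.
print(latent_model_input.device)            # cuda:0
print(t.device)                             # cuda:0
print(text_embeddings.device)               # cuda:0
print(camera_condition.view(B, -1).device)  # cuda:0
```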
If this doesn't happen in other environments, I don't want to bother you too much. Thanks anyway!
Does this also happen with the default configurations?
No, it works with the default configurations of both dreamfusion and prolificdreamer. I was just trying to test whether VSD can be conveniently applied to other systems.
Have you checked t? I remember I had this problem once too.
Sorry for the late reply. I ran your config and also encountered the problem. I dug into it and found that it happens because both camera_embedding and lora_attn_procs are initialized on the CPU. In addition, dreamfusion-system initializes the guidance and prompt_processor in the on_fit_start hook rather than in the system's configure method, so PyTorch Lightning does not move all of the system's modules onto the GPU, and you get the "not on the same device" error.
A quick fix is to move the construction of guidance and prompt_processor into the configure method in dreamfusion.py, as sketched below. We will consider moving them, or adding some manual device allocation, in a future update.
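For concreteness, the change looks roughly like this (a sketch against the current dreamfusion.py; the threestudio.find(...) construction pattern is what on_fit_start currently uses, but the exact surrounding code may differ in your checkout):

```python
# threestudio/systems/dreamfusion.py (inside the dreamfusion-system class) -- sketch of the quick fix

def configure(self):
    # create geometry, material, background, renderer
    super().configure()
    # Quick fix: build prompt_processor and guidance here, so that
    # pytorch-lightning registers them as submodules of the system and
    # moves them onto the GPU together with everything else.
    self.prompt_processor = threestudio.find(self.cfg.prompt_processor_type)(
        self.cfg.prompt_processor
    )
    self.guidance = threestudio.find(self.cfg.guidance_type)(self.cfg.guidance)

def on_fit_start(self) -> None:
    super().on_fit_start()
    # the two constructions that used to live here are now in configure()
```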
Also, your config still cannot run as-is, because applying the orient loss requires a material that provides normals. So you may want to set lambda_orient to 0 or switch to another material, e.g. the default one for dreamfusion, diffuse-with-point-light-material. Hope this helps! Feel free to post here if you run into other bugs with your custom config.
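In config terms, either of these tweaks should work (a sketch; if you switch materials, copy the material's own sub-options from the default dreamfusion config):

```yaml
# Option 1: keep "no-material" and disable the orient loss
system:
  loss:
    lambda_orient: 0.

# Option 2: switch to the dreamfusion default material, which provides normals
system:
  material_type: "diffuse-with-point-light-material"
```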
Got it. Thanks so much for the response! I'll leave this open in case you want to implement the proposed change.