yankeesong opened this issue 1 year ago

I am doing some ablation study, trying to use VSD guidance in dreamfusion-sd. However, I get a bunch of "not on the same device" errors. These errors do not occur if I use prolificdreamer instead.

Hi, @yankeesong. Could you please provide further information about this error, such as where it occurs, and details about your runtime environment?
Hi, @DSaurus,
I am running on a Linux ppc64le cluster, in my own conda environment with python=3.10, cuda=11.4, pytorch=1.12.1, with the following (modified) config:
name: "dreamfusion-sd-test"
tag: "${rmspace:${system.prompt_processor.prompt},_}"
exp_root_dir: "outputs"
seed: 0
data_type: "random-camera-datamodule"
data:
batch_size: 1
width: 64
height: 64
camera_distance_range: [1.5, 2.0]
fovy_range: [40, 70]
elevation_range: [-10, 90]
#light_sample_strategy: "dreamfusion"
eval_camera_distance: 2.0
eval_fovy_deg: 70.
system_type: "dreamfusion-system"
system:
geometry_type: "implicit-volume"
geometry:
radius: 2.0
normal_type: "analytic"
# the density initialization proposed in the DreamFusion paper
# does not work very well
# density_bias: "blob_dreamfusion"
# density_activation: exp
# density_blob_scale: 5.
# density_blob_std: 0.2
# use Magic3D density initialization instead
density_bias: "blob_magic3d"
density_activation: softplus
density_blob_scale: 10.
density_blob_std: 0.5
pos_encoding_config:
otype: HashGrid
n_levels: 16
n_features_per_level: 2
log2_hashmap_size: 19
base_resolution: 16
per_level_scale: 1.447269237440378 # max resolution 4096
material_type: "no-material"
material:
n_output_dims: 3
color_activation: sigmoid
background_type: "neural-environment-map-background"
background:
color_activation: sigmoid
renderer_type: "nerf-volume-renderer"
renderer:
radius: ${system.geometry.radius}
num_samples_per_ray: 512
prompt_processor_type: "stable-diffusion-prompt-processor"
prompt_processor:
pretrained_model_name_or_path: "stabilityai/stable-diffusion-2-1-base"
prompt: ???
guidance_type: "stable-diffusion-vsd-guidance"
guidance:
pretrained_model_name_or_path: "stabilityai/stable-diffusion-2-1-base"
pretrained_model_name_or_path_lora: "stabilityai/stable-diffusion-2-1"
guidance_scale: 7.5
min_step_percent: 0.02
max_step_percent: 0.98
loggers:
wandb:
enable: false
project: 'threestudio'
name: None
loss:
lambda_vsd: 1.
lambda_lora: 1.
lambda_orient: [0, 10., 1000., 5000]
lambda_sparsity: 1.
lambda_opaque: 0.
optimizer:
name: Adam
args:
lr: 0.01
betas: [0.9, 0.99]
eps: 1.e-15
params:
geometry.encoding:
lr: 0.01
geometry.density_network:
lr: 0.001
geometry.feature_network:
lr: 0.001
trainer:
max_steps: 10000
log_every_n_steps: 1
num_sanity_val_steps: 0
val_check_interval: 200
enable_progress_bar: true
precision: 16-mixed
checkpoint:
save_last: true # save at each validation time
save_top_k: -1
every_n_train_steps: ${trainer.max_steps}
The first error happens here:
self.camera_embedding = ToWeightsDType(
    TimestepEmbedding(16, 1280).to(self.device), self.weights_dtype
)
but it can be resolved by moving the module to self.device (which I have already done, as shown above).
The next error happens here:
noise_pred_est = self.forward_unet(
    self.unet_lora,
    latent_model_input,
    torch.cat([t] * 2),
    encoder_hidden_states=text_embeddings,
    class_labels=torch.cat(
        [
            camera_condition.view(B, -1),
            torch.zeros_like(camera_condition.view(B, -1)),
        ],
        dim=0,
    ),
    cross_attention_kwargs={"scale": 1.0},
)
with the following (truncated) error message:
│ /nobackup/users/yankeson/miniconda3/envs/DL/lib/python3.10/site-packages/torch/nn/modules/linear │
│ .py:114 in forward │
│ │
│ 111 │ │ │ init.uniform_(self.bias, -bound, bound) │
│ 112 │ │
│ 113 │ def forward(self, input: Tensor) -> Tensor: │
│ ❱ 114 │ │ return F.linear(input, self.weight, self.bias) │
│ 115 │ │
│ 116 │ def extra_repr(self) -> str: │
│ 117 │ │ return 'in_features={}, out_features={}, bias={}'.format( │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method
wrapper_mm)
I did check that all the arguments passed to forward_unet are indeed on cuda, so I couldn't figure out why the error still occurs.
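In case it helps, the check I did was roughly this (illustrative only; variable names are taken from the snippet above):

```python
# Illustrative sanity check, placed right before the forward_unet call above:
# print the device of every tensor that goes into the UNet.
print(latent_model_input.device)            # cuda:0
print(t.device)                             # cuda:0
print(text_embeddings.device)               # cuda:0
print(camera_condition.view(B, -1).device)  # cuda:0
```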
If this doesn't happen in other environments, I don't want to bother you too much. Thanks anyway!
Does this also happen with the default configurations?
No, it works with the default configurations of both dreamfusion and prolificdreamer. I was just trying to test whether VSD can be conveniently applied to other systems.
Have you checked t? I remember I had this problem once too.
Sorry for the late reply. I ran your config and also encountered the problem. I dug into it and found that it happens because both camera_embedding and lora_attn_procs are initialized on the CPU. In addition, dreamfusion-system initializes the guidance and prompt_processor in the on_fit_start hook rather than in the system's configure method, so PyTorch Lightning does not move all of the system's modules onto the GPU, and you get the "not on the same device" error.
A quick fix is to move the construction of guidance and prompt_processor into the configure method in dreamfusion.py, as sketched below. We will consider moving them, or adding some manual device allocation, in a future update.
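For concreteness, the change looks roughly like this (a sketch against the current dreamfusion.py; the threestudio.find(...) construction pattern is what on_fit_start currently uses, but the exact surrounding code may differ in your checkout):

```python
# threestudio/systems/dreamfusion.py (inside the dreamfusion-system class) -- sketch of the quick fix

def configure(self):
    # create geometry, material, background, renderer
    super().configure()
    # Quick fix: build prompt_processor and guidance here, so that
    # pytorch-lightning registers them as submodules of the system and
    # moves them onto the GPU together with everything else.
    self.prompt_processor = threestudio.find(self.cfg.prompt_processor_type)(
        self.cfg.prompt_processor
    )
    self.guidance = threestudio.find(self.cfg.guidance_type)(self.cfg.guidance)

def on_fit_start(self) -> None:
    super().on_fit_start()
    # the two constructions that used to live here are now in configure()
```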
Also, your config still cannot run as-is, because applying the orient loss requires a material that provides normals. So you may want to set lambda_orient to 0 or switch to another material, e.g. the default one for dreamfusion, diffuse-with-point-light-material. Hope this helps! Feel free to post here if you run into other bugs with your custom config.
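In config terms, either of these tweaks should work (a sketch; if you switch materials, copy the material's own sub-options from the default dreamfusion config):

```yaml
# Option 1: keep "no-material" and disable the orient loss
system:
  loss:
    lambda_orient: 0.

# Option 2: switch to the dreamfusion default material, which provides normals
system:
  material_type: "diffuse-with-point-light-material"
```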
Got it. Thanks so much for the response! I'll leave this open in case you want to implement the proposed change.