threestudio-project / threestudio

A unified framework for 3D content generation.

Memory for 512 ProlificDreamer #108

Closed. ayaanzhaque closed this issue 1 year ago.

ayaanzhaque commented 1 year ago

Do you all have any details on how much memory is required to run the 512 resolution ProlificDreamer? I keep getting OOM errors and my GPU has 48GB of memory.

UMGrain commented 1 year ago

> Do you all have any details on how much memory is required to run the 512 resolution ProlificDreamer? I keep getting OOM errors and my GPU has 48GB of memory.

+1

bennyguo commented 1 year ago

Hi guys, this is a known issue; sorry I didn't point it out in the README. I suspect it's because we use far more ray samples than the original paper, which easily causes OOM early in training. Could you try a lower num_ray_samples (default 512) and see if you have any luck? This is only a temporary workaround, and I'll try to come up with a better solution later.
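For illustration, the override can be passed on the command line along these lines; this is only a sketch, and the exact config key (it may be named num_samples_per_ray under the renderer section of the YAML) and the prompt are placeholders rather than the exact settings used here:

```bash
# Hedged sketch: lower the per-ray sample count via a CLI override.
# The exact key may differ (check the renderer block of configs/prolificdreamer.yaml);
# the prompt is just a placeholder.
python launch.py --config configs/prolificdreamer.yaml --train --gpu 0 \
    system.prompt_processor.prompt="a delicious hamburger" \
    system.renderer.num_samples_per_ray=256
```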

ayaanzhaque commented 1 year ago

Sounds good, will try it and let you know.

ayaanzhaque commented 1 year ago

Reducing the ray samples (to 256) seems to work, but training is incredibly slow, which makes sense. I'll share some results once it has run for a while to see how it looks. I have multiple GPUs; is there an efficient way to parallelize across them?

bennyguo commented 1 year ago

@ayaanzhaque Great to hear that! I'm working on a progressive training strategy that gradually increases the rendering resolution, which should enable 512x512 training under 24GB of VRAM. If you have multiple GPUs, you could simply set --gpu 0,1 and train for fewer iterations.
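A multi-GPU launch would then look roughly like this; it's a sketch, and trainer.max_steps and the prompt are placeholder overrides standing in for "train for fewer iterations", so adjust them to the actual config:

```bash
# Hedged sketch: data-parallel training on two GPUs with a reduced step budget.
# trainer.max_steps and the prompt are placeholder overrides.
python launch.py --config configs/prolificdreamer.yaml --train --gpu 0,1 \
    system.prompt_processor.prompt="a delicious hamburger" \
    trainer.max_steps=15000
```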

ayaanzhaque commented 1 year ago

@bennyguo Interesting solution! Curious to see how that works.

In terms of GPUs, will this speed up the iters/sec?

Also, just curious, have you thought about using DeepFloyd for ProlificDreamer? I assume it would be quite slow, and maybe there aren't any LoRA implementations of DeepFloyd out yet; I'm not sure.

bennyguo commented 1 year ago

Setting --gpu 0,1 will make each GPU process a batch, so it will not speed up iters/sec. There's currently no easy way to split a single batch across multiple GPUs (i.e., have each GPU render only part of the image). Although iters/sec doesn't get any larger, you can train for fewer iterations, which is also a speedup :)

I do intend to try ProlificDreamer on DeepFloyd but haven't got the time yet. It shouldn't be much slower than ProlificDreamer+SD, and yeah, I don't think there are any LoRA implementations of DeepFloyd (although it should be easy to do in diffusers).

ayaanzhaque commented 1 year ago

Ah ok, sounds good, thanks for the info.

As for writing a LoRA implementation, can you point me to where you found the LoRA implementation for SD? I'm interested in writing one for DeepFloyd, but I'm pretty new to this and not sure how to get started.

bennyguo commented 1 year ago

In my VSD implementation, I directly take the LoRA implementation here. I think it works by iterating over all layers and adding LoRA modules as AttnProcsLayers. Looking forward to your DeepFloyd-LoRA!
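Roughly, the pattern (in the diffusers releases of that time) looks like the sketch below; the model ID is only an example, and newer diffusers versions have since reworked the LoRAAttnProcessor/AttnProcsLayers API, so treat this as a sketch of the idea rather than the exact threestudio code:

```python
# Hedged sketch of the AttnProcsLayers pattern from older diffusers releases
# (the API has since changed); the model ID is only an example.
import torch
from diffusers import UNet2DConditionModel
from diffusers.loaders import AttnProcsLayers
from diffusers.models.attention_processor import LoRAAttnProcessor

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)

lora_attn_procs = {}
for name in unet.attn_processors.keys():
    # Self-attention (attn1) has no cross-attention dim; cross-attention (attn2) does.
    cross_attention_dim = (
        None if name.endswith("attn1.processor") else unet.config.cross_attention_dim
    )
    if name.startswith("mid_block"):
        hidden_size = unet.config.block_out_channels[-1]
    elif name.startswith("up_blocks"):
        block_id = int(name[len("up_blocks.")])
        hidden_size = list(reversed(unet.config.block_out_channels))[block_id]
    else:  # down_blocks
        block_id = int(name[len("down_blocks.")])
        hidden_size = unet.config.block_out_channels[block_id]
    lora_attn_procs[name] = LoRAAttnProcessor(
        hidden_size=hidden_size, cross_attention_dim=cross_attention_dim
    )

# Install the LoRA processors and expose only their parameters to the optimizer.
unet.set_attn_processor(lora_attn_procs)
lora_layers = AttnProcsLayers(unet.attn_processors)
optimizer = torch.optim.AdamW(lora_layers.parameters(), lr=1e-4)
```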

bennyguo commented 1 year ago

@ayaanzhaque @UMGrain Hey guys, I've opened a PR to enable progressively increasing rendering resolution: we train on 64x64 renderings for the first 1,000 iterations, then switch to 512x512 renderings for the remaining iterations. This should drastically cut VRAM usage, since the empty space is pruned in the early steps. Could you try this branch and see if you can successfully train on a 40GB GPU? Note that you DON'T need to specify data.width=512 and data.height=512 on the command line, as the defaults are now data.width=[64,512] and data.height=[64,512] for this progressive training strategy.
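If it helps, launching on that branch should look roughly like the usual command; this is a sketch with a placeholder prompt, and the branch name is the one referenced later in this thread:

```bash
# Hedged sketch: data.width / data.height already default to [64, 512] on the
# progressive-resolution branch, so no resolution override is needed.
git checkout progressive-resolution
python launch.py --config configs/prolificdreamer.yaml --train --gpu 0 \
    system.prompt_processor.prompt="a delicious hamburger"
```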

ayaanzhaque commented 1 year ago

I tested it, here's what I got:

Training takes ~12 GB of memory at the beginning, but after the first 1k iterations it unfortunately ran out of memory.

Question: Is the beginning section still using VSD? Just at 64x64?

bennyguo commented 1 year ago

Yes, it's 64x64 VSD. Are you testing on a 40GB GPU? Could you increase the number 1000 in data.resolution_milestones to 5000 and try again?
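If editing the YAML is inconvenient, the milestone can presumably also be overridden on the command line; this is a sketch that assumes the option is exposed as a list under data.resolution_milestones:

```bash
# Hedged sketch: push the 64x64 -> 512x512 switch back to step 5000.
python launch.py --config configs/prolificdreamer.yaml --train --gpu 0 \
    system.prompt_processor.prompt="a delicious hamburger" \
    data.resolution_milestones=[5000]
```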

ayaanzhaque commented 1 year ago

Yes, I'm testing on a 50 GB GPU. I'll try that!

Random question: where does all the saving occur in the code? I want to try saving out my LoRA model for some exploration, but I'm unsure where to perform the saving in the code.

bennyguo commented 1 year ago

The saving happens in the train/validation/test_step functions in the system. If you want to save anything to the trial directory, you can only do it in these hooks.
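For example, dumping the LoRA weights could look something like the sketch below; the attribute names (self.guidance.lora_layers, self.get_save_path, self.true_global_step) are assumptions about the system and guidance classes, so check them against the actual code:

```python
# Hedged sketch: saving LoRA weights to the trial directory from a system hook.
# self.guidance.lora_layers, self.get_save_path and self.true_global_step are
# assumed names; check the actual ProlificDreamer system / guidance classes.
import torch


class ProlificDreamerSystemSketch:  # illustrative stand-in for the real system class
    def validation_step(self, batch, batch_idx):
        # ... keep the existing rendering / image-saving logic here ...
        lora_path = self.get_save_path(f"lora_it{self.true_global_step}.ckpt")
        torch.save(self.guidance.lora_layers.state_dict(), lora_path)
```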

ayaanzhaque commented 1 year ago

How much memory does the standard ProlificDreamer currently in the main branch take? I'm training it and it's taking 40GB of memory, which seems quite high...

ayaanzhaque commented 1 year ago

Does it use 256x256 by default or 64x64?

bennyguo commented 1 year ago

It uses 256x256 by default. It's the same problem as with 512x512: if we could get the progressive strategy working, we should be able to train with much less VRAM.

bennyguo commented 1 year ago

@ayaanzhaque I improved the progressive training by (1) adding pruning by alpha threshold in the renderer and (2) increasing the milestone step to 5000. Could you help me check whether the latest code on the progressive-resolution branch works under 48GB of VRAM?

ayaanzhaque commented 1 year ago

Will test it in a bit!

ayaanzhaque commented 1 year ago

I'm training it now, will let you know how it goes.

ayaanzhaque commented 1 year ago

The first stage takes ~33GB of memory, and the second stage also takes around ~33GB. I'll let it keep training, but do you know if this produces better results overall? Otherwise, it looks like it works fine.

ayaanzhaque commented 1 year ago

Question: is there a way to load a DreamFusion model into the ProlificDreamer pipeline?

ayaanzhaque commented 1 year ago

Also, by your estimations, how long does just phase 1 (25k iters) of the standard ProlificDreamer take at the default resolution?

bennyguo commented 1 year ago

> The first stage takes ~33GB of memory, and the second stage also takes around ~33GB.

Great to hear that the first stage now works under 48GB of VRAM! However, the second stage should not take that much VRAM 😂 it only takes ~7GB in my experiments.

> Question: is there a way to load a DreamFusion model into the ProlificDreamer pipeline?

Just set system.geometry_convert_from=path/to/your/dreamfusion/checkpoint system.geometry_convert_inherit_texture=false. This way, you can load a DreamFusion model into any of the three stages of the ProlificDreamer pipeline. Note that you may need to change data.camera_distance_range and system.model.radius, as the DreamFusion and ProlificDreamer pipelines use different camera configurations.
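Put together, a launch that warm-starts from a DreamFusion run might look like this; it's a sketch in which the checkpoint path and prompt are placeholders, and the camera overrides mentioned above may still be needed:

```bash
# Hedged sketch: initialize a ProlificDreamer stage from a DreamFusion checkpoint.
# The checkpoint path and prompt are placeholders; data.camera_distance_range and
# system.model.radius may also need adjusting, as noted above.
python launch.py --config configs/prolificdreamer.yaml --train --gpu 0 \
    system.prompt_processor.prompt="a delicious hamburger" \
    system.geometry_convert_from=path/to/your/dreamfusion/checkpoint \
    system.geometry_convert_inherit_texture=false
```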

> Also, by your estimations, how long does just phase 1 (25k iters) of the standard ProlificDreamer take at the default resolution?

It takes ~4.5 hours on a single RTX 3090.

ayaanzhaque commented 1 year ago

Ah sorry, by second stage I meant the second part of stage 1, where it trains at 512x512 resolution. The mesh geometry stage definitely uses less memory.

Thanks for the details about the DreamFusion loading.