jjihwan / SV3D-fine-tune

Fine-tuning code for SV3D
MIT License

Model Loading Issues and Data Format Questions #8

Open MELANCHOLY828 opened 1 week ago

MELANCHOLY828 commented 1 week ago

[screenshot] Hello, I would like to ask about an issue I encountered while running train_sv3d.py for pre-training; it seems to be related to network problems. Could you please advise on possible solutions?

Additionally, regarding the training data, what should the data format be? Would it be possible to provide a sample directory structure for reference? Thank you very much!

jjihwan commented 1 week ago
  1. It seems your error is fundamentally a RuntimeError raised here. Ignore the request error and find what caused the RuntimeError in your log.

  2. The data format should be .pt files. Please refer to "Notes" in the README file!

MELANCHOLY828 commented 1 week ago
> 1. It seems your error is fundamentally a RuntimeError raised here. Ignore the request error and find what caused the RuntimeError in your log.
> 2. The data format should be .pt files. Please refer to "Notes" in the README file!

[screenshot] Hello, I have provided a screenshot of the part of the code that reads the video latents, as well as the corresponding file path format. Should the code be reading orbit_frame.pt instead of video_latent.pt? I ask because when I run `for k in data.datasets:`, it doesn't execute successfully. [screenshot]

jjihwan commented 1 week ago

You're right. The video_latent.pt files should be renamed to orbit_frame.pt. Sorry for my mistake; I will update the README now. Thanks!
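For anyone hitting the same mismatch, a minimal sanity check that a renamed latent file loads as expected (the `dataset/000-000/` path is illustrative, and the shape follows the figures reported later in this thread):

```python
import torch

# Illustrative path; each datapoint folder (e.g. 000-000) holds one
# orbit_frame.pt with the latents for all 21 views.
latent = torch.load("dataset/000-000/orbit_frame.pt", map_location="cpu")

# With 576x576 inputs, the shape reported in this thread is [21, 4, 72, 72].
print(latent.shape, latent.dtype)
```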

ys830 commented 1 week ago

> You're right. The video_latent.pt files should be renamed to orbit_frame.pt. Sorry for my mistake; I will update the README now. Thanks!

Hello, we are currently trying to reproduce your fine-tuning work on the Objaverse dataset. The input image size is 576x576, and the dimension of video_latent.pt is [21, 4, 72, 72]. All other configurations are the same as those provided in your code. We used two A100 and two A6000 GPUs, but we still encountered OOM errors. The README mentions that you ran it on a single A6000. Could you please provide some troubleshooting suggestions? [screenshot]

jjihwan commented 1 week ago

It seems really strange. What's your batch size and dtype?

Also, try running on a single GPU by setting CUDA_VISIBLE_DEVICES.
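A minimal sketch of that single-GPU suggestion; the key point is that the variable must be set before CUDA is initialized:

```python
import os

# Expose only the first GPU to this process; this must happen before
# torch initializes CUDA (safest: before importing torch at all).
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch

print(torch.cuda.device_count())  # should now report 1
```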

MELANCHOLY828 commented 1 week ago

> It seems really strange. What's your batch size and dtype?
>
> Also, try running on a single GPU by setting CUDA_VISIBLE_DEVICES.

Hello! Thank you very much for your response. I have a question regarding fine-tuning: when fine-tuning, does the latent input to the network always include latents from all views (21 in total) each time? I’m also not quite sure why there are multiple directories under the input path, such as 000-000, 000-001. Are the contents inside these directories completely identical? Looking forward to your reply.

jjihwan commented 1 week ago

Each latent should contain all 21 views, since SV3D generates 21 frames at once. This differs from other novel-view models such as Zero123, which generate one frame at a time.

So each folder represents one datapoint (21 frames). The .pt file is the ground truth and the .png file is the input frame.
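Putting the comments above together, a sketch of the expected layout and of loading one datapoint (file names other than orbit_frame.pt are illustrative):

```python
import torch
from PIL import Image

# Assumed layout per the comments above:
# dataset/
#   000-000/            # one datapoint = one 21-frame orbit
#     orbit_frame.pt    #   ground-truth latents for all 21 views
#     input.png         #   conditioning input frame (name illustrative)
#   000-001/
#   ...

root = "dataset/000-000"
gt_latents = torch.load(f"{root}/orbit_frame.pt", map_location="cpu")  # [21, 4, 72, 72]
cond_image = Image.open(f"{root}/input.png")
```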

ys830 commented 1 week ago

> It seems really strange. What's your batch size and dtype?
>
> Also, try running on a single GPU by setting CUDA_VISIBLE_DEVICES.

We used the same sv3d_p.yaml as you, with batch_size = 1, and the dtype of the input x is float32.

jjihwan commented 1 week ago

> It seems really strange. What's your batch size and dtype?
>
> Also, try running on a single GPU by setting CUDA_VISIBLE_DEVICES.
>
> We used the same sv3d_p.yaml as you, with batch_size = 1, and the dtype of the input x is float32.

Did you try fp16 or bf16? Also, try installing accelerate!

ys830 commented 1 week ago

> Did you try fp16 or bf16? Also, try installing accelerate!

Thank you very much for your prompt response! I used the following code to load the VAE model: `vae = AutoencoderKLTemporalDecoder.from_pretrained("/data/yisi/mywork/SV3D-fine-tune/cheeckpoints/stable-video-diffusion-img2vid-xt/vae").to("cuda")`

Here are the files in my directory: [screenshot] This is the config.json:

[screenshot] I'm not quite sure if it is using fp16. Could you take a look and see if there's anything wrong here?
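One way to answer this directly is to inspect the loaded parameters; `from_pretrained` defaults to float32 unless `torch_dtype` is passed, regardless of what config.json lists. A sketch reusing the path above:

```python
import torch
from diffusers import AutoencoderKLTemporalDecoder

path = "/data/yisi/mywork/SV3D-fine-tune/cheeckpoints/stable-video-diffusion-img2vid-xt/vae"

vae = AutoencoderKLTemporalDecoder.from_pretrained(path).to("cuda")
print(next(vae.parameters()).dtype)  # torch.float32 unless told otherwise

# To actually run the VAE in half precision, request it explicitly:
vae_fp16 = AutoencoderKLTemporalDecoder.from_pretrained(
    path, torch_dtype=torch.float16
).to("cuda")
print(next(vae_fp16.parameters()).dtype)  # torch.float16
```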

jjihwan commented 1 week ago

Ah, I meant that I'm wondering whether you used FP16 or FP32 during training, not for generating the latents.

Did you detach the gradients before saving the latents after decoding? I think that's a possible cause.
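A sketch of what gradient-free latent generation looks like; the model path and the random `frames` tensor are stand-ins, and `encode(...).latent_dist` is the standard diffusers VAE interface:

```python
import torch
from diffusers import AutoencoderKLTemporalDecoder

vae = AutoencoderKLTemporalDecoder.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt", subfolder="vae"
).to("cuda")
frames = torch.randn(21, 3, 576, 576, device="cuda")  # stand-in for real views

with torch.no_grad():  # no autograd graph is built at all
    latents = vae.encode(frames).latent_dist.sample()

# .detach() is a no-op here thanks to no_grad, but is cheap insurance;
# .cpu() avoids pickling GPU storage into the .pt file.
torch.save(latents.detach().cpu(), "orbit_frame.pt")
```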

ys830 commented 1 week ago

> generating

During training, I used precision: '16-mixed', the same as sv3d_p.yaml.

I also used detach before saving the latents, like this: [screenshot]

jjihwan commented 1 week ago

Then I have no more ideas about your results :( Please let me know if you solve the problem.

ys830 commented 1 week ago

> Then I have no more ideas about your results :( Please let me know if you solve the problem.

I've been following your workflow except for the data processing part. Could you share some of your processed latent data and images? It would really help me troubleshoot the issue.

My email : si.yi@smail.nju.edu.cn. Looking forward to hearing from you!

jjihwan commented 1 week ago

I really wish I could, but the server I did this project on is already dead, so I can't find the latents. Sorry :(

ys830 commented 1 week ago

> I really wish I could, but the server I did this project on is already dead, so I can't find the latents. Sorry :(

I'm really sorry to hear that. ≧ ﹏ ≦ During debugging, we found that although we set precision: '16-mixed' in the .yaml file, the weights in the network remained in float32, because both the input latent.pt and the pre-trained sv3d_p.safetensors have a dtype of float32. Could this be the reason for the OOM? Did you use float16 data during training?

jjihwan commented 1 week ago

No, I used fp16 mixed-precision training, so it is natural that the model and input are processed in fp32. But I recommend trying pure fp16 rather than fp16-mixed for debugging.
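For reference, the distinction in PyTorch Lightning 2.x terms (these precision strings assume Lightning >= 2.0):

```python
import pytorch_lightning as pl

# "16-mixed": fp32 master weights + autocast to fp16 for selected ops,
# which is why the parameters still read as float32 during training.
trainer_mixed = pl.Trainer(precision="16-mixed", devices=1)

# "16-true": the model itself is cast to fp16 -- lighter on memory and
# what the pure-fp16 debugging run suggested above amounts to.
trainer_true = pl.Trainer(precision="16-true", devices=1)
```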

jjihwan commented 1 week ago

Also, since you have two or more GPUs, try using DeepSpeed for model parallelization! :)
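A sketch of that suggestion in Lightning terms (the strategy string is from Lightning 2.x, and DeepSpeed must be installed separately via `pip install deepspeed`):

```python
import pytorch_lightning as pl

# ZeRO stage 2 shards optimizer state and gradients across the GPUs,
# which directly attacks the OOM discussed above.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=2,
    strategy="deepspeed_stage_2",
    precision="16-mixed",
)
```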