hitachinsk / FGT

[ECCV 2022] Flow-Guided Transformer for Video Inpainting
https://hitachinsk.github.io/publication/2022-10-01-Flow-Guided-Transformer-for-Video-Inpainting
MIT License
299 stars 31 forks

Organization of the YouTube-VOS dataset in the ‘myData’ folder for LAFC Net training? #19

Closed hwpengTristin closed 1 year ago

hwpengTristin commented 1 year ago

Hi, I have organized the YouTube-VOS dataset in the 'myData' folder, as shown below, for LAFC network training:

    $ cd LAFC
    $ python train.py

    FGT/myData
    |- youtubevos_frames
    |  |- bmx-bumps
    |     |- <00000>.jpg
    |     |- <00001>.jpg
    |- youtubevos_flows
       |- backward_flo
       |  |- bmx-bumps
       |     |- <00000>.flo
       |     |- <00001>.flo
       |- forward_flo
          |- bmx-bumps
             |- <00000>.flo
             |- <00001>.flo

But the problem I'm having is that there is no such file or directory: '/myData/youtubevos_flows'. The logs are shown below.

    using GPU 4-4 for training
    using GPU 3-3 for training
    using GPU 1-1 for training
    self.opt[datasetName_train] train_dataset_edge
    self.opt[datasetName_train] train_dataset_edge
    self.opt[datasetName_train] train_dataset_edge
    self.opt[datasetName_train] train_dataset_edge
    self.opt[datasetName_train] train_dataset_edge
    Traceback (most recent call last):
      File "train.py", line 70, in <module>
        main(args_obj)
      File "train.py", line 59, in main
        mp.spawn(main_worker, nprocs=opt['world_size'], args=(opt,))
      File "/raid2/hwpeng/miniconda3/envs/FGT_ENV/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
        return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
      File "/raid2/hwpeng/miniconda3/envs/FGT_ENV/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
        while not context.join():
      File "/raid2/hwpeng/miniconda3/envs/FGT_ENV/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 150, in join
        raise ProcessRaisedException(msg, error_index, failed_process.pid)
    torch.multiprocessing.spawn.ProcessRaisedException:

    -- Process 2 terminated with the following error:
    Traceback (most recent call last):
      File "/raid2/hwpeng/miniconda3/envs/FGT_ENV/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
        fn(i, *args)
      File "/raid2/hwpeng/Project_Coding/FGT/LAFC/train.py", line 30, in main_worker
        trainer = pkg.Network(opt, rank)
      File "/raid2/hwpeng/Project_Coding/FGT/LAFC/trainer.py", line 24, in __init__
        self.dataInfo, self.valInfo, self.trainSet, self.trainSize, self.totalIterations, self.totalEpochs, self.trainLoader, self.trainSampler = self.prepareDataset()
      File "/raid2/hwpeng/Project_Coding/FGT/LAFC/trainer.py", line 129, in prepareDataset
        train_set = create_dataset(dataset, dataInfo, phase, self.opt['datasetName_train'])
      File "/raid2/hwpeng/Project_Coding/FGT/LAFC/data/__init__.py", line 38, in create_dataset
        dataset = dataset_package.VideoBasedDataset(dataset_opt, dataInfo)
      File "/raid2/hwpeng/Project_Coding/FGT/LAFC/data/train_dataset_edge.py", line 29, in __init__
        self.train_list = os.listdir(self.data_path)
    FileNotFoundError: [Errno 2] No such file or directory: '/myData/youtubevos_flows'

hwpengTristin commented 1 year ago

I have organized the YouTube-VOS dataset in the 'myData' folder, as shown below, for LAFC Network training

    FGT/myData
    |- youtubevos_frames
    |  |- bmx-bumps
    |     |- <00000>.jpg
    |     |- <00001>.jpg
    |- youtubevos_flows
       |- backward_flo
       |  |- bmx-bumps
       |     |- <00000>.flo
       |     |- <00001>.flo
       |- forward_flo
          |- bmx-bumps
             |- <00000>.flo
             |- <00001>.flo
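Path errors like the one above surface only after mp.spawn has already launched the workers. A quick way to catch them up front is to verify the layout before training starts. This is a minimal sketch, not code from the repository; `check_layout` is a hypothetical helper, and the directory names simply mirror the tree above:

```python
# Verify the myData layout before launching LAFC training.
# Using an absolute data root avoids errors like the
# FileNotFoundError on '/myData/youtubevos_flows' above.
import os

def check_layout(data_root):
    """Return a list of required subdirectories missing under data_root."""
    required = [
        "youtubevos_frames",
        os.path.join("youtubevos_flows", "forward_flo"),
        os.path.join("youtubevos_flows", "backward_flo"),
    ]
    return [d for d in required
            if not os.path.isdir(os.path.join(data_root, d))]

missing = check_layout(os.path.abspath("FGT/myData"))
if missing:
    print("Missing directories:", missing)
```

Running this once before `python train.py` (and pointing the LAFC config at the same absolute path) fails fast instead of inside a spawned worker.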

hwpengTristin commented 1 year ago

The code now works successfully with the data organized as above.

I used 6 GPUs (Tesla V100 16GB) to train the FGT model and found that 10 hours of training covered only 5 epochs and 3232 iterations. The training details are as follows:

    22-11-30 09:06:13.689 - INFO: [epoch: 5, iter: 3200, lr:(1.000e-04, 1.000e-04, )] mean_loss: 2.8053e-01 mean_psnr: 27.30707745212495 mean_ssim: 0.9329263906482018 mean_l1: 13.470138085133744 mean_l2: 0.005076394467977023
    22-11-30 09:09:12.083 - INFO: [epoch: 5, iter: 3216, lr:(1.000e-04, 1.000e-04, )] mean_loss: 2.7653e-01 mean_psnr: 27.714950755848747 mean_ssim: 0.9371985766617273 mean_l1: 12.699865040580704 mean_l2: 0.004905342148719834
    22-11-30 09:10:52.628 - INFO: [epoch: 5, iter: 3232, lr:(1.000e-04, 1.000e-04, )] mean_loss: 2.9692e-01 mean_psnr: 27.26850985644835 mean_ssim: 0.9348241097911801 mean_l1: 12.600507330246913 mean_l2: 0.00498391306565287

Your published paper states that FGT is trained for 500K iterations on YouTube-VOS. I would like to know which devices you used and how many days the FGT model took to train.
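For reference, the throughput implied by the numbers above can be extrapolated with a rough back-of-envelope calculation (the figures are taken directly from the comment; this is only an estimate, not a measured result):

```python
# Rough extrapolation: 3232 iterations took about 10 hours on this setup.
# At that rate, how long would the paper's 500K iterations take?
iters_done = 3232
hours_spent = 10.0
target_iters = 500_000

iters_per_hour = iters_done / hours_spent       # roughly 323 it/h
est_days = target_iters / iters_per_hour / 24   # roughly 64 days at this rate
print(f"{iters_per_hour:.0f} it/h -> about {est_days:.0f} days for 500K iterations")
```

That gap against the author's reported 9 days is what motivates the IO discussion below.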

Thank you for your amazing work!

hitachinsk commented 1 year ago

Thanks for your interest in our work. According to your training log, convergence is normal, but the speed is much slower than ours. The final FGT model takes about 9 days to train; I used 4 Tesla V100 GPUs (16G) to train the 500K-iteration model. You could try some acceleration strategies to improve the training speed, e.g., loading the data into a cache to reduce IO.
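The caching suggestion could be sketched as follows. This is a hypothetical helper rather than repository code: a standard Middlebury `.flo` reader wrapped in `functools.lru_cache`, so each flow file is decoded from disk at most once per process and later epochs are served from memory:

```python
# Cache decoded optical-flow files in memory to cut repeated disk IO.
from functools import lru_cache

import numpy as np

@lru_cache(maxsize=4096)
def read_flo_cached(path):
    """Read a Middlebury .flo file once; repeated calls hit the cache."""
    with open(path, "rb") as f:
        magic = np.fromfile(f, np.float32, count=1)[0]
        assert magic == 202021.25, "invalid .flo file"
        w = int(np.fromfile(f, np.int32, count=1)[0])
        h = int(np.fromfile(f, np.int32, count=1)[0])
        data = np.fromfile(f, np.float32, count=2 * w * h)
    return data.reshape(h, w, 2)
```

The cache size would need tuning to the available RAM; for YouTube-VOS-scale data, a dedicated RAM disk or OS page cache warm-up is an alternative.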

hitachinsk commented 1 year ago

[training log screenshot] This is my training log; maybe it can serve as a reference.

hwpengTristin commented 1 year ago

Thank you for your valuable suggestion. When I run the FGT model, I find that the GPU utilization is not high, as shown below. Is this normal?

[GPU utilization screenshot]

hitachinsk commented 1 year ago

Maybe this is caused by your CPU usage. Since the CPU is blocked by IO and by the diffusion of the optical flows (the preprocessing we use in the flow completion stage), the GPU cannot be fully exploited.
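One generic way to hide CPU/IO latency of this kind is to load the next sample in a background thread while the GPU consumes the current one (PyTorch's DataLoader workers do essentially this). A minimal stand-alone sketch, where `loader_fn` is a stand-in for the real frame/flow loading and not this repository's code:

```python
# Background prefetching: overlap data loading with GPU compute.
import queue
import threading

def prefetch(loader_fn, indices, depth=4):
    """Yield loader_fn(i) for each index, loading ahead in a worker thread."""
    q = queue.Queue(maxsize=depth)  # bounded so the worker cannot run away
    sentinel = object()

    def worker():
        for i in indices:
            q.put(loader_fn(i))
        q.put(sentinel)  # signal end of data

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = q.get()
        if item is sentinel:
            return
        yield item
```

In practice, raising `num_workers` on the existing DataLoader is the first thing to try; this sketch just illustrates the mechanism.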

hitachinsk commented 1 year ago

We are also exploring strategies to accelerate the training process (because I think 9 days is somewhat long for practical usage). A practical solution is to adopt spynet as the flow extractor during training (just as E2FGVI does) and to replace it with the completed optical flows from LAFC during inference. In this manner you do not have to read extra optical flows during training, which may accelerate it.
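The training/inference switch described above could be structured like this. Both flow sources here are hypothetical stand-ins (`spynet` for an on-the-fly estimator as in E2FGVI, `load_lafc_flows` for reading LAFC-completed flows); this is an illustration of the idea, not the repository's API:

```python
# Train with a cheap on-the-fly flow estimator (no .flo disk reads);
# switch to higher-quality precomputed LAFC flows at inference time.
def get_flows(frames, mode, spynet=None, load_lafc_flows=None):
    if mode == "train":
        # estimate flows on the GPU; avoids reading extra flow files
        return spynet(frames)
    # inference: use the completed optical flows from LAFC
    return load_lafc_flows(frames)
```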

hwpengTristin commented 1 year ago

Thanks for the advice on accelerating model training. I look forward to your further strategies for speeding up the training process; it would be terrific if the training time could be reduced further.

hwpengTristin commented 1 year ago

I would like to know whether the 9-day training includes both stages, i.e., the LAFC network and the FGT network, or whether the FGT network alone requires 9 days of training.

Moreover, it seems that the optical flows are extracted in advance by the LAFC network, and the dataset is arranged as follows before training the FGT network.

    FGT/myData
    |- youtubevos_frames
    |  |- bmx-bumps
    |     |- <00000>.jpg
    |     |- <00001>.jpg
    |- youtubevos_flows
       |- backward_flo
       |  |- bmx-bumps
       |     |- <00000>.flo
       |     |- <00001>.flo
       |- forward_flo
          |- bmx-bumps
             |- <00000>.flo
             |- <00001>.flo

So I don't understand the suggestion to "adopt spynet as the flow extractor during training (just as E2FGVI does), and replace it with the completed optical flows from LAFC during inference, because in such manner, you do not have to read extra optical flows during training, which may accelerate training". In my understanding, there is no need for optical-flow extraction during the second stage, i.e., FGT network training.

hitachinsk commented 1 year ago

The 9-day training is only for the FGT network, not the whole pipeline. The first stage is fast, though; it only takes about two days.

Our flow completion process consists of two parts: the first is Laplacian filling, and the second is the LAFC network. The Laplacian filling process requires solving a Laplace equation, which needs a powerful CPU. What's more, reading optical flows at full resolution puts extra pressure on the IO, which may also lower the training speed (as in your training log). Therefore, if you adopt spynet as the flow completion operator (just as E2FGVI does), you can avoid LAFC during training (for more details, please refer to the E2FGVI paper). But the flows completed by spynet are of lower quality than our completion results, so performance may degrade. That is why, during inference, you can adopt the optical flows completed by LAFC for better performance.
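To make the Laplacian-filling idea concrete: solving the Laplace equation over the masked region interpolates the missing flow values smoothly from their boundary. The sketch below uses a simple Jacobi iteration on one flow channel; it illustrates the principle only and is not the repository's actual solver:

```python
# Laplacian filling sketch: iterate until masked values satisfy the
# discrete Laplace equation (each filled pixel = mean of its neighbors).
import numpy as np

def laplacian_fill(flow, mask, iters=500):
    """Fill flow values where mask is True via Jacobi iteration."""
    out = flow.astype(np.float64).copy()
    out[mask] = 0.0  # initial guess inside the hole
    for _ in range(iters):
        avg = 0.25 * (np.roll(out, 1, 0) + np.roll(out, -1, 0)
                      + np.roll(out, 1, 1) + np.roll(out, -1, 1))
        out[mask] = avg[mask]  # update only the masked region
    return out
```

Even this toy version shows why the step is CPU-heavy: each filled pixel needs many sweeps over the hole, at full flow resolution, before convergence.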

hwpengTristin commented 1 year ago

Thanks so much for sharing the details; I now have a much better understanding of your model. You have done fantastic research work!

hitachinsk commented 1 year ago

You are welcome. I will close this issue. If you have any further questions, please feel free to ask. Thanks for your interest in our work.