About the training logs and the codes

navervision / lincir

Official Pytorch implementation of LinCIR: Language-only Training of Zero-shot Composed Image Retrieval (CVPR 2024)


About the training logs and the codes #5

Closed by Pefect96 8 months ago

Pefect96 commented 9 months ago

I've run the training code, but I can't find the log file in the logs folder. Also, there may be a problem in the training code: the training loop does not seem to stop after max_train_steps, possibly because it never exits the `while True` loop (https://github.com/navervision/lincir/blob/b1ce7d283ab92c0f131972c71d5fed1ce54f23ac/train_phi.py#L222C1-L222C1).

geonm commented 9 months ago

Thank you for reporting this.

We'll address it as soon as possible next week.

geonm commented 9 months ago

As shown in the screenshot attached to this comment, the tensorboard logs are stored in the directory output_dir/logs/zeroshot-cir.

Note that we ran with the option --output_dir ./large_test.
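To check quickly whether any logs were written, one can search the logs folder for TensorBoard event files. This is a minimal sketch, not part of the repository: `find_event_files` is a hypothetical helper, and it assumes only that the logs live under `<output_dir>/logs` as described above and that the files follow TensorBoard's usual `events.out.tfevents.*` naming.

```python
from pathlib import Path


def find_event_files(output_dir: str) -> list[Path]:
    """Return TensorBoard event files found under <output_dir>/logs."""
    logs_dir = Path(output_dir) / "logs"
    if not logs_dir.is_dir():
        # No logs folder at all: nothing was written.
        return []
    # TensorBoard writers name their files "events.out.tfevents.<...>".
    return sorted(logs_dir.rglob("events.out.tfevents.*"))
```

Calling `find_event_files("./large_test")` after a run returns an empty list when no tracker wrote anything.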

Regarding this issue:

> Besides, there may be a problem in the training code that the training loop does not seem to stop after max_train_steps, possibly because it does not exit the While True loop.

We simply replaced break with exit(). Please see the PR https://github.com/navervision/lincir/pull/6
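The reported behaviour is what one would expect when a `break` sits in a `for` loop nested inside `while True`: it leaves only the inner loop, so the outer loop simply starts another pass. The sketch below is not the code from train_phi.py; `run_training`, `batches_per_epoch`, and the `done` flag are hypothetical names used for illustration. It shows one alternative to calling exit(), namely a flag that lets both loops stop.

```python
def run_training(max_train_steps: int, batches_per_epoch: int) -> int:
    """Toy loop that stops once global_step reaches max_train_steps."""
    if batches_per_epoch < 1:
        raise ValueError("batches_per_epoch must be at least 1")
    global_step = 0
    done = False
    while not done:  # outer loop: one pass per epoch
        for _ in range(batches_per_epoch):  # inner loop: one pass per batch
            global_step += 1  # stand-in for one optimizer step
            if global_step >= max_train_steps:
                # A bare `break` leaves only this inner loop; the flag
                # is what lets the outer loop stop as well.
                done = True
                break
    return global_step
```

Compared with exit(), a flag (or a `return` from a training function) lets any code placed after the loop, such as a final checkpoint save, still run.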

Pefect96 commented 9 months ago

Thank you for your reply. However, when I run with --output_dir ./large_test, only phi_best.pt is saved (at large_test/checkpoints/phi_best.pt); the tensorboard logs are not generated.

geonm commented 9 months ago

It's weird.

Are you currently employing the latest version from the master branch?

Could you tell us which version of Hugging Face accelerate you are using?

Ours is 0.26.1

Pefect96 commented 9 months ago

Yes, I am using the master branch, and my accelerate version is 0.26.1.

geonm commented 9 months ago

Oh... Okay.

Could you please share the script you used for training with us?

Pefect96 commented 9 months ago

OK, I used the training command you provide:

python -m torch.distributed.run --nproc_per_node 1 --nnodes 1 --node_rank 0 --master_addr localhost --master_port 5100 train_phi.py --batch_size 256 --output_dir ./large_test --cirr_dataset_path /home/dev/datasets/cir/CIRR --mixed_precision fp16 --clip_model_name large --validation_steps 1000 --checkpointing_steps 1000 --seed 12345 --lr_scheduler constant_with_warmup --lr_warmup_steps 0 --max_train_steps 20000

geonm commented 9 months ago

We've tried it several times but there were always tensorboard logs in the output_dir.

We used the docker image nvcr.io/nvidia/pytorch:23.12-py3.

Pefect96 commented 8 months ago

My GPU is an RTX 3090 Ti, and when I run the above command directly without any extra settings, this error appears: NotImplementedError: Using RTX 3090 or 4000 series doesn't support faster communication broadband via P2P or IB. Please set `NCCL_P2P_DISABLE="1"` and `NCCL_IB_DISABLE="1"` or use `accelerate launch` which will do this automatically.

Thus, I have to set NCCL_P2P_DISABLE="1" and NCCL_IB_DISABLE="1" for the program to run normally. I don't know whether this is why the logs are not generated. @geonm
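For reference, the two variables named in the error message can also be set from Python instead of the shell. This is only a sketch: `disable_nccl_p2p_ib` is a hypothetical helper, and it assumes the variables are read when the training script initialises accelerate, so the call would have to happen before that point.

```python
import os
from typing import MutableMapping


def disable_nccl_p2p_ib(env: MutableMapping[str, str] = os.environ) -> None:
    """Set the two NCCL variables named in the error message to "1"."""
    env["NCCL_P2P_DISABLE"] = "1"
    env["NCCL_IB_DISABLE"] = "1"
```

Calling `disable_nccl_p2p_ib()` at the very top of a launcher script should have the same effect as exporting both variables in the shell before running the command.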

In addition, the following warning appears when running: /python3.9/site-packages/accelerate/accelerator.py:393: UserWarning: `log_with=tensorboard` was passed but no supported trackers are currently installed.

geonm commented 8 months ago

Thank you for sharing the details with us.

Yes. I think I have found a solution.

You may need to install the tensorboard library to track your training logs.

Please run pip install tensorboard, and the trainer will save all tensorboard logs.
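Because the missing package only showed up as a warning, a small check before training can turn it into an early, explicit failure. The following is a sketch using only the standard library; `is_installed` and `check_tracker` are hypothetical helpers and are not part of accelerate or of this repository.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


def check_tracker(module_name: str = "tensorboard") -> None:
    """Raise early with an install hint if the tracker package is missing."""
    if not is_installed(module_name):
        raise RuntimeError(
            f"{module_name} is not installed; run `pip install {module_name}` "
            "before training, otherwise tracker logs may not be written."
        )
```

Running `check_tracker()` at the start of a training script would stop the run right away instead of training for 20000 steps without logs.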

Pefect96 commented 8 months ago

OK, thank you very much, this solved the problem! Perhaps you could add this to the README so that other researchers know to install it.