yifan123 closed this issue 2 months ago
Unfortunately, I wasn't able to reproduce the problem. This kind of problem can be caused by a particular combination of GPU model and PyTorch/HuggingFace/DeepSpeed/CUDA versions. I'd suggest retrying in a fresh conda environment with the latest versions of all dependencies installed.
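If it helps to compare setups, the versions mentioned above can be collected with a small script. This is only a sketch under assumptions: the package list is a guess at the relevant distributions (`transformers` is taken to be the HuggingFace package meant here), and the function names are hypothetical, not part of any of these libraries.

```python
import platform
from importlib import metadata

# Assumed distribution names as they would appear in `pip list`; adjust as needed.
PACKAGES = ["torch", "transformers", "deepspeed", "lean-dojo"]


def package_versions(names):
    """Return {name: installed version, or 'not installed'} for each name."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


def gpu_info():
    """Return the CUDA version PyTorch was built with and the visible GPU names."""
    try:
        import torch
    except (ImportError, OSError):
        # torch is missing, or its shared libraries failed to load.
        return {"cuda": "torch not importable", "gpus": []}
    if not torch.cuda.is_available():
        # torch.version.cuda is None for CPU-only builds.
        return {"cuda": torch.version.cuda, "gpus": []}
    gpus = [torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())]
    return {"cuda": torch.version.cuda, "gpus": gpus}


def environment_report():
    """Combine Python, package, and GPU information into one dictionary."""
    report = {"python": platform.python_version()}
    report.update(package_versions(PACKAGES))
    report.update(gpu_info())
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

Pasting the printed report from both a working and a failing machine into the thread would make it easier to spot which component differs.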
Hi, thanks for your work. I am trying to reproduce the training process of the Premise Retriever. Training runs without issues, but there is a bug during testing. I followed the installation instructions in the README, and the failure looks like an out-of-bounds GPU memory access.
My script:
python retrieval/main.py fit --config retrieval/confs/cli_lean4_random.yaml --trainer.logger.name train_retriever_random --trainer.logger.save_dir logs/train_retriever_random
Env:
lean-dojo 2.0.3
torch 2.3.0
deepspeed 0.14.5
reprover newest
Outputs: