PanYuQi66666666 closed this 8 months ago
Hi,
I hope this gives you some quick insight into the cost. In our experiments, the model was trained on a single A100, and the entire training procedure took about 2 days. In detail, 10 hours were spent on the first cross-entropy training (step 2 in the readme), 14 hours on the end-to-end cross-entropy training (step 3), 18 hours on reinforcement learning (step 5), and a final 6 hours on end-to-end reinforcement training (step 6).
This adds up to 48 hours of training. Keep in mind that there are also feature generation steps (step 1 and step 4), which can take several hours. That part is unoptimized because it wasn't the focus of our work, so it can be improved a lot.
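As a quick check of the arithmetic, the per-stage hours quoted above can be summed directly. This is only a sketch: the dictionary keys are descriptive labels chosen here, not names used by the project.

```python
# Per-stage training times in hours, as reported in the reply above.
# The keys are descriptive labels only (not identifiers from the repository).
stage_hours = {
    "step 2: first cross-entropy training": 10,
    "step 3: end-to-end cross-entropy training": 14,
    "step 5: reinforcement learning": 18,
    "step 6: end-to-end reinforcement training": 6,
}

total_hours = sum(stage_hours.values())
total_days = total_hours / 24

print(f"total: {total_hours} hours = {total_days:.1f} days")
```

The feature generation steps (step 1 and step 4) are not included, since only "several hours" is given for them.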
Now, this was on the A100. If your GPUs have less memory, you might need to split the batch into smaller chunks, which increases the number of accumulation steps and possibly slows down training. Ideally, any GPU that can fit the end-to-end model should be OK: we used the A100, but an NVIDIA RTX 3070 with 8GB should also be enough. Note that the code supports multi-GPU training, so several smaller GPUs should help a lot. If you have a GPU more powerful than the A100, training should take less than 2 days.
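To illustrate the trade-off between batch splitting and accumulation steps, here is a minimal sketch. The function name, parameter names, and batch sizes are hypothetical and are not taken from the project's training scripts; it only shows how the number of accumulation steps grows as the per-GPU micro-batch shrinks, and how extra GPUs bring it back down.

```python
import math


def accumulation_steps(target_batch_size: int,
                       micro_batch_size: int,
                       num_gpus: int = 1) -> int:
    """Micro-batches to accumulate per optimizer step so that the
    effective batch size reaches at least `target_batch_size`.

    All names here are illustrative assumptions, not project options.
    """
    if min(target_batch_size, micro_batch_size, num_gpus) < 1:
        raise ValueError("all arguments must be positive integers")
    samples_per_pass = micro_batch_size * num_gpus
    return math.ceil(target_batch_size / samples_per_pass)


# Hypothetical numbers: a batch of 48 that fits on one large GPU...
one_big_gpu = accumulation_steps(48, micro_batch_size=48)
# ...must be split into 4 chunks if only 12 samples fit in memory...
one_small_gpu = accumulation_steps(48, micro_batch_size=12)
# ...but two such GPUs halve the number of accumulation steps.
two_small_gpus = accumulation_steps(48, micro_batch_size=12, num_gpus=2)

print(one_big_gpu, one_small_gpu, two_small_gpus)
```

More accumulation steps mean more forward/backward passes per optimizer update, which is why a smaller GPU may train more slowly even though the effective batch size is unchanged.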
I hope this was useful. Apologies, but I'll probably only be able to reply in a few days if more details are needed.
Best Regards. Jia Cheng
@jchenghu Hi, it's me again. Where can I find the requirements.txt file for the code? Could you please provide it so that I can configure the environment on the server?
Hi,
Sure thing, this is the requirements.txt generated by pipreqs:
h5py==3.8.0
numpy==1.21.6
onnx==1.14.0
onnxruntime==1.14.1
onnxsim==0.4.17
Pillow==9.0.1
Pillow==10.2.0
pycuda==2022.1
tensorrt==8.0.1.6
torch==1.12.1+cu113
torchvision==0.10.0+cu111
Note, however, that some of these libraries are specific to TensorRT. You might not need them (and will likely have issues installing them), so you might prefer this version instead:
h5py==3.8.0
numpy==1.21.6
Pillow==9.0.1
Pillow==10.2.0
torch==1.12.1+cu113
torchvision==0.10.0+cu111
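Note that the list above pins Pillow twice with different versions, and pip will generally refuse a requirements file that asks for two different versions of the same package. A small sketch like the following can flag such duplicates before installing. The helper name `find_conflicting_pins` is hypothetical, and the parser only handles simple `name==version` lines.

```python
from collections import defaultdict


def find_conflicting_pins(lines):
    """Return {package: [versions]} for packages pinned to more than
    one version. Only plain `name==version` lines are considered."""
    pins = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        # Package names are case-insensitive, so normalise to lowercase.
        pins[name.strip().lower()].add(version.strip())
    return {name: sorted(vers) for name, vers in pins.items() if len(vers) > 1}


# The reduced requirements list exactly as posted above.
requirements = """\
h5py==3.8.0
numpy==1.21.6
Pillow==9.0.1
Pillow==10.2.0
torch==1.12.1+cu113
torchvision==0.10.0+cu111
""".splitlines()

conflicts = find_conflicting_pins(requirements)
print(conflicts)
```

Which of the two Pillow versions to keep is not stated in the thread, so the sketch only reports the conflict rather than resolving it.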
I will also upload the requirements.txt to the project, thank you for pointing this out.
Please let me know if everything is fine so I can close the issue and upload the requirements file. If you encounter a different problem, feel free to open a new issue.
@jchenghu OK, thanks! If I have any other questions, I will ask again.
Hello, I'd like to know approximately how much computing power is required to train the code?