materialsvirtuallab / matgl

Graph deep learning library for materials
BSD 3-Clause "New" or "Revised" License

Issues with training M3GNet potential on GPUs. #197

Closed. txy159 closed this issue 10 months ago.

txy159 commented 10 months ago

Version

v0.8.5 and v0.7.1

Which OS(es) are you using?

What happened?

Dear developers,

I'm trying to train an M3GNet potential using the code from the tutorial (https://matgl.ai/tutorials%2FTraining%20a%20M3GNet%20Potential%20with%20PyTorch%20Lightning.html).

Training the potential on a CPU went smoothly without any issues. However, when I switched to a GPU node for training, I ran into several errors.

I made the following adjustments to the code to enable training on a GPU node.

trainer = pl.Trainer(max_epochs=1, accelerator="gpu", devices=[0], logger=logger, inference_mode=False)
trainer.fit(model=lit_module_finetune, train_dataloaders=train_loader, val_dataloaders=val_loader)
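For context, here is a minimal, self-contained sketch of the same Trainer wiring on a single GPU. The DummyModule and toy dataloader below are stand-ins invented purely for illustration and are not part of matgl or the tutorial; only the Trainer arguments mirror the snippet above.

import pytorch_lightning as pl
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Throwaway stand-in for the tutorial's lit_module_finetune, used only to
# show the Trainer/GPU wiring.
class DummyModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(1, 1)

    def training_step(self, batch, batch_idx):
        (x,) = batch
        return nn.functional.mse_loss(self.layer(x), x)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

loader = DataLoader(TensorDataset(torch.randn(8, 1)), batch_size=4)

# Same Trainer arguments as above: accelerator="gpu" with devices=[0] runs on
# the first GPU, and inference_mode=False keeps autograd available during
# validation (the potential's forces are computed via autograd, which
# inference mode would disable).
trainer = pl.Trainer(
    max_epochs=1,
    accelerator="gpu",
    devices=[0],
    inference_mode=False,
    logger=False,
)
trainer.fit(model=DummyModule(), train_dataloaders=loader)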

With this adjustment, the following error occurred:

(screenshot of the error traceback)

I also tried setting the default device to one specific GPU, but I encountered another error:

(screenshots of the error tracebacks)

Do you have any suggestions on fixing these errors? Thanks in advance.

shyuep commented 10 months ago

Please read the documentation on how to train on GPUs; we have specifically added a section on this. There is a proper PyTorch way to do it.

txy159 commented 10 months ago

Thanks, I had already gone through the documentation before reaching out about this problem.

I found that changing the "generator" parameter passed to the DataLoader solved the issue; see:

https://stackoverflow.com/questions/68621210/runtimeerror-expected-a-cuda-device-type-for-generator-but-found-cpu

data_loader = data.DataLoader(
    ...,
    generator=torch.Generator(device='cuda'),
)
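For reference, a minimal, self-contained sketch of the fix outside matgl. The toy dataset, batch size, and the torch.set_default_device call (PyTorch 2.0+) are assumptions made for illustration; only the generator=torch.Generator(device='cuda') argument is the fix itself.

import torch
from torch.utils import data

# Assumption for illustration: the training script sets the default device to
# the GPU, which is what makes the DataLoader's internal CPU generator fail.
torch.set_default_device("cuda")

# Toy dataset standing in for the real graph dataset.
dataset = data.TensorDataset(torch.arange(8, dtype=torch.float32).reshape(8, 1))

# shuffle=True makes the sampler call torch.randperm with the supplied
# generator; with a CUDA default device, that generator must also live on
# CUDA, otherwise PyTorch raises "Expected a 'cuda' device type for
# generator but found 'cpu'".
loader = data.DataLoader(
    dataset,
    batch_size=2,
    shuffle=True,
    generator=torch.Generator(device="cuda"),
)

for (batch,) in loader:
    print(batch.device, batch.shape)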