materialsvirtuallab / matgl

Graph deep learning library for materials
BSD 3-Clause "New" or "Revised" License

Issues with training M3GNet potential on GPUs. #197

Closed. txy159 closed this issue 10 months ago.

txy159 commented 10 months ago

Version

v0.8.5 and v0.7.1

Which OS(es) are you using?

What happened?

Dear developers,

I'm trying to train an M3GNet potential using the code from the tutorial (https://matgl.ai/tutorials%2FTraining%20a%20M3GNet%20Potential%20with%20PyTorch%20Lightning.html).

Training the potential on a CPU went smoothly without any issues. However, when I switched to a GPU node for training, I ran into several errors.

I made the following adjustments to the code to enable training on a GPU node.

trainer = pl.Trainer(max_epochs=1, accelerator="gpu", devices=[0], logger=logger, inference_mode=False)
trainer.fit(model=lit_module_finetune, train_dataloaders=train_loader, val_dataloaders=val_loader)
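For context, here is a minimal, self-contained sketch of the same Trainer wiring on a single GPU. The DummyModule and toy dataloader below are stand-ins invented purely for illustration and are not part of matgl or the tutorial; only the Trainer arguments mirror the snippet above.

import pytorch_lightning as pl
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Throwaway stand-in for the tutorial's lit_module_finetune, used only to
# show the Trainer/GPU wiring.
class DummyModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(1, 1)

    def training_step(self, batch, batch_idx):
        (x,) = batch
        return nn.functional.mse_loss(self.layer(x), x)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

loader = DataLoader(TensorDataset(torch.randn(8, 1)), batch_size=4)

# Same Trainer arguments as above: accelerator="gpu" with devices=[0] runs on
# the first GPU, and inference_mode=False keeps autograd available during
# validation (the potential's forces are computed via autograd, which
# inference mode would disable).
trainer = pl.Trainer(
    max_epochs=1,
    accelerator="gpu",
    devices=[0],
    inference_mode=False,
    logger=False,
)
trainer.fit(model=DummyModule(), train_dataloaders=loader)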

With this adjustment, the following error occurred:

(screenshot of the error traceback)

I also tried setting the default device to one specific GPU, but I encountered another error:

(screenshots of the error tracebacks)

Do you have any suggestions on fixing these errors? Thanks in advance.

shyuep commented 10 months ago

Please read the documentation on how to train on GPUs; we have specifically added a section on this. There is a proper PyTorch way to do it.

txy159 commented 10 months ago

Thanks, I had already gone through the documentation before reaching out about this problem.

I found that changing the "generator" parameter passed to the DataLoader solved the issue; see:

https://stackoverflow.com/questions/68621210/runtimeerror-expected-a-cuda-device-type-for-generator-but-found-cpu

data_loader = data.DataLoader(
    ...,
    generator=torch.Generator(device='cuda'),
)
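For reference, a minimal, self-contained sketch of the fix outside matgl. The toy dataset, batch size, and the torch.set_default_device call (PyTorch 2.0+) are assumptions made for illustration; only the generator=torch.Generator(device='cuda') argument is the fix itself.

import torch
from torch.utils import data

# Assumption for illustration: the training script sets the default device to
# the GPU, which is what makes the DataLoader's internal CPU generator fail.
torch.set_default_device("cuda")

# Toy dataset standing in for the real graph dataset.
dataset = data.TensorDataset(torch.arange(8, dtype=torch.float32).reshape(8, 1))

# shuffle=True makes the sampler call torch.randperm with the supplied
# generator; with a CUDA default device, that generator must also live on
# CUDA, otherwise PyTorch raises "Expected a 'cuda' device type for
# generator but found 'cpu'".
loader = data.DataLoader(
    dataset,
    batch_size=2,
    shuffle=True,
    generator=torch.Generator(device="cuda"),
)

for (batch,) in loader:
    print(batch.device, batch.shape)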