ramanathanlab / mdlearn

Machine learning for molecular dynamics
MIT License

len(train_loader) is zero #69

Closed: atbogetti closed this issue 11 months ago

atbogetti commented 11 months ago

Describe the bug
Hi Alex! After preprocessing my data with SD2/SD4, I try to train a linear autoencoder on the data, but I keep getting ZeroDivisionError: float division by zero at the line avg_loss /= len(valid_loader) (line 523 in the traceback). If I change train_loader to train_loader.dataset I do get a nonzero length, but I'm wondering whether that is the correct fix in this case.
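
For reference, here is a minimal plain-PyTorch sketch (not mdlearn itself) of what I think is happening: len(loader) counts batches, not samples, so with drop_last=True and a batch size larger than the dataset it comes out as zero.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for my preprocessed features: 50 examples of dimension 60.
X = torch.randn(50, 60)
loader = DataLoader(TensorDataset(X), batch_size=64, drop_last=True)

print(len(loader))          # 0 -> floor(50 / 64) full batches with drop_last=True
print(len(loader.dataset))  # 50 -> number of samples, not batches
```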

To Reproduce
Steps to reproduce the behavior: I followed the few lines of example code on the mdlearn GitHub README with a 50x60 array (50 examples, 60 features) on a single CPU. mdlearn was installed through pip, and then manually after that. Torch was >2 at first, but this also happens with 1.13.1.


atbogetti commented 11 months ago

After trying a few things, I found I just needed to decrease my batch_size, since I don't have many data points in this first iteration of training my model. I was wondering, what does the batch_size control in this case? Is it okay for me to reduce it to 1 for my smaller initial dataset?

braceal commented 11 months ago

Hi Anthony! Yes, this is actually a really common issue, so thanks for posting it!

Problem
It's exactly as you say: the batch size is too large for the current dataset.

Explanation
The batch size is essentially the number of data examples that the model averages together to compute the loss during a training step. This is an important hyperparameter, especially in autoencoders. It's usually set to some power of 2, such as 16, 32, or 64.

This error message could be better, but here is what's happening: the data is randomly split between training and validation sets according to a configurable percentage, split_pct. Since there are only 50 examples and the batch size is greater than 50, there is not enough data to form even a single validation batch. When the average loss is then computed by dividing the running loss sum by the number of validation batches, it raises a ZeroDivisionError.
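
As a back-of-the-envelope check (the 0.8 value for split_pct below is an assumed default, not a confirmed one):

```python
n_examples = 50
split_pct = 0.8                  # assumed train fraction
batch_size = 64                  # larger than the whole dataset

n_valid = n_examples - int(n_examples * split_pct)  # 10 validation examples
n_valid_batches = n_valid // batch_size             # 0 full batches

# avg_loss /= n_valid_batches  # -> ZeroDivisionError: float division by zero
print(n_valid, n_valid_batches)  # prints: 10 0
```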

Solutions
The simplest fix is to use a smaller batch size, but not too small. I'd start with 16 to see if that helps. More data would help too 🙂
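
For concreteness, here's a sketch in the style of the README example. The batch_size keyword is inferred from this thread rather than checked against the current LinearAETrainer signature, so treat the exact parameter names as assumptions:

```python
import numpy as np
from mdlearn.nn.models.ae.linear import LinearAETrainer

# Stand-in for the real preprocessed features: (50 examples, 60 features).
X = np.random.randn(50, 60).astype(np.float32)

# batch_size=16 keeps the validation split large enough to form at
# least one batch; latent_dim and epochs are illustrative values.
trainer = LinearAETrainer(
    input_dim=60,
    latent_dim=3,
    batch_size=16,
    epochs=100,
)
trainer.fit(X, output_path="./run")
```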

Let me know if you have further issues!

atbogetti commented 11 months ago

Thank you very much for the detailed response, Alex. This all makes sense. I will adjust accordingly for my dataset.