Closed · jalane76 closed this issue 1 year ago
Hi @jalane76, the dataset that `SupervisedNE` expects is one that is fully compatible with PyTorch's `DataLoader`. Have you tried something like this, and does it work?

```python
sample = next(iter(torch.utils.data.DataLoader(dataset, batch_size=32)))
```
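As a self-contained sanity check, the same one-liner can be exercised with a dummy `TensorDataset` standing in for your real dataset (the shapes and sizes here are made up for illustration):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy stand-in for the real dataset: 100 samples, 3 features, binary labels
dataset = TensorDataset(torch.randn(100, 3), torch.randint(0, 2, (100,)))

# If the dataset is fully DataLoader-compatible, this yields one batch
inputs, labels = next(iter(DataLoader(dataset, batch_size=32)))
print(inputs.shape, labels.shape)  # torch.Size([32, 3]) torch.Size([32])
```

If this line fails on your real dataset, the problem is in the dataset itself rather than in EvoTorch.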
If it doesn't, you might get a recursion error, as per the source code:

```python
def get_minibatch(self) -> Any:
    ...
    try:
        batch = next(self.dataloader_iterator)
    ...
    if batch is None:
        self.dataloader_iterator = iter(self.dataloader)
        batch = self.get_minibatch()
```
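To see why this pattern can overflow the stack, consider a minimal sketch (illustrative only, not EvoTorch's actual code) in which every fetch fails, e.g. because the dataset is empty:

```python
def get_minibatch(dataloader):
    # Illustrative only: retry by recursing, mirroring the snippet above
    try:
        return next(iter(dataloader))
    except StopIteration:
        return get_minibatch(dataloader)  # recursive retry never terminates

try:
    get_minibatch([])  # empty "dataset": every fetch raises StopIteration
except RecursionError:
    print("the RecursionError masks the original problem")
```

The traceback you eventually see reports the exhausted stack, not the underlying data problem.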
Thank you for your reply. I've just now gotten back to this issue.
It turns out that I was indeed having difficulties with the first dataloader. I have a fixed dataloader that is producing sensible output when I get the next batch. The same dataloader is being used by the YOLO trainer to train the model using their built-in scripts.
Unfortunately, I still get a similar overflow error when trying to run SNES on a `SupervisedNE` problem using this dataloader. I've done some additional testing, and it seems the end of my dataset is being reached, which throws a `StopIteration` (expected behavior for a PyTorch `DataLoader` iterator). I'm not entirely sure that this is the exact error occurring when I use EvoTorch, because I don't actually see an error, most likely because the stack trace is unreliable due to the stack overflow.
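For example, exhausting any Python iterator raises `StopIteration`, and an iterator over a `DataLoader` behaves the same way once the last batch of an epoch is consumed:

```python
# Stand-in for iter(dataloader): any exhausted iterator raises StopIteration
it = iter([10, 20])
assert next(it) == 10
assert next(it) == 20
try:
    next(it)
except StopIteration:
    print("end of data reached")
```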
So, I'm still having trouble using `SupervisedNE`. I'm also having a hard time debugging, because overflowing the stack upon an error appears to be how `get_minibatch` was designed, and I question the wisdom of that choice. Maybe there is a good reason for it, but I'm having trouble understanding it. It seems you could do a simple check for a null `dataloader_iterator` and let the caught exceptions propagate, or immediately raise whatever errors you deem appropriate, instead of recursively calling `get_minibatch`.
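Something like this sketch is what I have in mind (illustrative only, not EvoTorch's actual code): handle iterator exhaustion with a single reset, and let any other exception propagate to the caller.

```python
class MinibatchSource:
    def __init__(self, dataloader):
        self.dataloader = dataloader       # any re-iterable of batches
        self.dataloader_iterator = None

    def get_minibatch(self):
        # Lazily create the iterator instead of recursing on None
        if self.dataloader_iterator is None:
            self.dataloader_iterator = iter(self.dataloader)
        try:
            return next(self.dataloader_iterator)
        except StopIteration:
            # End of epoch: restart once from a fresh iterator
            self.dataloader_iterator = iter(self.dataloader)
            return next(self.dataloader_iterator)

# A plain list stands in for a DataLoader here; batches wrap around:
src = MinibatchSource([["batch0"], ["batch1"]])
print([src.get_minibatch() for _ in range(5)])
# [['batch0'], ['batch1'], ['batch0'], ['batch1'], ['batch0']]
```

With this version, an empty dataloader simply raises `StopIteration` to the caller instead of recursing until the stack overflows.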
I am not very experienced with PyTorch DataLoaders, so maybe I am still making a mistake somewhere. Any further suggestions you may have would be great. Thanks.
Hello @jalane76, and thank you for raising this issue! Also, thank you @maulberto3 for your helpful remarks!
We just made a new branch named `fix/supervisedne` (https://github.com/nnaisense/evotorch/tree/fix/supervisedne), where the `get_minibatch()` method no longer uses recursion to handle the end of the data loader's minibatches. Would you like to install EvoTorch from that branch and run your example script again?
Also, in this new branch, you might want to take a look at the updated `Training_MNIST30K.ipynb` example, where we changed the algorithm to PGPE and adopted the hyperparameters we reported in the technical report. Perhaps you might want to start configuring your algorithm from there.
A few comments regarding your example code:

- One option is `device="cuda"`. Perhaps this will be more performant: your entire GPU will be used, without any need for interprocess communication between the actors.
- Another option is `device="cpu"`. This will put the main population on the CPU, but the remote actors should still use the GPU thanks to `num_gpus_per_actor=1/4`.
- The dataset yields `PIL.Image.Image` objects, not PyTorch tensors, which causes an error. I think the `Dataset` needs to be configured so that it transforms the images to tensors. I am not sure, but perhaps something like this is required:

```python
from torchvision import transforms
...
dataset = CocoDetection(coco_path, ann_path, transform=transforms.ToTensor())
```
Would you like to try with these suggestions, after switching to this new branch of EvoTorch? Feel free to let me know if something is not clear.
Hello @jalane76!
The pull request addressing this issue just got merged. The latest state of EvoTorch with the mentioned fix can now be installed from the repository via:
```
pip install git+https://github.com/nnaisense/evotorch
```
Thank you so much! I had meant to get back and try out the new branch, but I've been furiously writing a prelim, so I haven't had time. I'll get back to this soon.
Hello, I'm trying out evotorch to eventually do some black box optimization on the YOLOv8 model. However, I've run into the following error when trying to run SNES on a SupervisedNE problem on the COCO dataset.
Here is the console output:
And here is the code I'm using:
Looking at the source code for `get_minibatch` in `SupervisedNE`, it appears that if the batch is `None`, or there is an exception, then `get_minibatch` is called again. This would appear to lead to an infinite regress until the stack overflows, and no underlying error is surfaced to let me know what the problem is. I tried it on a few other datasets as well and got the same result.

Thank you!