Closed: juancq closed this issue 1 year ago
I got the pretraining script to complete the first epoch without errors when using `num_data_loader_workers` < 3. With `num_data_loader_workers` equal to 3 or 4, I was able to reproduce the error (it occurs during the first epoch).
I am using a batch size of 32 and a validation batch size of 32.
@juancq is this on the current main branch, the dev branch, or a modified version? And is it possible that some of your subjects have no observed static data? Based on the error, that is what is causing the issue, but I've not encountered that situation on any of my datasets, so I want to make sure it is expected.

Either way, it should be a relatively simple fix. I've pushed some code that may fix it here: https://github.com/mmcdermott/EventStreamGPT/compare/dev...fix_static_data_bug?expand=1. Since I don't have a test case for this issue I can't be sure it works, but you can try that branch out and see. Note that this branch, being derived from dev, includes some recent changes to the dataset structure, so it may conflict with any existing datasets you have created and saved to disk. If that is an issue, I can help migrate your datasets over, though unless you are too resource-constrained it may be best just to re-run your dataset creation script to get a new slice.
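The kind of guard described above (tolerating subjects with no observed static data during batch construction) might look roughly like the following. This is a minimal pure-Python sketch; `collate_static` and its argument names are hypothetical illustrations, not ESGPT's actual API:

```python
# Hypothetical sketch of defensively collating per-subject static data.
# A subject with no observed static data contributes an empty list, which
# naive code that indexes into it directly would crash on; here it is
# simply padded out instead.

def collate_static(subject_static_indices, pad_value=0):
    """Pad per-subject static index lists to a common length.

    subject_static_indices: one list of static measurement indices per
    subject; an empty list means the subject has no static data observed.
    """
    max_len = max((len(s) for s in subject_static_indices), default=0)
    batch = []
    for s in subject_static_indices:
        # Empty lists are padded entirely with pad_value rather than indexed.
        batch.append(list(s) + [pad_value] * (max_len - len(s)))
    return batch

print(collate_static([[3, 7], [], [5]]))  # [[3, 7], [0, 0], [5, 0]]
```

In the real dataset class the padded lists would become tensors and a padding mask, but the key point is the same: the empty-list case must be handled explicitly.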
Additionally, you can try applying just the diff shown in the link above to your local code and see if that solves it. If it does, it'd be great to get a test case for this as well.
@mmcdermott This was on the dev branch. Your commit 29c29f13d732468ccb217b559439b2abc41d25b9 fixed the issue.
Great! Then this should be fixed in dev by #67. Let me know if you're still seeing any issues.
When running pretraining, the program crashes in the middle of the first epoch. The error seems to be coming from here:
https://github.com/mmcdermott/EventStreamGPT/blob/bb689ae8f95aef2ebb243d0ba06e423eefee9d90/EventStream/data/pytorch_dataset.py#L534-L534
Here is the error trace:
I have not been able to replicate it consistently. I got pretraining to run to completion once. I am unsure what breaks it: the number of data loader worker processes, the batch size, or a combination of other factors.