Open bolipop opened 9 months ago
Hmm. One thing I'm wondering is if you can try just looping over the data without using MLX. Just to make sure this is an MLX issue and not something to do with the datasets package you are using.
It's also good to monitor your memory as you do so, to see if there is a leak or if you are using way too much (use Activity Monitor or asitop).
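A minimal sketch of that check, assuming the training data is available as some iterable of batches (`batches` below is a hypothetical stand-in for the real data loader). It loops over the data without touching MLX and prints the process's peak resident memory every few iterations, using only the standard-library `resource` module:

```python
import resource
import sys


def peak_rss_mb():
    """Peak resident set size of this process, in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def dry_run(batches, report_every=100):
    """Iterate over the data only (no model, no MLX) and log peak memory."""
    count = 0
    for count, _batch in enumerate(batches, start=1):
        if count % report_every == 0:
            print(f"batch {count}: peak RSS {peak_rss_mb():.1f} MiB")
    return count
```

If `dry_run(train_loader)` (with `train_loader` being whatever the script iterates over) completes and the reported peak stays roughly flat, the data pipeline is probably not the culprit. A steadily climbing number would point to a leak on the data side.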
When I comment out these lines, I'm able to loop through the entire dataset just fine.
_loss, grads = loss_helper(model, images, labels)
optimizer.update(model, grads)
mx.eval(model.parameters(), optimizer.state)
Looking at the memory usage, I suspect it's running out of memory.
It doesn't look to be out of memory. And it definitely shouldn't segfault. Does it segfault reliably for you? How far into the training?
I'm running your script on an M1 Max with 32 GB. So far no segfault 🤷‍♂️, I'm at iteration 600. Did it segfault before that?
Also what's your OS? What version of MLX are you using? (Commit hash if from source?)
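One way to collect those details in a single step is sketched below. It assumes MLX was installed from PyPI under the distribution name `mlx`; for a source build the commit hash still has to be looked up separately (for example with `git rev-parse HEAD` in the checkout). The helper names are made up for illustration.

```python
import platform
from importlib import metadata


def package_version(name):
    """Installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def environment_report(package="mlx"):
    """Collect the details usually asked for in a bug report."""
    return {
        "os": platform.platform(),
        "machine": platform.machine(),
        "python": platform.python_version(),
        package: package_version(package) or "not installed",
    }


for key, value in environment_report().items():
    print(f"{key}: {value}")
```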
macOS Sonoma 14.2.1, M2 Max 32 GB, Python 3.11.7
Yeah, I've had it segfault right away before. It's very sporadic. Sometimes it just hangs and I have to go and kill the process manually.
mlx ❯ python3 mobilenet/train.py
9it [00:04, 2.30it/s]zsh: segmentation fault python3 mobilenet/train.py
/Users/bento/.pyenv/versions/3.11.7/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
~/Repos/mlx-stuff/mlx-playground main* 8s
mlx ❯ python3 mobilenet/train.py
69it [01:18, 2.02it/s]zsh: segmentation fault python3 mobilenet/train.py
/Users/bento/.pyenv/versions/3.11.7/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
~/Repos/mlx-stuff/mlx-playground main* 1m 22s
mlx ❯ python3 mobilenet/train.py
12it [00:05, 2.29it/s]
You're right that it doesn't look to be out of memory. I thought the little widget on the right was tracking memory.
What about your MLX version (or commit hash if building from source)?
0.0.6
Not sure if it helps, but earlier I saw a bus error instead of a segfault.
I can't reproduce it either. I left it running for about an hour on my M2 Air. My initial thought was that it had to do with the implementation of separable convolution, which ends up having 1000 layers and concatenating 1000 arrays, but that doesn't seem to cause a problem at all.
@bolipop If you could run it again and let us know if it still segfaults for you with the latest MLX, that would be useful. I believe we've fixed the underlying issue, but it's hard to be sure since we never reproduced this exact one.
Hello, I'm trying to train a simple network (a MobileNet classifier). Training seems fine at first, but I get a segfault after a few batches. I'm hoping someone can point out what I'm doing wrong, or offer some pointers for debugging the segfault, since it just errors out with no decent traceback. Thanks!
Macbook Pro M2 Max 32GB
21it [00:11, 1.81it/s]zsh: segmentation fault python3 mobilenet/main.py
/Users/bento/.pyenv/versions/3.11.7/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
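On the missing traceback: a possible starting point, not specific to MLX, is Python's standard-library `faulthandler` module, which dumps the Python-level traceback when the process receives a fatal signal such as SIGSEGV or SIGBUS. The sketch below assumes it is placed at the very top of the training script; the function name is made up for illustration. It only shows which Python line was executing at the time of the crash, not what went wrong inside native code.

```python
import faulthandler
import sys


def enable_crash_tracebacks(stream=None):
    """Ask Python to dump every thread's traceback on a fatal signal."""
    # Covers SIGSEGV, SIGFPE, SIGABRT, SIGBUS, and SIGILL.
    # The stream must stay open for as long as the handler is enabled.
    target = stream if stream is not None else sys.stderr
    faulthandler.enable(file=target, all_threads=True)


# Usage: call enable_crash_tracebacks() once, before building the model
# or loading any data.
```

The same handler can be switched on without editing the script by running `python3 -X faulthandler mobilenet/main.py`.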