mosaicml / composer

Supercharge Your Model Training
http://docs.mosaicml.com
Apache License 2.0

Saving checkpoints on network drive fails due to symlinks #2942

Open eldarkurtic opened 9 months ago

eldarkurtic commented 9 months ago

Hi folks, I am using llm-foundry to train some LLMs, and I am trying to save checkpoints directly to a network drive (AWS on-prem storage). The issue I am hitting looks like this:

[Errno 524] Unknown error 524: 'ep0-ba2-rank0.pt' -> '/network/eldar/llmfoundry_checkpoints/test_x/latest-rank0.pt'

at the line:

File "/usr/local/lib/python3.10/dist-packages/composer/callbacks/checkpoint_saver.py", line 352, in _save_checkpoint
    os.symlink(os.path.relpath(src_path, os.path.dirname(symlink)), symlink)

FYI: saving to a local disk works just fine. I think the issue is that symlinks cannot be created on the network drive. For example, running `touch test1.txt && ln -s test1.txt test2.txt` on that drive results in the same Unknown error 524.
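The same check can be done from Python before training starts. The following is a minimal sketch, not part of Composer: `supports_symlinks` is a hypothetical helper name, and it catches `OSError` broadly because the exact errno a given network filesystem returns may vary.

```python
import os
import uuid


def supports_symlinks(directory: str) -> bool:
    """Return True if a relative symlink can be created inside `directory`.

    Hypothetical helper for illustration; not a Composer API.
    Also returns False if the directory is not writable at all.
    """
    token = uuid.uuid4().hex
    target = os.path.join(directory, f".symlink_probe_{token}.target")
    link = os.path.join(directory, f".symlink_probe_{token}.link")
    try:
        # Create a small target file, then link to it by relative name,
        # mirroring `touch test1.txt && ln -s test1.txt test2.txt`.
        with open(target, "w") as f:
            f.write("probe")
        os.symlink(os.path.basename(target), link)
        return True
    except (OSError, NotImplementedError):
        return False
    finally:
        # Remove whatever the probe managed to create.
        for path in (link, target):
            if os.path.lexists(path):
                os.remove(path)
```

Calling `supports_symlinks("/network/eldar/llmfoundry_checkpoints/test_x")` on the affected machine would be expected to return `False`, while a local directory should return `True`.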

I was wondering whether you have any suggestions for bypassing this restriction (?) on creating symlinks on network drives. If not, is there a straightforward way to save checkpoints on the network drive but keep the symlinks on a local disk? After digging a bit through the Composer library, I feel this could be hacked relatively easily, but I'm wondering whether you think it might break other parts of either Composer or llm-foundry.

mvpatel2000 commented 9 months ago

You can specify save_latest_filename to keep the symlink on your local disk if that works for you. That seems like the easiest solution.
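The idea of keeping the "latest" link on a local disk while the checkpoint itself lives on the network drive can be sketched in plain Python. This is an illustration under stated assumptions, not Composer's implementation: `update_local_latest_symlink` is a hypothetical name, and how Composer resolves `save_latest_filename` relative to the save folder is not verified here. Note that the link must hold an absolute target, since a relative path would no longer resolve once the link sits in a different directory.

```python
import os


def update_local_latest_symlink(checkpoint_path: str, local_symlink: str) -> None:
    """Point a symlink on a local disk at a checkpoint stored elsewhere.

    Hypothetical helper for illustration; not a Composer API.
    """
    # Use an absolute target so the link works from any directory.
    target = os.path.abspath(checkpoint_path)
    link_dir = os.path.dirname(os.path.abspath(local_symlink))
    os.makedirs(link_dir, exist_ok=True)

    # Build the new link under a temporary name, then rename it over the
    # old one so readers never observe a missing "latest" link.
    tmp_link = local_symlink + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(target, tmp_link)
    os.replace(tmp_link, local_symlink)
```

For example, `update_local_latest_symlink("/network/eldar/llmfoundry_checkpoints/test_x/ep0-ba2-rank0.pt", "/tmp/latest-rank0.pt")` would create the link locally; the `/tmp` location is only a placeholder.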

For object stores, we emulate a symlink by creating a file that has the path to the checkpoint in its contents. We could try building a similar solution for a network drive, which seems like the "right" solution. Unfortunately, it's not something we will be able to build, since we don't have access to network drives to test it, but I'm happy to work with you and give some guidance if you're interested.
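A pointer-file emulation along those lines might look like the sketch below. It is an assumption about the general approach, not Composer's actual object-store code: the function names and the plain-text file format are hypothetical. The pointer stores a path relative to its own directory, mirroring the `os.path.relpath` call in the traceback above.

```python
import os


def write_latest_pointer(checkpoint_path: str, pointer_path: str) -> None:
    """Write a small text file that records where the latest checkpoint is.

    Hypothetical helper for illustration; not a Composer API.
    """
    pointer_dir = os.path.dirname(os.path.abspath(pointer_path))
    relative_target = os.path.relpath(checkpoint_path, pointer_dir)
    # Write to a temporary file first, then rename, so a reader never
    # sees a half-written pointer.
    tmp_path = pointer_path + ".tmp"
    with open(tmp_path, "w") as f:
        f.write(relative_target)
    os.replace(tmp_path, pointer_path)


def resolve_latest_pointer(pointer_path: str) -> str:
    """Read a pointer file and return the checkpoint path it refers to."""
    pointer_dir = os.path.dirname(os.path.abspath(pointer_path))
    with open(pointer_path) as f:
        relative_target = f.read().strip()
    return os.path.normpath(os.path.join(pointer_dir, relative_target))
```

Any code that loads from `latest-rank0.pt` would need to call something like `resolve_latest_pointer` first, because the file is no longer a link that the filesystem follows automatically. That loader-side change is the main cost of this design.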