Open eldarkurtic opened 9 months ago
You can specify save_latest_filename to keep the symlink on your local disk, if that works for you. That seems like the easiest solution.
For object stores, we emulate a symlink by creating a file that has the path to the checkpoint in its contents. We could try building a similar solution for a network drive, which seems like the "right" solution. Unfortunately, it's not something we will be able to build, since we don't have access to network drives to test this, but I'm happy to work with you and give some guidance if you're interested.
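To make the pointer-file idea concrete, here is a minimal sketch of what such an emulation could look like on a plain filesystem. This is not Composer's actual implementation; the function names write_latest_pointer and read_latest_pointer and the file name latest.pointer are hypothetical, chosen only for illustration.

```python
import os
import tempfile


def write_latest_pointer(pointer_path: str, checkpoint_path: str) -> None:
    """Record checkpoint_path in a small text file that stands in for a symlink.

    Hypothetical helper, not part of Composer.
    """
    directory = os.path.dirname(os.path.abspath(pointer_path))
    # Write to a temporary file in the same directory, then rename it over the
    # pointer, so readers never observe a half-written pointer file.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(checkpoint_path)
        os.replace(tmp_path, pointer_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def read_latest_pointer(pointer_path: str) -> str:
    """Return the checkpoint path stored in the pointer file."""
    with open(pointer_path) as f:
        return f.read().strip()
```

Anything that resumes from the "latest" checkpoint would then call read_latest_pointer instead of following a symlink. Whether the rename step behaves atomically on a given network drive is an assumption that would need testing there.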
Hi folks, I am using llm-foundry to train some LLMs, and I'm trying to save checkpoints directly to a network drive (AWS on-prem storage). The issue I am hitting looks like this:
at the line:
FYI: saving on a local disk works just fine. I think the issue is that symlinks cannot be created on the network drive. For example, running:
touch test1.txt && ln -s test1.txt test2.txt
results in the same "Unknown error 524".
I was wondering whether you have any suggestions on how to bypass this restriction on creating symlinks on network drives. If not, is there a straightforward way to save checkpoints on the network drive but keep the symlinks on a local disk? After digging a bit through the Composer lib, I feel that this could be hacked in relatively easily, but I'm wondering whether you think that might break some other parts of either Composer or llm-foundry.
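For reference, the same symlink check can be run from Python, which makes it easy to test a save folder before training starts. This is only a sketch: supports_symlinks is a hypothetical helper name, not something provided by Composer or llm-foundry.

```python
import os
import tempfile


def supports_symlinks(directory: str) -> bool:
    """Return True if a symlink can be created inside `directory`.

    Hypothetical helper mirroring the touch / ln -s test.
    """
    # Work in a scratch subdirectory that is removed automatically afterwards.
    with tempfile.TemporaryDirectory(dir=directory) as scratch:
        target = os.path.join(scratch, "test1.txt")
        link = os.path.join(scratch, "test2.txt")
        open(target, "w").close()  # equivalent of: touch test1.txt
        try:
            os.symlink("test1.txt", link)  # equivalent of: ln -s test1.txt test2.txt
        except (OSError, NotImplementedError):
            # Filesystems that reject symlinks raise OSError here.
            return False
        return os.path.islink(link)
```

Calling supports_symlinks on the network mount and on a local directory should show whether the two differ in the way the shell test suggests.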