pytorch / torchtune

PyTorch native finetuning library
https://pytorch.org/torchtune/main/
BSD 3-Clause "New" or "Revised" License

Not saving Phi3 Mini LoRA Adapter Checkpoint #1209

Open pbontrager opened 2 months ago

pbontrager commented 2 months ago

In the HF Checkpointer, we warn the user that the adapter weights can't be converted to the PEFT format and will be saved in a torchtune-specific format instead, but then we never actually save the adapter (code).
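
For reference, here is a minimal sketch of the kind of fallback save that appears to be missing; the function and argument names are illustrative, not torchtune's actual checkpointer API:

import os
import torch

def save_adapter_torchtune_format(adapter_state_dict: dict, output_dir: str, epoch: int) -> None:
    # Hypothetical fallback: persist the LoRA adapter weights in torchtune's
    # native format when they can't be converted to the PEFT format, so they
    # aren't silently dropped.
    output_path = os.path.join(output_dir, f"adapter_{epoch}.pt")
    # Plain torch.save of the adapter-only state dict; it can later be loaded
    # and merged back into the base model weights.
    torch.save(adapter_state_dict, output_path)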

zjost commented 2 days ago

When trying to LoRA fine-tune Phi3-mini with the lora_finetune_distributed recipe, I get the following error when it goes to save the model. I'm wondering if it's related to the above.

E1007 19:07:40.669000 140228583184192 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -9) local_rank: 0 (pid: 19446) of binary: /home/zak_jost/bin/python3.11
Traceback (most recent call last):
  File "/home/zak_jost/bin/tune", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/zak_jost/lib/python3.11/site-packages/torchtune/_cli/tune.py", line 49, in main
    parser.run(args)
  File "/home/zak_jost/lib/python3.11/site-packages/torchtune/_cli/tune.py", line 43, in run
    args.func(args)
  File "/home/zak_jost/lib/python3.11/site-packages/torchtune/_cli/run.py", line 194, in _run_cmd
    self._run_distributed(args, is_builtin=is_builtin)
  File "/home/zak_jost/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/zak_jost/lib/python3.11/site-packages/torchtune/_cli/run.py", line 95, in _run_distributed
    run(args)
  File "/home/zak_jost/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/zak_jost/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zak_jost/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/home/zak_jost/lib/python3.11/site-packages/recipes/lora_finetune_distributed.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-10-07_19:07:40
  host      : XXXX
  rank      : 0 (local_rank: 0)
  exitcode  : -9 (pid: 19446)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 19446

I'm not quite sure how to troubleshoot with that traceback.

This brings up a separate question, which perhaps belongs in a different thread: how would I go about using, e.g., the VS Code debugger? Normally I'd add a breakpoint to the relevant line of the recipe, but in this case I'm not even sure where the recipe source code lives once torchtune is installed. I see the recipes in the repo, but not in site-packages after installation, and the installed copy is presumably what I'd need to add debugger breakpoints to.

ebsmothers commented 2 days ago

@zjost not sure if it's related to this issue, since for Phi-3 we just skip the adapter weights save. But yeah I agree, unfortunately this is not a very useful traceback. For breakpoints in distributed runs you can try out torch.distributed.breakpoint() (see here).
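
As a rough illustration (the placement is hypothetical; only the torch.distributed.breakpoint call itself is the API referenced above), you would drop something like this right before the call you want to inspect in the recipe:

import torch.distributed as dist

# Requires an initialized process group, i.e. a live distributed run. This
# pauses every rank and opens a pdb session on rank 0; the other ranks block
# at a barrier until rank 0 exits the debugger.
dist.breakpoint(rank=0)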

Regarding getting access to the actual recipe code: the recipes folder is not officially part of the importable package you get with pip install, but it is part of the bundle that gets downloaded. This means that if torchtune sits in e.g. /my/path/to/conda/torchtune you should be able to find the recipe files in /my/path/to/conda/recipes. But the easier thing to do is probably just copy the recipe to a local path, then run your local version. That would look something like this:

tune cp lora_finetune_distributed my_local_lora_finetune_distributed.py

Then add the breakpoints locally, and you should be able to run your previous command but with lora_finetune_distributed replaced by my_local_lora_finetune_distributed. (Note that we are landing some changes here, so eventually you will probably need to use my_local_lora_finetune_distributed.py in the tune run command instead. But for the time being I think you will want the module name, not the file name.)
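
For example, assuming a Phi3-mini LoRA config and 2 processes (both placeholders for whatever your previous command used):

tune run --nproc_per_node 2 my_local_lora_finetune_distributed --config phi3/mini_lora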

zjost commented 1 day ago

Thanks for all the info! And you're right, my issue is unrelated (also happens with Llama2), so I opened #1762 to track it.