jq-wei opened this issue 3 weeks ago
I also have this problem when running these scripts on a server with multiple GPUs, but the GPT example runs successfully. Do you have any solution for this? It seems this code cannot run successfully in a single-server environment.
Hi,
First of all, thank you for the great work.
I am trying the llama example script with llama2-7b-hf and the following key packages:
When I run `torchrun --nproc-per-node 4 pippy_llama.py`, I get the following error on device 0:

I can trace it back to `_number_and_count_forward_stages` in `_IR.py`, and indeed `num_stages = 1`, because there is only one node with `node.op == "call_module"`; all the other nodes have `node.op == "call_function"`.
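For anyone reproducing the symptom: here is a minimal, self-contained FX sketch (a toy module, not the llama model or PiPPy's actual code) showing how a traced graph can end up with a single `call_module` node while everything else is `call_function` — standard `nn` leaf modules stay as `call_module`, but ops written as plain `torch` functions are recorded as `call_function`. The `num_stages` tally below is my own illustration of the counting behavior, not a copy of `_number_and_count_forward_stages`.

```python
import torch
import torch.fx as fx

class Toy(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(4, 4)  # leaf module -> recorded as call_module

    def forward(self, x):
        x = self.proj(x)
        return torch.relu(x)               # plain function -> recorded as call_function

gm = fx.symbolic_trace(Toy())
ops = [n.op for n in gm.graph.nodes]
print(ops)  # ['placeholder', 'call_module', 'call_function', 'output']

# Counting only call_module nodes yields 1 here, mirroring the
# num_stages = 1 observed in the issue.
num_stages = sum(1 for op in ops if op == "call_module")
print(num_stages)  # 1
```

So if the traced llama graph exposes only one submodule call, a stage count based on `call_module` nodes will collapse to 1 regardless of how many split points you intended.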
Just to dig deeper, I hard-coded the return value of `_number_and_count_forward_stages` to 4. Then I got the following error:

It seems the version-mismatch problem is still there. By the way, the same problems occur if I uninstall torchpippy.
Could you give me some hints?
Thank you very much!