huggingface / optimum-habana

Easy and lightning fast training of 🤗 Transformers on Habana Gaudi processor (HPU)
Apache License 2.0

Unable to run protein-folding example with the latest release: optimum-habana-1.11.1 #1014

Closed: mgujral closed this issue 2 months ago

mgujral commented 3 months ago

System Info

optimum-habana-1.11.1
optimum==1.19.1
SynapseAI version: 1.15.1
Docker image: vault.habana.ai/gaudi-docker/1.15.1/ubuntu22.04/habanalabs/pytorch-installer-2.2.0:latest

When I launch the protein-folding example, I get the following output. However, with the older release optimum-habana-1.9.0 and SynapseAI version 1.13, the protein-folding example worked fine.

/home/mgujral/optimum-habana-1.11.1/transformers/utils/generic.py:462: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
/home/mgujral/optimum-habana-1.11.1/transformers/utils/generic.py:319: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
/home/mgujral/optimum-habana-1.11.1/transformers/utils/generic.py:319: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
/home/mgujral/optimum-habana-1.11.1/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
Some weights of EsmForProteinFolding were not initialized from the model checkpoint at facebook/esmfold_v1 and are newly initialized: ['esm.contact_head.regression.bias', 'esm.contact_head.regression.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
============================= HABANA PT BRIDGE CONFIGURATION =========================== 
 PT_HPU_LAZY_MODE = 1
 PT_RECIPE_CACHE_PATH = 
 PT_CACHE_FOLDER_DELETE = 0
 PT_HPU_RECIPE_CACHE_CONFIG = 
 PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
 PT_HPU_LAZY_ACC_PAR_MODE = 1
 PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 1
---------------------------: System Configuration :---------------------------
Num CPU Cores : 96
CPU RAM       : 527939920 KB
------------------------------------------------------------------------------
Synapse detected a device critical error that requires a restart. Killing process in 5 seconds (hl: 7) 02:10:00 [please check log files for dfa cause]

Reproduction

python3 run_esmfold.py

The example does not have any additional requirements.

Expected behavior

The run should generate a protein structure file with the .pdb extension.

regisss commented 3 months ago

@mgujral I cannot reproduce this. Are you running

python run_esmfold.py

in https://github.com/huggingface/optimum-habana/tree/main/examples/protein-folding?

mgujral commented 3 months ago
[screenshot: esmfold]
mgujral commented 3 months ago

This is the easiest tool to run. I do not know what is going on.

mgujral commented 3 months ago
[screenshot: hl-smi output]
mgujral commented 3 months ago

I am using Gaudi HPU nodes.

libinta commented 3 months ago

@mgujral Can you collect logs with HABANA_PROFILE=./log LOG_LEVEL_ALL=0 set before your command, and attach the log folder? Thanks.
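In other words (a minimal sketch; the log directory is whatever path you pass to HABANA_PROFILE), the environment variables go on the same line as the launch command:

    HABANA_PROFILE=./log LOG_LEVEL_ALL=0 python run_esmfold.py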

mgujral commented 3 months ago

logsForGithub.tgz

These are the logs, with those variables set before launching the command. Appreciate your help!

mgujral commented 3 months ago

I looked at the file 'dfa_log.txt', and it had the following message at the top. Any thoughts on what it means?

hl thunk API failed, errCode memcopyIoctlFailed

libinta commented 3 months ago

@mgujral It is related to host memory allocation. Can you watch the host memory with top and capture the screen from start to crash? Also, what is the output of grep Huge /proc/meminfo? You can also change low_cpu_mem_usage from False to True at https://github.com/huggingface/optimum-habana/blob/main/examples/protein-folding/run_esmfold.py#L85 and retry. Another option is to request an 8x instance instead of 2x to see if the test passes. Thanks, Libin
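For reference, the suggested change on that line would look roughly like the snippet below. Treat it as a sketch of the low_cpu_mem_usage flag rather than the script's literal source; the surrounding code in run_esmfold.py may differ.

    from transformers import EsmForProteinFolding

    # low_cpu_mem_usage=True streams checkpoint weights in directly instead of
    # first materializing a randomly initialized copy, lowering peak host RAM.
    model = EsmForProteinFolding.from_pretrained(
        "facebook/esmfold_v1",
        low_cpu_mem_usage=True,  # the example ships with False on that line
    )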

mgujral commented 3 months ago

When I try with 8 HPU cards, the warning and error messages are the same as before. Altering the line run_esmfold.py#L85 does not make any difference either. When I check memory usage with the 'top' command, maximum memory usage goes to about 4.5%. I noticed one more thing: when I try any application in the language-modeling folder, I get the identical 'Synapse detected a device critical error......' message.

grep Huge /proc/meminfo

AnonHugePages:         0 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:   48000
HugePages_Free:    48000
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:        98304000 kB
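As a cross-check of this output: HugePages_Total × Hugepagesize = 48000 × 2048 kB = 98,304,000 kB (about 93.75 GiB), which matches the Hugetlb line, and HugePages_Free equals HugePages_Total, so the host's 2 MiB huge page pool is present but completely unused.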

libinta commented 3 months ago

When you say you tried with 8 HPU cards, you are still using python run_esmfold.py, not gaudi_spawn with 8 processes, right? Can you also run sudo dmesg -T > dmesg.txt? Thanks.

mgujral commented 3 months ago

Yes, with 8 HPU cards I am still using python run_esmfold.py. Thank you for your patience and time. dmesg.txt

libinta commented 3 months ago

It seems that, except for the pin memory issue, nothing is printed out. It looks like the machine is already in a bad state. Are you on AWS? And is the issue only happening on this pod? Thanks.

mgujral commented 3 months ago

I am on the Voyager cluster at UCSD. It is a dedicated AI machine with 36 Intel-Habana Gaudi nodes. We recently switched to driver version 1.15.1 and the SynapseAI 1.15.1 image; this problem only started with the switch to 1.15.1. I had notified the Intel-Habana team, and they told me to raise the issue with Huggingface-Habana on GitHub.

libinta commented 3 months ago

Thanks for the information. Were you running the same optimum-habana version before and after the Synapse SW upgrade?

mgujral commented 3 months ago

No, I updated to the version recommended for SynapseAI 1.15.1. If you scroll to the top, you can see the versions.

libinta commented 3 months ago

Which optimum-habana version were you using before the 1.15.1 update, and which Synapse SW version did you have before? For debugging purposes, can you switch back to the old optimum-habana version that worked with your previous Synapse version and try?

mgujral commented 3 months ago

Earlier I was using SynapseAI 1.13.0 with optimum-habana==1.9.0 and optimum==1.14.1. Applications were working fine.

mgujral commented 3 months ago

I went back to the SynapseAI 1.13.0 image with the appropriate PYTHONPATH, and things worked fine, as they should.

python run_esmfold.py
/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py:252: UserWarning: Device capability of hccl unspecified, assuming cpu and cuda. Please specify it via the devices argument of register_backend.
  warnings.warn(
Some weights of EsmForProteinFolding were not initialized from the model checkpoint at facebook/esmfold_v1 and are newly initialized: ['esm.contact_head.regression.bias', 'esm.contact_head.regression.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
============================= HABANA PT BRIDGE CONFIGURATION ===========================
 PT_HPU_LAZY_MODE = 1
 PT_RECIPE_CACHE_PATH =
 PT_CACHE_FOLDER_DELETE = 0
 PT_HPU_RECIPE_CACHE_CONFIG =
 PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
 PT_HPU_LAZY_ACC_PAR_MODE = 1
 PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
---------------------------: System Configuration :---------------------------
Num CPU Cores : 96
CPU RAM       : 527939884 KB

ESMFOLD: input shape = torch.Size([1, 350])
ESMFOLD: step 0 start ...
ESMFOLD: step 0 duration: 70.469 seconds
ESMFOLD: step 1 start ...
ESMFOLD: step 1 duration: 36.434 seconds
ESMFOLD: step 2 start ...
ESMFOLD: step 2 duration: 14.153 seconds
ESMFOLD: step 3 start ...
ESMFOLD: step 3 duration: 14.138 seconds
pdb file saved in save-hpu.pdb

mgujral commented 3 months ago

I used only one HPU card.

libinta commented 3 months ago

Can you go back to 1.9.0 and 1.14.1 and check with Synapse 1.15.1?

mgujral commented 3 months ago

This is what I get.

/usr/bin/python run_esmfold.py
Traceback (most recent call last):
  File "/home/mgujral/huggingface-habana/optimum-habana-1.11.1/examples/protein-folding/run_esmfold_original.py", line 20, in <module>
    import habana_frameworks.torch.core as htcore
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/__init__.py", line 17, in <module>
    import torch
  File "/home/mgujral/deepspeedForImage_1.3.0/torch/__init__.py", line 519, in <module>
    raise ImportError(textwrap.dedent('''
ImportError: Failed to load PyTorch C extensions:
    It appears that PyTorch has loaded the torch/_C folder
    of the PyTorch repository rather than the C extensions which
    are expected in the torch._C namespace. This can occur when
    using the install workflow. e.g.
        $ python setup.py install && python -c "import torch"

    This error can generally be solved using the `develop` workflow
        $ python setup.py develop && python -c "import torch"  # This should succeed
    or by running Python from a different directory.
libinta commented 3 months ago

I didn't see any issue with optimum-habana 1.9.0. The way I did it was to git checkout v1.9.0, then inside the Docker container go to optimum-habana and run "pip install -e .", then go to examples/protein-folding/ and run "python run_esmfold.py". I can't reproduce your issue on my setup with the new v1.11.1 either. Can you run any other models, e.g. in a different example folder?
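Spelling those steps out as commands (paths assume the Gaudi Docker image and the repository linked at the top of this issue):

    git clone https://github.com/huggingface/optimum-habana
    cd optimum-habana
    git checkout v1.9.0            # the version under test; e.g. v1.11.1 for the new release
    pip install -e .
    cd examples/protein-folding
    python run_esmfold.py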

mgujral commented 3 months ago

Thank you for all your help; I shall take up the matter with the Intel-Habana team.

mgujral commented 2 months ago

The problem has been resolved. The line 'hugepages-2Mi: 95000Mi' was missing from my yaml file, and the error messages I was getting were misleading.
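For anyone hitting the same 'hl: 7' / memcopyIoctlFailed crash on a Kubernetes-scheduled cluster such as Voyager, a rough sketch of where that line belongs in a pod spec follows. Only the hugepages-2Mi quantity comes from this thread; every other field name and value is an illustrative assumption to adjust to your own nodes.

    # Hypothetical pod-spec fragment; only hugepages-2Mi is from the report above.
    resources:
      requests:
        memory: 100Gi            # Kubernetes requires a memory request alongside hugepages
        hugepages-2Mi: 95000Mi
      limits:
        memory: 100Gi
        hugepages-2Mi: 95000Mi   # hugepages requests and limits must be equal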