Closed mgujral closed 2 months ago
@mgujral I cannot reproduce this. Are you running
python run_esmfold.py
in https://github.com/huggingface/optimum-habana/tree/main/examples/protein-folding?
This is the easiest example to run. I do not know what is going on.
I am using Gaudi HPU nodes.
@mgujral Can you collect logs by setting HABANA_PROFILE=./log LOG_LEVEL_ALL=0 before your command, and attach the log folder? Thanks.
These are the logs collected before launching the command. Appreciate your help!
I looked at the file 'dfa_log.txt', and it had the following message at the top. Any thoughts on what it means?
hl thunk API failed, errCode memcopyIoctlFailed
@mgujral it is related to host memory allocation. Can you watch the host memory with top and capture the screen from start to crash? Also, what is the output of grep Huge /proc/meminfo? You can also change low_cpu_mem_usage from False to True at https://github.com/huggingface/optimum-habana/blob/main/examples/protein-folding/run_esmfold.py#L85 and retry. Another option is to request 8x instead of 2x to see if you can pass the test. Thanks, Libin
When I try with 8 HPU cards, the warning and error messages are the same as before. Altering the line run_esmfold.py#L85 does not make any difference either. When I check memory usage with the 'top' command, max memory usage goes to about 4.5%. I notice one more thing: when I try any application in the language-modeling folder, I get the identical 'Synapse detected a device critical error......' error message.
grep Huge /proc/meminfo
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:   48000
HugePages_Free:    48000
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:        98304000 kB
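As a quick sanity check on these numbers (a hypothetical sketch, not part of the repro): HugePages_Total times Hugepagesize should match the Hugetlb total the kernel reports, which it does here, so the huge-page pool itself looks consistent.

```python
# Sketch: the hugepage values copied from the grep output above,
# checked for self-consistency.
meminfo = {
    "HugePages_Total": 48000,   # number of reserved huge pages
    "HugePages_Free": 48000,    # all free, so nothing is currently pinned
    "Hugepagesize_kB": 2048,    # 2 MiB pages
    "Hugetlb_kB": 98304000,     # total huge-page memory reported by the kernel
}

total_kb = meminfo["HugePages_Total"] * meminfo["Hugepagesize_kB"]
assert total_kb == meminfo["Hugetlb_kB"]  # 48000 * 2048 kB = 98304000 kB
print(f"{total_kb / 1024 / 1024:.2f} GiB of huge pages configured")  # 93.75 GiB
```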
When you said you tried with 8 HPU cards, you were still using python run_esmfold.py, not gaudi_spawn with 8x, right? Can you also run sudo dmesg -T > dmesg.txt? Thanks.
Yes, with 8 HPU cards, I am still using python run_esmfold.py. Thank you for your patience and time. dmesg.txt
It seems that, except for the pin-memory issue, nothing is printed out. It looks like the machine is already in this bad state. Are you on AWS? And is the issue only happening on this pod? Thanks.
I am on the Voyager cluster at UCSD. It is a dedicated AI machine with 36 Intel Habana Gaudi nodes. We have recently switched to driver version 1.15.1 and SynapseAI image 1.15.1. This problem only started with the switch to 1.15.1. I notified the Intel Habana team, and they told me to raise the issue with Hugging Face Habana on GitHub.
Thanks for the information. Were you running the same optimum-habana version before and after the Synapse SW upgrade?
No, I updated to the version recommended for SynapseAI 1.15.1. If you scroll to the top, you can see the versions.
Which optimum-habana version were you using before the 1.15.1 update, and which Synapse SW version did you have before? For debugging purposes, can you switch back to the old optimum-habana version that worked with your previous Synapse version and try again?
Earlier I was using SynapseAI 1.13.0 with optimum-habana==1.9.0 and optimum==1.14.1. Applications were working fine.
I went back to the SynapseAI 1.13.0 image, used the appropriate PYTHONPATH, and things worked fine as they should.
`cpu` and `cuda`. Please specify it via the `devices` argument of `register_backend`.
warnings.warn(
Some weights of EsmForProteinFolding were not initialized from the model checkpoint at facebook/esmfold_v1 and are newly initialized: ['esm.contact_head.regression.bias', 'esm.contact_head.regression.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
============================= HABANA PT BRIDGE CONFIGURATION ===========================
PT_HPU_LAZY_MODE = 1
PT_RECIPE_CACHE_PATH =
PT_CACHE_FOLDER_DELETE = 0
PT_HPU_RECIPE_CACHE_CONFIG =
PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
PT_HPU_LAZY_ACC_PAR_MODE = 1
PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
---------------------------: System Configuration :---------------------------
Num CPU Cores : 96
CPU RAM : 527939884 KB
ESMFOLD: input shape = torch.Size([1, 350])
ESMFOLD: step 0 start ...
ESMFOLD: step 0 duration: 70.469 seconds
ESMFOLD: step 1 start ...
ESMFOLD: step 1 duration: 36.434 seconds
ESMFOLD: step 2 start ...
ESMFOLD: step 2 duration: 14.153 seconds
ESMFOLD: step 3 start ...
ESMFOLD: step 3 duration: 14.138 seconds
pdb file saved in save-hpu.pdb
I used only 1-HPU card.
Can you go back to optimum-habana 1.9.0 and optimum 1.14.1 to check with Synapse 1.15.1?
This is what I get.
/usr/bin/python run_esmfold.py
Traceback (most recent call last):
  File "/home/mgujral/huggingface-habana/optimum-habana-1.11.1/examples/protein-folding/run_esmfold_original.py", line 20, in <module>
    import torch
ImportError: Failed to load PyTorch C extensions:
    It appears that PyTorch has loaded the `torch/_C` folder
    of the PyTorch repository rather than the C extensions which
    are expected in the `torch._C` namespace. This can occur when
    using the `install` workflow. e.g.
        $ python setup.py install && python -c "import torch"

    This error can generally be solved using the `develop` workflow
        $ python setup.py develop && python -c "import torch"  # This should succeed
    or by running Python from a different directory.
I didn't see any issue with optimum-habana 1.9.0. What I did was: git checkout v1.9.0, then inside the Docker container go to optimum-habana and run "pip install -e .", then go to examples/protein-folding/ and run "python run_esmfold.py". I also can't reproduce your issue on my setup with the new v1.11.1. Can you run any other models, for example from a different example folder?
Thank you for all your help. I shall take up the matter with the Intel Habana team.
The problem has been resolved. The line 'hugepages-2Mi: 95000Mi' was missing from my YAML file, and the error messages I was getting were misleading.
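For anyone hitting the same symptom on a Kubernetes-managed cluster like Voyager, the fix corresponds to requesting 2 MiB huge pages in the pod spec. A minimal sketch (the HPU count and memory values are illustrative; only the hugepages-2Mi line is the actual fix from this thread):

```yaml
# Hypothetical pod-spec fragment; only hugepages-2Mi reflects the actual fix.
resources:
  limits:
    habana.ai/gaudi: 1        # illustrative HPU count
    memory: 200Gi             # illustrative
    hugepages-2Mi: 95000Mi    # the missing line that caused the crash
  requests:
    habana.ai/gaudi: 1
    memory: 200Gi
    hugepages-2Mi: 95000Mi
```

Without this, the pod sees none of the host's huge pages, and the driver's pinned-memory allocation fails with the misleading 'memcopyIoctlFailed' error seen above.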
System Info
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)

Reproduction
python3 run_esmfold.py
It does not have any requirements.
Expected behavior
The run should generate a protein structure file with the .pdb extension.