Closed: spachava753 closed this issue 7 months ago
Neuron tools and python libraries are installed fine:
$ neuron-ls
instance-type: trn1.2xlarge
instance-id: i---------------
+--------+--------+--------+---------+
| NEURON | NEURON | NEURON | PCI |
| DEVICE | CORES | MEMORY | BDF |
+--------+--------+--------+---------+
| 0 | 2 | 32 GB | 00:1e.0 |
+--------+--------+--------+---------+
$ python -c 'import torch_neuronx;import transformers;import datasets;import accelerate;import evaluate;import tensorboard;'
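For anyone hitting import errors instead, looping over the packages one at a time narrows down which dependency is broken. A minimal sketch mirroring the one-liner above (swap `python` for `python3` if that is the interpreter name on your AMI):

```shell
# Try each package from the one-liner individually, so a single broken
# dependency is reported by name instead of failing the whole command.
for pkg in torch_neuronx transformers datasets accelerate evaluate tensorboard; do
  python -c "import $pkg" 2>/dev/null \
    && echo "ok: $pkg" \
    || echo "FAIL: $pkg"
done
```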
I'm experiencing the same issue. Still debugging
pinging @philschmid on this one.
Hey @spachava753,
Thank you for opening the issue; we are going to look at it. The easiest fix for now is to run
sudo rm -rf /etc/apt/sources.list
sudo apt update -y
which should fix it.
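For anyone wary of deleting a system file outright, a minimal sketch of the same fix with a backup step first, so the broken file can still be inspected afterwards. `TARGET` defaults to a scratch copy purely for illustration; on the actual instance it would be /etc/apt/sources.list and the commands would need sudo:

```shell
# Back up the broken sources.list before removing it, then remove it
# (apt can then fall back to the entries under /etc/apt/sources.list.d/).
TARGET="${TARGET:-/tmp/sources.list.demo}"
printf 'deb http://archive.ubuntu.com/ubuntu jammy main\n' > "$TARGET"  # stand-in content for the demo
cp "$TARGET" "${TARGET}.bak"   # keep a copy for later debugging
rm -f "$TARGET"                # the removal suggested above
# then: sudo apt update -y
```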
@philschmid Thanks for your suggestion! Is there a repo anywhere to understand how the huggingface DLAMI is built?
@philschmid +1 to @spachava753's comment. Also, it seems we immediately need to deal with dependency issues on the latest AMI (even with your suggestion). Huggingface should either maintain backwards-compatible AMIs, open-source the build repos, or provide an alternative solution. I fully believe in the Huggingface deep learning AMI over an AWS AMI that prioritizes its internal stack vs. what's best for users.
Also, it seems we immediately need to deal with dependency issues on the latest AMI (even with your suggestion)
Can you explain what error you are seeing?
I have created an instance of trn1.2xlarge in AWS using the HuggingFace DLAMI. Then I tried running
sudo apt update -y
, which failed with the following error:
Looking inside the file, I see this:
Commenting out line 11 does not help, as then line 16 becomes a problem:
which leads me to believe that the file was not created correctly and is actually a template file with unfilled values.
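If the unfilled-template hypothesis is right, leftover placeholders should be easy to spot mechanically. A minimal sketch, assuming `{{NAME}}` or `${NAME}` style placeholders (the actual templating syntax in the AMI's file is a guess; `FILE` points at a stand-in sample here, not the real /etc/apt/sources.list):

```shell
# Scan a sources.list-style file for unfilled templating placeholders.
FILE="${FILE:-/tmp/sources.list.sample}"
# Stand-in file mimicking a template with unfilled values:
printf 'deb http://{{MIRROR}}/ubuntu {{CODENAME}} main\n' > "$FILE"
if grep -Eq '\{\{[A-Za-z_]+\}\}|\$\{[A-Za-z_]+\}' "$FILE"; then
  echo "unfilled template placeholders found in $FILE"
fi
```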