huggingface / optimum-neuron

Easy, fast and very cheap training and inference on AWS Trainium and Inferentia chips.
Apache License 2.0

Hugging Face DLAMI apt sources is malformed #270

Closed: spachava753 closed this issue 7 months ago

spachava753 commented 1 year ago

I created a trn1.2xlarge instance on AWS using the Hugging Face DLAMI. Running sudo apt update -y on it then failed with the following error:

$ sudo apt-get update -y
E: Malformed entry 11 in list file /etc/apt/sources.list (URI parse)
E: The list of sources could not be read.

Looking inside the file, I see this:

## Note, this file is written by cloud-init on first boot of an instance
## modifications made here will not survive a re-bundle.
## if you wish to make changes you can:
## a.) add 'apt_preserve_sources_list: true' to /etc/cloud/cloud.cfg
##     or do the same in user-data
## b.) add sources in /etc/apt/sources.list.d
## c.) make changes to template file /etc/cloud/templates/sources.list.tmpl

# See http://help.ubuntu.com/community/UpgradeNotes for how to upgrade to
# newer versions of the distribution.
deb {{mirror}} {{codename}} main restricted
# deb-src {{mirror}} {{codename}} main restricted

## Major bug fix updates produced after the final release of the
## distribution.
deb {{mirror}} {{codename}}-updates main restricted
# deb-src {{mirror}} {{codename}}-updates main restricted

## N.B. software from this repository is ENTIRELY UNSUPPORTED by the Ubuntu
## team. Also, please note that software in universe WILL NOT receive any
## review or updates from the Ubuntu security team.
deb {{mirror}} {{codename}} universe
# deb-src {{mirror}} {{codename}} universe
deb {{mirror}} {{codename}}-updates universe
# deb-src {{mirror}} {{codename}}-updates universe

## N.B. software from this repository is ENTIRELY UNSUPPORTED by the Ubuntu
## team, and may not be under a free licence. Please satisfy yourself as to
## your rights to use the software. Also, please note that software in
## multiverse WILL NOT receive any review or updates from the Ubuntu
## security team.
deb {{mirror}} {{codename}} multiverse
# deb-src {{mirror}} {{codename}} multiverse
deb {{mirror}} {{codename}}-updates multiverse
# deb-src {{mirror}} {{codename}}-updates multiverse

## N.B. software from this repository may not have been tested as
## extensively as that contained in the main release, although it includes
## newer versions of some applications which may provide useful features.
## Also, please note that software in backports WILL NOT receive any review
## or updates from the Ubuntu security team.
deb {{mirror}} {{codename}}-backports main restricted universe multiverse
# deb-src {{mirror}} {{codename}}-backports main restricted universe multiverse

## Uncomment the following two lines to add software from Canonical's
## 'partner' repository.
## This software is not part of Ubuntu, but is offered by Canonical and the
## respective vendors as a service to Ubuntu users.
# deb http://archive.canonical.com/ubuntu {{codename}} partner
# deb-src http://archive.canonical.com/ubuntu {{codename}} partner

deb {{security}} {{codename}}-security main restricted
# deb-src {{security}} {{codename}}-security main restricted
deb {{security}} {{codename}}-security universe
# deb-src {{security}} {{codename}}-security universe
deb {{security}} {{codename}}-security multiverse
# deb-src {{security}} {{codename}}-security multiverse

Commenting out line 11 does not help, because line 16 then triggers the same error:

$ sudo apt-get update -y
E: Malformed entry 16 in list file /etc/apt/sources.list (URI parse)
E: The list of sources could not be read.

This leads me to believe that the file was not generated correctly and is actually a template file whose placeholders were never filled in.
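One way to check that hypothesis is to scan the file for active (uncommented) entries that still contain `{{...}}` placeholders. The following is a minimal sketch, not part of the original report; the function name `find_unrendered_entries` and the placeholder pattern are assumptions made for illustration.

```python
import re

# Matches an unrendered template placeholder such as {{mirror}} or {{codename}}.
# The exact pattern is an assumption based on the file contents shown above.
PLACEHOLDER = re.compile(r"\{\{\s*\w+\s*\}\}")


def find_unrendered_entries(text):
    """Return (line_number, line) pairs for active entries that still contain placeholders."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        # Skip blank lines and comment lines; only active entries matter here.
        if not stripped or stripped.startswith("#"):
            continue
        if PLACEHOLDER.search(stripped):
            hits.append((number, stripped))
    return hits
```

Calling `find_unrendered_entries(open("/etc/apt/sources.list").read())` on the affected instance should list every active entry that still carries a placeholder, which would explain why commenting out one entry only moves the error to the next one.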

spachava753 commented 1 year ago

The Neuron tools and Python libraries are installed correctly:

$ neuron-ls
instance-type: trn1.2xlarge
instance-id: i---------------
+--------+--------+--------+---------+
| NEURON | NEURON | NEURON |   PCI   |
| DEVICE | CORES  | MEMORY |   BDF   |
+--------+--------+--------+---------+
| 0      | 2      | 32 GB  | 00:1e.0 |
+--------+--------+--------+---------+
$ python -c 'import torch_neuronx;import transformers;import datasets;import accelerate;import evaluate;import tensorboard;'

zachsmith1 commented 1 year ago

I'm experiencing the same issue and am still debugging it.

michaelbenayoun commented 1 year ago

pinging @philschmid on this one.

philschmid commented 1 year ago

Hey @spachava753,

Thank you for opening the issue. We are going to look into it. The easiest workaround for now is to run

sudo rm -rf /etc/apt/sources.list
sudo apt update -y

which should fix it.
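If you would rather keep the malformed file around for later inspection, a variant of the same workaround is to move it aside instead of deleting it. This is a minimal sketch under that assumption; the helper name `move_aside` and the `.bak` suffix are hypothetical, and the script would need to run as root to touch /etc/apt/sources.list. As far as I know, apt only reads /etc/apt/sources.list and the files under /etc/apt/sources.list.d, so a renamed copy next to it should be ignored.

```python
import shutil
from pathlib import Path


def move_aside(path, suffix=".bak"):
    """Rename `path` to `path + suffix` and return the backup path, or None if `path` is absent."""
    source = Path(path)
    if not source.exists():
        return None
    # Keep the backup next to the original, e.g. sources.list -> sources.list.bak.
    backup = source.with_name(source.name + suffix)
    shutil.move(str(source), str(backup))
    return backup
```

After `move_aside("/etc/apt/sources.list")`, running sudo apt update -y should behave the same as after the rm command above, with the original template still available as sources.list.bak.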

spachava753 commented 1 year ago

@philschmid Thanks for your suggestion! Is there a repo anywhere that shows how the Hugging Face DLAMI is built?

zachsmith1 commented 1 year ago

@philschmid +1 to @spachava753's comment. Also, it seems we immediately need to deal with dependency issues on the latest AMI (even with your suggestion). Hugging Face should either maintain backwards-compatible AMIs, open source the build repos, or provide an alternative solution. I fully believe in the Hugging Face deep learning AMI over an AWS AMI that prioritizes its internal stack over what's best for users.

philschmid commented 1 year ago

Also, it seems we immediately need to deal with dependency issues on the latest AMI (even with your suggestion)

Can you explain what error you are seeing?