huggingface / optimum-neuron

Easy, fast and very cheap training and inference on AWS Trainium and Inferentia chips.
Apache License 2.0

Use AWS Neuron SDK 2.18 #547

Closed dacorvo closed 2 months ago

dacorvo commented 3 months ago

What does this PR do?

This pull request bumps the AWS Neuron SDK version to 2.18.

It also bumps the TGI router version to 1.4.4 to fix build issues caused by updates to the underlying Rust packages.

To update your local host, do the following:

$ sudo apt update
$ sudo apt install -u aws-neuronx-dkms aws-neuronx-runtime-lib aws-neuronx-collectives aws-neuronx-tools
$ pip install -U neuronx-cc torch-neuronx==1.13.* transformers-neuronx neuronx-distributed
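After upgrading, a quick way to confirm which versions of the pip packages actually ended up installed is to query the package metadata. A minimal sketch (the package names come from the pip command above; `pkg_version` is a hypothetical helper, and anything not installed simply reports as such):

```python
from importlib import metadata


def pkg_version(name: str) -> str:
    """Return the installed version of a pip package, or 'not installed'."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


# Packages updated by the pip command above:
for pkg in ("neuronx-cc", "torch-neuronx", "transformers-neuronx", "neuronx-distributed"):
    print(f"{pkg}: {pkg_version(pkg)}")
```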
HuggingFaceDocBuilderDev commented 3 months ago

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

dacorvo commented 3 months ago

@JingyaHuang the neuronx SD cache test is failing.

dacorvo commented 3 months ago

@michaelbenayoun some trainium tests are failing. It may be related to changes in the way neuronx-distributed loads weights (safetensors related error messages).

dacorvo commented 2 months ago

@michaelbenayoun there is a newly failing distributed test with AWS 2.18, probably after your latest changes.

michaelbenayoun commented 2 months ago

It's weird that it fails. The error is:

self = <tests.distributed.test_model_parallelization.TestModelParallelization object at 0x7f8211855c40>

    def _terminate_xrt_server(self):
        xrt_server_str = "torch_neuronx.distributed._xrt_run_server"
        startmethod = mp.get_start_method(allow_none=True)
        # Rules:
        # - `startmethod is None`: the XRT server tracks pytest's PID.
        # - `startmethod="spawn"`: the parent process of the pool's processes is pytest, so the XRT server tracks
        # pytest's PID.
        # - `startmethod="fork"`: same as `startmethod="spawn"`.
        # - `startmethod="forkserver"`: the parent process of the pool's processes is the forkserver, so the XRT server tracks
        # the forkserver's PID.
        if startmethod == "forkserver":
            target_pid = multiprocessing.forkserver._forkserver._forkserver_pid
        else:
            target_pid = os.getpid()

        for p in psutil.process_iter():
            try:
                if "python3" in p.name() and len(p.cmdline()) == 7:
                    cmdline = p.cmdline()
>                   if cmdline[2] == xrt_server_str and cmdline[-1] == str(target_pid):
E                   IndexError: list index out of range


It might be linked to torch_neuronx no longer relying on XRT. But why did it only fail after so many tests? Let me check with the Annapurna team.
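One observation about the traceback: `p.cmdline()` is called twice, once in the length guard and once for the indexing, and a process's command line can change (or the process can exit) between the two calls, which would explain the `IndexError`. A defensive sketch that snapshots the command line once (this is only an illustration of the race, not the upstream fix; `find_xrt_server_pids` is a hypothetical helper):

```python
import psutil

XRT_SERVER_STR = "torch_neuronx.distributed._xrt_run_server"


def find_xrt_server_pids(target_pid: int) -> list:
    """Locate XRT server processes tracking `target_pid`.

    Defensive variant of the failing loop: the process name and command
    line are read exactly once per process, so the length check and the
    indexing below always see the same list.
    """
    pids = []
    for p in psutil.process_iter():
        try:
            name = p.name()
            cmdline = p.cmdline()  # snapshot once; a second call may differ
        except (psutil.NoSuchProcess, psutil.AccessDenied, psutil.ZombieProcess):
            continue  # process vanished or is unreadable: skip it
        if (
            "python3" in name
            and len(cmdline) == 7
            and cmdline[2] == XRT_SERVER_STR
            and cmdline[-1] == str(target_pid)
        ):
            pids.append(p.pid)
    return pids
```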

dacorvo commented 2 months ago

No regression found except for the SDXL separate weights: let's merge this!