Closed dacorvo closed 2 months ago
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
@JingyaHuang the neuronx SD cache test is failing.
@michaelbenayoun some trainium tests are failing. It may be related to changes in the way neuronx-distributed loads weights (safetensors related error messages).
@michaelbenayoun there is a newly failing distributed test with AWS 2.18, probably after your latest changes.
It's weird that it fails. The error is:
2024-04-04T16:56:03.5030743Z self = <tests.distributed.test_model_parallelization.TestModelParallelization object at 0x7f8211855c40>
2024-04-04T16:56:03.5031331Z
2024-04-04T16:56:03.5031567Z def _terminate_xrt_server(self):
2024-04-04T16:56:03.5032077Z xrt_server_str = "torch_neuronx.distributed._xrt_run_server"
2024-04-04T16:56:03.5032964Z startmethod = mp.get_start_method(allow_none=True)
2024-04-04T16:56:03.5033499Z # Rules:
2024-04-04T16:56:03.5034194Z # - `startmethod is None`: the XRT server tracks pytest's PID.
2024-04-04T16:56:03.5035254Z # - `startmethod="spawn"`: the parent process of the pool's processes is pytest, so the XRT server tracks
2024-04-04T16:56:03.5036140Z # pytest's PID.
2024-04-04T16:56:03.5036736Z # - `startmethod="fork"`: same as `startmethod="spawn"`.
2024-04-04T16:56:03.5037740Z # - `startmethod="forkserver"`: the parent process of the pool's processes is the forkserver, so the XRT server tracks
2024-04-04T16:56:03.5038751Z # the forkserver's PID.
2024-04-04T16:56:03.5039261Z if startmethod == "forkserver":
2024-04-04T16:56:03.5039889Z target_pid = multiprocessing.forkserver._forkserver._forkserver_pid
2024-04-04T16:56:03.5040480Z else:
2024-04-04T16:56:03.5040812Z target_pid = os.getpid()
2024-04-04T16:56:03.5041254Z
2024-04-04T16:56:03.5041654Z for p in psutil.process_iter():
2024-04-04T16:56:03.5042067Z try:
2024-04-04T16:56:03.5042558Z if "python3" in p.name() and len(p.cmdline()) == 7:
2024-04-04T16:56:03.5043330Z cmdline = p.cmdline()
2024-04-04T16:56:03.5044098Z > if cmdline[2] == xrt_server_str and cmdline[-1] == str(target_pid):
2024-04-04T16:56:03.5044865Z E IndexError: list index out of range
2024-04-04T16:56:03.5045282Z
It might be linked to torch_neuronx
not relying on XRT
now. But why did it fail after so many tests. Let me check with the Annapurna team.
No regression found but the SDXL separate weights: let's merge this !
What does this PR do?
This pull-request bumps the AWS Neuron SDK version to 2.18.
It also bumps the TGI router version to 1.4.4 to fix build issues due to underlying rust packages updates.
To update your local host, do the following: