nod-ai / shark-ai

SHARK Inference Modeling and Serving
Apache License 2.0

Shark V1 Nov 2024 Release Testing Bash #512

Open pdhirajkumarprasad opened 3 days ago

pdhirajkumarprasad commented 3 days ago

Please log in to an MI300X machine. For the AMD Shark team, see the internal Slack channel for available machines.

Feel free to test it however you like, but here are some guidelines you could follow.

Testing guidelines

Multiple people may try the same feature, so whoever is trying a particular feature, please put your name under the "Testers" column in the tables below.

`shortfin_apps.sd.server` with different options:

| Flags | Options | Testers | Issues |
| --- | --- | --- | --- |
| `--host HOST` | | | |
| `--port PORT` | | | |
| `--root-path ROOT_PATH` | | | |
| `--timeout-keep-alive` | | | |
| `--device` | local-task, hip, amdgpu | | |
| `--target` | gfx942, gfx1100 | | https://github.com/nod-ai/SHARK-Platform/issues/515 |
| `--device_ids` | | | |
| `--tokenizers` | | | |
| `--model_config` | | | |
| `--workers_per_device` | | | |
| `--fibers_per_device` | | | |
| `--isolation` | per_fiber, per_call, none | | |
| `--show_progress` | | | |
| `--trace_execution` | | | |
| `--amdgpu_async_allocations` | | | |
| `--splat` | | | |
| `--build_preference` | compile, precompiled | | |
| `--compile_flags` | | | |
| `--flagfile FLAGFILE` | | | https://github.com/nod-ai/SHARK-Platform/issues/515 |
| `--artifacts_dir ARTIFACTS_DIR` | | | |
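
For reference, a server launch combining several of these flags might look like the sketch below. The flag names come from the table above; the specific values (host, port, device, target, isolation, build preference) are illustrative assumptions, not a prescribed configuration:

```
python -m shortfin_apps.sd.server \
  --host 0.0.0.0 \
  --port 8000 \
  --device hip \
  --target gfx942 \
  --isolation per_fiber \
  --build_preference precompiled
```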

`shortfin_apps.sd.simple_client` with different options:

| Flags | Testers | Issues |
| --- | --- | --- |
| `--file` | | |
| `--reps` | | |
| `--save` | | |
| `--outputdir` | | |
| `--steps` | | |
| `--interactive` | | |
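
Similarly, a sample client run exercising a few of these flags might look like the following; the values are illustrative assumptions:

```
python -m shortfin_apps.sd.simple_client \
  --reps 3 \
  --steps 20 \
  --save \
  --outputdir ./generated_images
```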

Other issues

| Issue description | Issue no |
| --- | --- |

dan-garvey commented 3 days ago

Not a critique, just something I noticed: server startup takes about 12 minutes on a Cirrascale 8x MI300 machine.

IanNod commented 3 days ago

I had the same server startup time. I attributed it to downloading models/weights during setup.

Minor critique: it does not look like we are changing the random latents that get generated. Not sure where that is controlled, but I was seeing the same image generated for the same prompt.

dan-garvey commented 3 days ago

Yeah, as Ian said, the seed appears to be fixed; I think when reps > 1 it should be changed.

Maybe this works?


```python
async for i in async_range(args.reps):
    data["seed"] = [i]  # vary the seed per repetition
    pending.append(
        asyncio.create_task(send_request(session, i, args, data))
    )
    await asyncio.sleep(1)  # Wait 1 second before sending the next request
```
dan-garvey commented 3 days ago

well at least in the args.reps>1 case
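
If the goal is different images across client runs too (not just across reps within one run), a variant that draws a fresh random seed could also work. This is only a sketch reusing the assumed `async_range`, `send_request`, `session`, `args`, and `data` names from the snippet above, not the actual client code:

```python
import asyncio
import random

# Sketch only: async_range, send_request, session, args, data, and pending
# are the names assumed from the snippet above; they are not defined here.
async for i in async_range(args.reps):
    data["seed"] = [random.randint(0, 2**31 - 1)]  # fresh seed per repetition
    pending.append(asyncio.create_task(send_request(session, i, args, data)))
```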

archana-ramalingam commented 3 days ago

At cold start, an incomplete model download causes the following error. Deleting the cached models and re-downloading them fixed it.

```
INFO:root:Loading parameter fiber 'model' from: /home/aramalin/.cache/shark/genfiles/sdxl/stable_diffusion_xl_base_1_0_punet_dataset_i8.irpa
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/aramalin/SHARK-Platform/3.12.venv/lib/python3.12/site-packages/shortfin_apps/sd/server.py", line 388, in <module>
    main(
  File "/home/aramalin/SHARK-Platform/3.12.venv/lib/python3.12/site-packages/shortfin_apps/sd/server.py", line 376, in main
    sysman = configure(args)
             ^^^^^^^^^^^^^^^
  File "/home/aramalin/SHARK-Platform/3.12.venv/lib/python3.12/site-packages/shortfin_apps/sd/server.py", line 115, in configure
    sm.load_inference_parameters(*datasets, parameter_scope="model", component=key)
  File "/home/aramalin/SHARK-Platform/3.12.venv/lib/python3.12/site-packages/shortfin_apps/sd/components/service.py", line 116, in load_inference_parameters
    p.load(path, format=format)
ValueError: shortfin_iree-src/runtime/src/iree/io/formats/irpa/irpa_parser.c:16: OUT_OF_RANGE; file segment out of range (1766080 to 2614665369 for 2612899290, file_size=726679552); verifying storage segment
```
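
If anyone else hits this, a quick way to force a clean re-download, assuming the cache lives where the log above shows (`~/.cache/shark/genfiles/sdxl`), is to remove that directory and restart the server:

```
rm -rf ~/.cache/shark/genfiles/sdxl
```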

pdhirajkumarprasad commented 3 days ago

I have tried almost all the flags and various client/server combinations, and added my observations here: https://github.com/pdhirajkumarprasad/for_sharing_logs/blob/main/Shark-V1(Nov,%202024)-Bash.md