microsoft / DeepSpeed-MII

MII makes low-latency and high-throughput inference possible, powered by DeepSpeed.
Apache License 2.0

TXT2IMAGE - RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling `cublasGemmStridedBatchedExFix` #110

Closed eran-sefirot closed 1 year ago

eran-sefirot commented 1 year ago

Hello! Thanks for this great optimization. We're using a fresh EC2 g5.xlarge instance.

After installing everything and running `python baseline-sd.py`, I see the following error:

    attn_weights = torch.bmm(query_states, key_states.transpose(1, 2))
RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling `cublasGemmStridedBatchedExFix( handle, opa, opb, m, n, k, (void*)(&falpha), a, CUDA_R_16F, lda, stridea, b, CUDA_R_16F, ldb, strideb, (void*)(&fbeta), c, CUDA_R_16F, ldc, stridec, num_batches, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
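For context, baseline-sd.py is presumably the plain-diffusers baseline from examples/benchmark/txt2img. A minimal sketch of that kind of run, assuming the model id, fp16 dtype, and prompt (none of which are copied from the actual script):

```python
# Minimal sketch of a plain-diffusers fp16 baseline (assumed setup; not the exact
# contents of examples/benchmark/txt2img/baseline-sd.py).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",   # model id taken from the mii-sd.py launch command below
    torch_dtype=torch.float16,         # fp16 is where the half-precision cuBLAS GEMM is exercised
).to("cuda")

# The failing torch.bmm(query_states, key_states.transpose(1, 2)) runs inside the
# attention layers during this call.
images = pipe(["a photo of an astronaut riding a horse"]).images
images[0].save("out.png")
```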

I've installed the environment using: pip install deepspeed[sd] deepspeed-mii

When running `ds_report` I see the following output:


DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja .................. [OKAY]

op name ................ installed .. compatible

cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
spatial_inference ...... [NO] ....... [OKAY]

DeepSpeed general environment info:
torch install path ............... ['/opt/conda/lib/python3.9/site-packages/torch']
torch version .................... 1.13.0+cu117
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.5
deepspeed install path ........... ['/opt/conda/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.7.5, unknown, unknown
deepspeed wheel compiled w. ...... torch 0.0, cuda 0.0

eran-sefirot commented 1 year ago

When running `python mii-sd.py`:

a_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'transformer_inference'
[2022-11-27 11:35:16,846] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 6581
[2022-11-27 11:35:16,846] [ERROR] [launch.py:324:sigkill_handler] ['/opt/conda/bin/python', '-m', 'mii.launch.multi_gpu_server', '--task-name', 'text-to-image', '--model', 'CompVis/stable-diffusion-v1-4', '--model-path', '/tmp/mii_models', '--port', '50050', '--ds-optimize', '--provider', 'diffusers', '--config', 'eyJ0ZW5zb3JfcGFyYWxsZWwiOiAxLCAicG9ydF9udW1iZXIiOiA1MDA1MCwgImR0eXBlIjogImZwMTYiLCAiZW5hYmxlX2N1ZGFfZ3JhcGgiOiBmYWxzZSwgImNoZWNrcG9pbnRfZGljdCI6IG51bGwsICJkZXBsb3lfcmFuayI6IFswXSwgInRvcmNoX2Rpc3RfcG9ydCI6IDI5NTAwLCAiaGZfYXV0aF90b2tlbiI6ICJoZl9Xc0NwVWFFYVhMbGtEZEtLTkVtS2NxZk9vTHBjcWxXWHF5IiwgInJlcGxhY2Vfd2l0aF9rZXJuZWxfaW5qZWN0IjogdHJ1ZSwgInByb2ZpbGVfbW9kZWxfdGltZSI6IGZhbHNlLCAic2tpcF9tb2RlbF9jaGVjayI6IGZhbHNlfQ=='] exits with return code = 1
[2022-11-27 11:35:18,791] [INFO] [server_client.py:117:_wait_until_server_is_live] waiting for server to start...
Traceback (most recent call last):
  File "/home/ec2-user/DeepSpeed-MII/examples/benchmark/txt2img/mii-sd.py", line 15, in <module>
    mii.deploy(task='text-to-image',
  File "/opt/conda/lib/python3.9/site-packages/mii/deployment.py", line 114, in deploy
    return _deploy_local(deployment_name, model_path=model_path)
  File "/opt/conda/lib/python3.9/site-packages/mii/deployment.py", line 120, in _deploy_local
    mii.utils.import_score_file(deployment_name).init()
  File "/tmp/mii_cache/sd_deploy/score.py", line 29, in init
    model = mii.MIIServerClient(task,
  File "/opt/conda/lib/python3.9/site-packages/mii/server_client.py", line 92, in __init__
    self._wait_until_server_is_live()
  File "/opt/conda/lib/python3.9/site-packages/mii/server_client.py", line 115, in _wait_until_server_is_live
    raise RuntimeError("server crashed for some reason, unable to proceed")
RuntimeError: server crashed for some reason, unable to proceed
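For reference, the flow mii-sd.py exercises can be reconstructed from the traceback and the launch command above. A minimal sketch, assuming the legacy MII (0.x) API; the deployment name and query payload shape are assumptions, not copied from the script:

```python
# Sketch of the MII deploy/query flow reconstructed from the traceback above.
# deployment_name and the query payload are assumptions, not the actual script.
import mii

mii.deploy(
    task="text-to-image",
    model="CompVis/stable-diffusion-v1-4",   # model id from the launch command above
    deployment_name="sd_deploy",             # matches /tmp/mii_cache/sd_deploy in the traceback
    mii_config={"dtype": "fp16"},            # fp16, as in the base64 config passed to multi_gpu_server
)

# The crash above happens inside mii.deploy(), while the gRPC server process is
# JIT-building the transformer_inference extension; query() is only reached later.
pipe = mii.mii_query_handle("sd_deploy")
results = pipe.query({"query": ["a photo of an astronaut riding a horse"]})
```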

eran-sefirot commented 1 year ago

OK, I've installed the latest Deep Learning AMI with CUDA 11.7. Now I get the following when running `python mii-sd.py`:

    /opt/conda/envs/pytorch/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/apply_rotary_pos_emb.cu:8:10: fatal error: cuda_profiler_api.h: No such file or directory
     #include <cuda_profiler_api.h>
              ^~~~~~~~~~~~~~~~~~~~~
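cuda_profiler_api.h ships with the CUDA toolkit, so the JIT build is picking up a toolkit install that lacks it (the earlier ds_report already showed nvcc 11.5 against torch built for CUDA 11.7, so it's worth confirming which toolkit this AMI's build actually uses). A quick diagnostic sketch, assuming a standard toolkit directory layout:

```python
# Check which CUDA toolkit DeepSpeed's JIT build will use and whether the missing
# header is actually there (assumes a standard toolkit layout under CUDA_HOME).
import os
from torch.utils.cpp_extension import CUDA_HOME

header = os.path.join(CUDA_HOME or "", "include", "cuda_profiler_api.h")
print("CUDA_HOME:", CUDA_HOME)
print("cuda_profiler_api.h present:", os.path.isfile(header))
```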
eran-sefirot commented 1 year ago

I've switched to a different AMI with PyTorch 1.12 and CUDA 11.6, and now I get the following error:

Time to load spatial_inference op: 17.237044095993042 seconds
**** found and replaced unet w. <class 'deepspeed.model_implementations.diffusers.unet.DSUNet'>
About to start server
Started
[2022-11-27 13:35:10,519] [INFO] [server_client.py:117:_wait_until_server_is_live] waiting for server to start...
[2022-11-27 13:35:15,524] [INFO] [server_client.py:117:_wait_until_server_is_live] waiting for server to start...
[2022-11-27 13:35:15,524] [INFO] [server_client.py:118:_wait_until_server_is_live] server has started on 50050
Traceback (most recent call last):
  File "/home/ec2-user/DeepSpeed-MII/examples/benchmark/txt2img/mii-sd.py", line 23, in <module>
    results = pipe.query(prompts)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/mii/server_client.py", line 367, in query
    response = self.asyncio_loop.run_until_complete(
  File "/opt/conda/envs/pytorch/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/mii/server_client.py", line 263, in _query_in_tensor_parallel
    await responses[0]
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/mii/server_client.py", line 313, in _request_async_response
    response = await self.stubs[stub_id].Txt2ImgReply(req)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/grpc/aio/_call.py", line 290, in __await__
    raise _create_rpc_error(self._cython_call._initial_metadata,
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
    status = StatusCode.UNKNOWN
    details = "Exception calling application: 'DSUNet' object has no attribute 'config'"
    debug_error_string = "UNKNOWN:Error received from peer ipv6:%5B::1%5D:50050 {grpc_message:"Exception calling application: \'DSUNet\' object has no attribute \'config\'", grpc_status:2, created_time:"2022-11-27T13:35:15.530649601+00:00"}"
>
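As for the error itself: the gRPC message just says the diffusers pipeline is reading an attribute that the injected DSUNet wrapper doesn't expose. A minimal illustration of that failure mode (not DeepSpeed's actual DSUNet code, which may forward attributes differently):

```python
# Minimal illustration (not DeepSpeed's actual DSUNet) of why a wrapper that does
# not forward the wrapped module's attributes raises
# "'DSUNet' object has no attribute 'config'" when diffusers reads unet.config.
class WrappedUNet:
    def __init__(self, unet):
        self._unet = unet

    def __call__(self, *args, **kwargs):
        return self._unet(*args, **kwargs)

    # Without a fallback like this, any access to wrapper.config (the diffusers
    # pipeline reads unet.config during its call) raises AttributeError:
    def __getattr__(self, name):
        return getattr(self._unet, name)
```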

mrwyattii commented 1 year ago

This was resolved recently. Please see https://github.com/microsoft/DeepSpeed-MII/issues/112#issuecomment-1334475650

mrwyattii commented 1 year ago

Please reopen if this issue is still not resolved.