opea-project / GenAIComps

GenAI components at micro-service level; GenAI service composer to create mega-service
Apache License 2.0

Can NOT upload docx file with embedded images to dataprep-redis service #407

Closed lianhao closed 2 months ago

lianhao commented 3 months ago

When I try to upload a docx file with embedded images (test.docx) to the dataprep-redis service (built and launched from here), curl reports the following error:

$ curl -v -X POST -H "Content-Type: multipart/form-data" -F "files=@./test.docx" http://localhost:6007/v1/dataprep
Note: Unnecessary use of -X or --request, POST is already inferred.
*   Trying 127.0.0.1:6007...
* Connected to localhost (127.0.0.1) port 6007 (#0)
> POST /v1/dataprep HTTP/1.1
> Host: localhost:6007
> User-Agent: curl/7.81.0
> Accept: */*
> Content-Length: 77599
> Content-Type: multipart/form-data; boundary=------------------------435d585403b158ca
>
* We are completely uploaded and fine
* Mark bundle as not supporting multiuse
< HTTP/1.1 500 Internal Server Error
< date: Mon, 05 Aug 2024 02:52:16 GMT
< server: uvicorn
< content-length: 21
< content-type: text/plain; charset=utf-8
<
* Connection #0 to host localhost left intact
Internal Server Error

The dataprep-redis service logs show the following error:

$ sudo -E docker compose -f docker-compose-dataprep-redis.yaml logs dataprep-redis
... ...
... ...
dataprep-redis-server  | files:UploadFile(filename='test.docx', size=77397, headers=Headers({'content-disposition': 'form-data; name="files"; filename="test.docx"', 'content-type': 'application/octet-stream'}))
dataprep-redis-server  | link_list:None
dataprep-redis-server  | Parsing document ./uploaded_files/test.docx.
dataprep-redis-server  | INFO:     172.20.0.1:46260 - "POST /v1/dataprep HTTP/1.1" 500 Internal Server Error
dataprep-redis-server  | ERROR:    Exception in ASGI application
dataprep-redis-server  | Traceback (most recent call last):
dataprep-redis-server  |   File "/home/user/.local/lib/python3.11/site-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
dataprep-redis-server  |     result = await app(  # type: ignore[func-returns-value]
dataprep-redis-server  |              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
dataprep-redis-server  |   File "/home/user/.local/lib/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
dataprep-redis-server  |     return await self.app(scope, receive, send)
dataprep-redis-server  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
dataprep-redis-server  |   File "/home/user/.local/lib/python3.11/site-packages/fastapi/applications.py", line 1054, in __call__
dataprep-redis-server  |     await super().__call__(scope, receive, send)
dataprep-redis-server  |   File "/home/user/.local/lib/python3.11/site-packages/starlette/applications.py", line 123, in __call__
dataprep-redis-server  |     await self.middleware_stack(scope, receive, send)
dataprep-redis-server  |   File "/home/user/.local/lib/python3.11/site-packages/starlette/middleware/errors.py", line 186, in __call__
dataprep-redis-server  |     raise exc
dataprep-redis-server  |   File "/home/user/.local/lib/python3.11/site-packages/starlette/middleware/errors.py", line 164, in __call__
dataprep-redis-server  |     await self.app(scope, receive, _send)
dataprep-redis-server  |   File "/home/user/.local/lib/python3.11/site-packages/prometheus_fastapi_instrumentator/middleware.py", line 174, in __call__
dataprep-redis-server  |     raise exc
dataprep-redis-server  |   File "/home/user/.local/lib/python3.11/site-packages/prometheus_fastapi_instrumentator/middleware.py", line 172, in __call__
dataprep-redis-server  |     await self.app(scope, receive, send_wrapper)
dataprep-redis-server  |   File "/home/user/.local/lib/python3.11/site-packages/starlette/middleware/cors.py", line 85, in __call__
dataprep-redis-server  |     await self.app(scope, receive, send)
dataprep-redis-server  |   File "/home/user/.local/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
dataprep-redis-server  |     await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
dataprep-redis-server  |   File "/home/user/.local/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
dataprep-redis-server  |     raise exc
dataprep-redis-server  |   File "/home/user/.local/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
dataprep-redis-server  |     await app(scope, receive, sender)
dataprep-redis-server  |   File "/home/user/.local/lib/python3.11/site-packages/starlette/routing.py", line 756, in __call__
dataprep-redis-server  |     await self.middleware_stack(scope, receive, send)
dataprep-redis-server  |   File "/home/user/.local/lib/python3.11/site-packages/starlette/routing.py", line 776, in app
dataprep-redis-server  |     await route.handle(scope, receive, send)
dataprep-redis-server  |   File "/home/user/.local/lib/python3.11/site-packages/starlette/routing.py", line 297, in handle
dataprep-redis-server  |     await self.app(scope, receive, send)
dataprep-redis-server  |   File "/home/user/.local/lib/python3.11/site-packages/starlette/routing.py", line 77, in app
dataprep-redis-server  |     await wrap_app_handling_exceptions(app, request)(scope, receive, send)
dataprep-redis-server  |   File "/home/user/.local/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
dataprep-redis-server  |     raise exc
dataprep-redis-server  |   File "/home/user/.local/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
dataprep-redis-server  |     await app(scope, receive, sender)
dataprep-redis-server  |   File "/home/user/.local/lib/python3.11/site-packages/starlette/routing.py", line 72, in app
dataprep-redis-server  |     response = await func(request)
dataprep-redis-server  |                ^^^^^^^^^^^^^^^^^^^
dataprep-redis-server  |   File "/home/user/.local/lib/python3.11/site-packages/fastapi/routing.py", line 278, in app
dataprep-redis-server  |     raw_response = await run_endpoint_function(
dataprep-redis-server  |                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
dataprep-redis-server  |   File "/home/user/.local/lib/python3.11/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
dataprep-redis-server  |     return await dependant.call(**values)
dataprep-redis-server  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
dataprep-redis-server  |   File "/home/user/.local/lib/python3.11/site-packages/langsmith/run_helpers.py", line 468, in async_wrapper
dataprep-redis-server  |     raise e
dataprep-redis-server  |   File "/home/user/.local/lib/python3.11/site-packages/langsmith/run_helpers.py", line 454, in async_wrapper
dataprep-redis-server  |     function_result = await asyncio.create_task(  # type: ignore[call-arg]
dataprep-redis-server  |                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
dataprep-redis-server  |   File "/home/user/comps/dataprep/redis/langchain/prepare_doc_redis.py", line 232, in ingest_documents
dataprep-redis-server  |     ingest_data_to_redis(
dataprep-redis-server  |   File "/home/user/comps/dataprep/redis/langchain/prepare_doc_redis.py", line 127, in ingest_data_to_redis
dataprep-redis-server  |     content = document_loader(path)
dataprep-redis-server  |               ^^^^^^^^^^^^^^^^^^^^^
dataprep-redis-server  |   File "/home/user/comps/dataprep/utils.py", line 337, in document_loader
dataprep-redis-server  |     return load_docx(doc_path)
dataprep-redis-server  |            ^^^^^^^^^^^^^^^^^^^
dataprep-redis-server  |   File "/home/user/comps/dataprep/utils.py", line 191, in load_docx
dataprep-redis-server  |     os.makedirs(save_path, exist_ok=True)
dataprep-redis-server  |   File "<frozen os>", line 225, in makedirs
dataprep-redis-server  | PermissionError: [Errno 13] Permission denied: './imgs/'

ctao456 commented 3 months ago

comment to take issue

lianhao commented 3 months ago

Another potentially related issue: opea-project/GenAIExamples#568

ctao456 commented 3 months ago

Hi @lianhao, I got a similar "permission denied" error when trying to reproduce yours. I downloaded test.docx and tried to upload it with the command curl -v -X POST -H "Content-Type: multipart/form-data" -F "files=@./test.docx" http://localhost:6007/v1/dataprep, but got an Internal Server Error.
Here is the output of docker logs dataprep-redis-server:

files:UploadFile(filename='test.docx', size=77397, headers=Headers({'content-disposition': 'form-data; name="files"; filename="test.docx"', 'content-type': 'application/octet-stream'}))
link_list:None
Parsing document ./uploaded_files/test.docx.
INFO:     172.17.0.1:52896 - "POST /v1/dataprep HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/home/user/.local/lib/python3.11/site-packages/uvicorn/protocols/http/h11_impl.py", line 398, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.local/lib/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.local/lib/python3.11/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/home/user/.local/lib/python3.11/site-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/user/.local/lib/python3.11/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/home/user/.local/lib/python3.11/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/home/user/.local/lib/python3.11/site-packages/prometheus_fastapi_instrumentator/middleware.py", line 174, in __call__
    raise exc
  File "/home/user/.local/lib/python3.11/site-packages/prometheus_fastapi_instrumentator/middleware.py", line 172, in __call__
    await self.app(scope, receive, send_wrapper)
  File "/home/user/.local/lib/python3.11/site-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/home/user/.local/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/home/user/.local/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/home/user/.local/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/home/user/.local/lib/python3.11/site-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/user/.local/lib/python3.11/site-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/home/user/.local/lib/python3.11/site-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/home/user/.local/lib/python3.11/site-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/home/user/.local/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/home/user/.local/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/home/user/.local/lib/python3.11/site-packages/starlette/routing.py", line 72, in app
    response = await func(request)
               ^^^^^^^^^^^^^^^^^^^
  File "/home/user/.local/lib/python3.11/site-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.local/lib/python3.11/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.local/lib/python3.11/site-packages/langsmith/run_helpers.py", line 486, in async_wrapper
    raise e
  File "/home/user/.local/lib/python3.11/site-packages/langsmith/run_helpers.py", line 472, in async_wrapper
    function_result = await asyncio.create_task(  # type: ignore[call-arg]
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/comps/dataprep/redis/langchain/prepare_doc_redis.py", line 200, in ingest_documents
    ingest_data_to_redis(
  File "/home/user/comps/dataprep/redis/langchain/prepare_doc_redis.py", line 167, in ingest_data_to_redis
    content = document_loader(path)
              ^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/comps/dataprep/utils.py", line 337, in document_loader
    return load_docx(doc_path)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/user/comps/dataprep/utils.py", line 192, in load_docx
    docx2txt.process(docx_path, save_path)
  File "/home/user/.local/lib/python3.11/site-packages/docx2txt/docx2txt.py", line 103, in process
    with open(dst_fname, "wb") as dst_f:
         ^^^^^^^^^^^^^^^^^^^^^
PermissionError: [Errno 13] Permission denied: './imgs/image1.png'

I suspect that the current dataprep version does not yet support docx files that contain png images. I also tried uploading gaudi3_whitepaper, but ran into a different issue:

Using CPU. Note: This module is much faster with a GPU.
files:UploadFile(filename='gaudi-3-ai-accelerator-white-paper.pdf', size=2390860, headers=Headers({'content-disposition': 'form-data; name="files"; filename="gaudi-3-ai-accelerator-white-paper.pdf"', 'content-type': 'application/pdf'}))
link_list:None
Parsing document ./uploaded_files/gaudi-3-ai-accelerator-white-paper.pdf.
Done preprocessing. Created  52  chunks of the original pdf
[ ingest chunks ] file name: gaudi-3-ai-accelerator-white-paper.pdf
[ ingest chunks ] Current batch: 0
INFO:     172.17.0.1:42152 - "POST /v1/dataprep HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/home/user/.local/lib/python3.11/site-packages/huggingface_hub/utils/_errors.py", line 304, in hf_raise_for_status
    response.raise_for_status()
  File "/home/user/.local/lib/python3.11/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: http://172.25.116.82:6006/

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/user/.local/lib/python3.11/site-packages/uvicorn/protocols/http/h11_impl.py", line 398, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.local/lib/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.local/lib/python3.11/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/home/user/.local/lib/python3.11/site-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/user/.local/lib/python3.11/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/home/user/.local/lib/python3.11/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/home/user/.local/lib/python3.11/site-packages/prometheus_fastapi_instrumentator/middleware.py", line 174, in __call__
    raise exc
  File "/home/user/.local/lib/python3.11/site-packages/prometheus_fastapi_instrumentator/middleware.py", line 172, in __call__
    await self.app(scope, receive, send_wrapper)
  File "/home/user/.local/lib/python3.11/site-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/home/user/.local/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/home/user/.local/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/home/user/.local/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/home/user/.local/lib/python3.11/site-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/user/.local/lib/python3.11/site-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/home/user/.local/lib/python3.11/site-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/home/user/.local/lib/python3.11/site-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/home/user/.local/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/home/user/.local/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/home/user/.local/lib/python3.11/site-packages/starlette/routing.py", line 72, in app
    response = await func(request)
               ^^^^^^^^^^^^^^^^^^^
  File "/home/user/.local/lib/python3.11/site-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.local/lib/python3.11/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.local/lib/python3.11/site-packages/langsmith/run_helpers.py", line 486, in async_wrapper
    raise e
  File "/home/user/.local/lib/python3.11/site-packages/langsmith/run_helpers.py", line 472, in async_wrapper
    function_result = await asyncio.create_task(  # type: ignore[call-arg]
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/comps/dataprep/redis/langchain/prepare_doc_redis.py", line 200, in ingest_documents
    ingest_data_to_redis(
  File "/home/user/comps/dataprep/redis/langchain/prepare_doc_redis.py", line 176, in ingest_data_to_redis
    return ingest_chunks_to_redis(file_name, chunks)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/comps/dataprep/redis/langchain/prepare_doc_redis.py", line 127, in ingest_chunks_to_redis
    _, keys = Redis.from_texts_return_keys(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.local/lib/python3.11/site-packages/langchain_community/vectorstores/redis/base.py", line 423, in from_texts_return_keys
    keys = instance.add_texts(texts, metadatas, keys=keys)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.local/lib/python3.11/site-packages/langchain_community/vectorstores/redis/base.py", line 694, in add_texts
    embeddings = embeddings or self._embeddings.embed_documents(list(texts))
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.local/lib/python3.11/site-packages/langchain_community/embeddings/huggingface_hub.py", line 116, in embed_documents
    responses = self.client.post(
                ^^^^^^^^^^^^^^^^^
  File "/home/user/.local/lib/python3.11/site-packages/huggingface_hub/inference/_client.py", line 304, in post
    hf_raise_for_status(response)
  File "/home/user/.local/lib/python3.11/site-packages/huggingface_hub/utils/_errors.py", line 367, in hf_raise_for_status
    raise HfHubHTTPError(message, response=response) from e
huggingface_hub.utils._errors.HfHubHTTPError: 

403 Forbidden: None.
Cannot access content at: http://172.25.116.82:6006/.
If you are trying to create or update content, make sure you have a token with the `write` role.

I would recommend trying the ChatQnA example first, where dataprep-redis-server is one of the microservices run from compose.yaml. There, I was able to run the above curl command successfully for gaudi3_whitepaper.

lianhao commented 3 months ago

@ctao456 your issue could be resolved by passing the HUGGINGFACEHUB_API_TOKEN environment variable to the container. We should still resolve the docx file issue. opea-project/GenAIExamples#568 mentions another problem with a docx file that does not contain any pictures.

ctao456 commented 3 months ago

Hi @lianhao, thank you. I already tried passing -e HUGGINGFACEHUB_API_TOKEN=$(HUGGINGFACEHUB_API_TOKEN) to docker run, and it still produced the output above. I verified that TEI is running and that my HF API token has write access to baai/bge-base-en-v1.5, so I am not sure of the cause. However, with the same configuration, the container started from GenAIExamples/ChatQnA works. Good to hear there is a potential solution for uploading docx files.
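One thing that may be worth double-checking here: in a POSIX shell, $(NAME) is command substitution while ${NAME} expands a variable, so the flag as typed above could end up passing an empty token if it was run directly in a shell (inside a Makefile, $(NAME) is a normal variable reference). The sketch below sidesteps shell expansion by building the docker run argument list in Python. The function name docker_run_args and the image name used in the usage note are hypothetical, not taken from the repository.

```python
import os


def docker_run_args(image, env_names, environ=None):
    """Build a `docker run` argv that forwards selected env vars by value.

    Hypothetical helper: it refuses to continue when a variable is unset
    or empty, so a missing token is caught before the container starts.
    """
    environ = os.environ if environ is None else environ
    args = ["docker", "run", "--rm"]
    for name in env_names:
        value = environ.get(name, "")
        if not value:
            raise ValueError(f"{name} is unset or empty in the calling environment")
        # Pass NAME=value explicitly; no shell expansion is involved.
        args += ["-e", f"{name}={value}"]
    args.append(image)
    return args
```

The resulting list could be handed to subprocess.run, for example subprocess.run(docker_run_args("my-dataprep-image", ["HUGGINGFACEHUB_API_TOKEN"]), check=True), where "my-dataprep-image" stands in for whatever image was built locally.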

lianhao commented 3 months ago

@ctao456 as for the ./imgs permission denied issue, I guess it is related to the function at https://github.com/opea-project/GenAIComps/blob/main/comps/dataprep/utils.py#L191, which tries to create a temporary directory in a location where the process does not have write permission. I would suggest creating the temporary directory with Python's tempfile module instead of writing custom create/delete logic.
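A minimal sketch of that suggestion, not the fix that was eventually merged. docx2txt.process(docx_path, img_dir) is the call that appears in the traceback above; the function name load_docx_images and the extract parameter are hypothetical, added so the sketch can be exercised without docx2txt installed.

```python
import os
import tempfile


def load_docx_images(docx_path, extract=None):
    """Extract text and images from a docx using a throwaway temp directory.

    Returns the extracted text and the sorted names of the image files that
    the extractor wrote. The directory is deleted when the block exits.
    """
    if extract is None:
        # Assumption: docx2txt is installed in the service image.
        import docx2txt
        extract = docx2txt.process

    # tempfile picks a location the current user can write to (TMPDIR or the
    # platform default), unlike a relative './imgs/' under the working dir.
    with tempfile.TemporaryDirectory(prefix="docx_imgs_") as img_dir:
        text = extract(docx_path, img_dir)
        image_names = sorted(os.listdir(img_dir))
        # Any per-image processing would have to happen here, before cleanup.
    return text, image_names
```

Because the context manager removes the directory on exit, including when the extractor raises, no separate cleanup code is needed.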

ctao456 commented 3 months ago

Understood. Please feel free to submit a PR. Thanks.

lianhao commented 3 months ago

Unfortunately, I don't have bandwidth to resolve this right now.

lianhao commented 2 months ago

Please assign this bug to me. I have pending PRs to be submitted.

lianhao commented 2 months ago

Completed, as PR #561 has been merged.