infiniflow / ragflow

RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine based on deep document understanding.
https://ragflow.io
Apache License 2.0

[Question]: unable to parse a file due to minio error #2162

Open esychoi opened 2 months ago

esychoi commented 2 months ago

Describe your problem

Ports 9000 and 9001 are already taken by other containers, so I manually changed 9000 to 4000 (MinIO port) and 9001 to 4001 (MinIO console port) in the MinIO settings of all relevant files in the docker/ folder: .env, service_conf.yaml, and docker-compose-base.yml.

I launched the app by following the steps in the README. The Docker containers seem to be working (ragflow-mysql and ragflow-es-01 are healthy; the others are running).

However, when I upload a file to a knowledge base, I can't parse it. This error is shown when I hover over the "Fail" button: [ERROR]Internal server error while chunking: Package not found at contexts.docx

Logs:

...
[WARNING] Load term.freq FAIL!
[WARNING] Load term.freq FAIL!
config.json: 100%|██████████| 650/650 [00:00<00:00, 3.63MB/s]
README.md: 100%|██████████| 1.13k/1.13k [00:00<00:00, 8.45MB/s]
special_tokens_map.json: 100%|██████████| 695/695 [00:00<00:00, 5.15MB/s]
tokenizer_config.json: 100%|██████████| 1.43k/1.43k [00:00<00:00, 7.16MB/s]
.gitattributes: 100%|██████████| 1.52k/1.52k [00:00<00:00, 6.42MB/s]
tokenizer.json: 100%|██████████| 712k/712k [00:00<00:00, 14.4MB/s]
vocab.txt: 100%|██████████| 232k/232k [00:00<00:00, 1.31MB/s]
model.onnx: 100%|██████████| 90.4M/90.4M [00:01<00:00, 75.0MB/s]
Fetching 8 files: 100%|██████████| 8/8 [00:01<00:00,  4.41it/s]]
[WARNING] [2024-08-29 08:48:36,704] [connectionpool.urlopen] [line:874]: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f395cf3ef20>: Failed to establish a new connection: [Errno 111] Connection refused')': /9bb5977065e111ef90fa0242ac160006?location=
[WARNING] [2024-08-29 08:48:37,105] [connectionpool.urlopen] [line:874]: Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f395d0b4c10>: Failed to establish a new connection: [Errno 111] Connection refused')': /9bb5977065e111ef90fa0242ac160006?location=
[WARNING] [2024-08-29 08:48:37,907] [connectionpool.urlopen] [line:874]: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f395d0b5f30>: Failed to establish a new connection: [Errno 111] Connection refused')': /9bb5977065e111ef90fa0242ac160006?location=
[WARNING] [2024-08-29 08:48:39,510] [connectionpool.urlopen] [line:874]: Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f395cf17430>: Failed to establish a new connection: [Errno 111] Connection refused')': /9bb5977065e111ef90fa0242ac160006?location=
[WARNING] [2024-08-29 08:48:42,715] [connectionpool.urlopen] [line:874]: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f395cf16d70>: Failed to establish a new connection: [Errno 111] Connection refused')': /9bb5977065e111ef90fa0242ac160006?location=
Traceback (most recent call last):
  File "/ragflow/rag/svr/task_executor.py", line 177, in build
    cks = chunker.chunk(row["name"], binary=binary, from_page=row["from_page"],
  File "/ragflow/rag/app/naive.py", line 196, in chunk
    sections, tbls = Docx()(filename, binary)
  File "/ragflow/rag/app/naive.py", line 55, in __call__
    self.doc = Document(
  File "/usr/local/lib/python3.10/dist-packages/docx/api.py", line 23, in Document
    document_part = Package.open(docx).main_document_part
  File "/usr/local/lib/python3.10/dist-packages/docx/opc/package.py", line 116, in open
    pkg_reader = PackageReader.from_file(pkg_file)
  File "/usr/local/lib/python3.10/dist-packages/docx/opc/pkgreader.py", line 22, in from_file
    phys_reader = PhysPkgReader(pkg_file)
  File "/usr/local/lib/python3.10/dist-packages/docx/opc/phys_pkg.py", line 21, in __new__
    raise PackageNotFoundError("Package not found at '%s'" % pkg_file)
docx.opc.exceptions.PackageNotFoundError: Package not found at 'contexts.docx'

The embedding model is loaded, but it seems that the file I uploaded cannot be found.

After searching other issues, I am guessing it has something to do with minio. Indeed, here is the minio.log:

fail get 9bb5977065e111ef90fa0242ac160006/contexts.docx: HTTPConnectionPool(host='minio', port=4000): Max retries exceeded with url: /9bb5977065e111ef90fa0242ac160006?location= (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f395cf17c40>: Failed to establish a new connection: [Errno 111] Connection refused'))

How can I fix this?
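The "[Errno 111] Connection refused" lines in the logs above mean that nothing accepted a TCP connection at the address the client tried (minio:4000). A minimal sketch of a reachability probe, using only the Python standard library, can confirm this from inside the ragflow container. The function name can_connect is hypothetical and is not part of RAGFlow or MinIO:

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name-resolution failures.
        return False
```

Calling can_connect("minio", 4000) and can_connect("minio", 9000) from a Python shell inside the ragflow container would show which port, if any, the MinIO service actually answers on within the Docker network.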

KevinHuSh commented 2 months ago

Check the minio section of service_conf.yaml. The port needs to be changed there.

esychoi commented 2 months ago

Thanks for your answer. Yes, I have changed the MinIO port number in service_conf.yaml (both the one in the docker/ folder and the one in the conf/ folder). The two files are identical:

ragflow:
  host: 0.0.0.0
  http_port: 9380
mysql:
  name: 'rag_flow'
  user: 'root'
  password: 'infini_rag_flow'
  host: 'mysql'
  port: 3306
  max_connections: 100
  stale_timeout: 30
minio:
  user: 'rag_flow'
  password: 'infini_rag_flow'
  host: 'minio:4000'
es:
  hosts: 'http://es01:9200'
  username: 'elastic'
  password: 'infini_rag_flow'
redis:
  db: 1
  password: 'infini_rag_flow'
  host: 'redis:6379'
user_default_llm:
  factory: 'OpenAI'
  api_key: 'sk-xxxxxxx'
  base_url: ''
oauth:
  github:
    client_id: xxx
    secret_key: xxx
    url: xxx
authentication:
  client:
    switch: false
    http_app_key:
    http_secret_key:
  site:
    switch: false
permission:
  switch: false
  component: false
  dataset: false

I can access MinIO at http://<MY-IP>:4001/browser, but no bucket or file has been created.

Btw, in the settings, I have this:

image

esychoi commented 2 months ago

I managed to free up port 9000. Now I have MINIO_PORT=9000 and MINIO_CONSOLE_PORT=4001, and it works. It seems that MINIO_CONSOLE_PORT can be configured but MINIO_PORT cannot (I don't know why, though).

KevinHuSh commented 2 months ago

You can access MinIO at http://<MY-IP>:4001/browser, so set host: '<MY-IP>:4000'. Also make sure the port is exposed when you use Docker.
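One plausible reading of this thread, offered as an assumption rather than something confirmed by the maintainers: the MinIO server inside its container keeps listening on its default API port 9000, and the value changed to 4000 only affects the port published on the Docker host. Under that assumption, containers on the same Docker network must use minio:9000, while clients going through the host use <MY-IP>:4000. The sketch below illustrates the distinction; the names PortMapping, parse_mapping, and endpoint are hypothetical, and the "4000:9000" mapping is assumed rather than copied from docker-compose-base.yml:

```python
from typing import NamedTuple


class PortMapping(NamedTuple):
    host_port: int       # port published on the Docker host
    container_port: int  # port the process listens on inside the container


def parse_mapping(spec: str) -> PortMapping:
    """Parse a short-form compose port spec such as '4000:9000'."""
    host, container = spec.split(":")
    return PortMapping(int(host), int(container))


def endpoint(service: str, host_ip: str, spec: str, inside_network: bool) -> str:
    """Return the address a client should use, depending on where it runs."""
    mapping = parse_mapping(spec)
    if inside_network:
        # Containers on the same network reach the service by name on the
        # container port; the published host port plays no role here.
        return f"{service}:{mapping.container_port}"
    # Clients outside the Docker network go through the host's published port.
    return f"{host_ip}:{mapping.host_port}"
```

For example, endpoint("minio", "192.0.2.10", "4000:9000", inside_network=True) gives "minio:9000", which would explain why host: 'minio:4000' was refused, and inside_network=False gives "192.0.2.10:4000", which matches the host: MY-IP:4000 suggestion above.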