netease-youdao / QAnything

Question and Answer based on Anything.
https://qanything.ai
GNU Affero General Public License v3.0
11.92k stars 1.16k forks

[BUG] <title> PDFs cannot be parsed in the latest Python version, even though the PDF model files have been downloaded #480

Open changqingla opened 3 months ago

changqingla commented 3 months ago

Is there an existing issue / discussion for this?

Is there an existing answer for this in FAQ?

Current Behavior

I followed the instruction "download the relevant parsing models from modelscope and place them under qanything_kernel/utils/loader/pdf_to_markdown/checkpoints/ in the repository root". Inside the qanything_kernel/utils/loader/pdf_to_markdown/checkpoints/ directory I ran git clone https://www.modelscope.cn/netease-youdao/QAnything-pdf-parser.git. However, PDFs still cannot be parsed:

2024-08-22 12:00:02,808 split error: Traceback (most recent call last):
  File "/data/ht/rag/qanything_kernel/core/local_doc_qa.py", line 98, in insert_files_to_faiss
    local_file.split_file_to_docs(self.get_ocr_result)
  File "/data/ht/rag/qanything_kernel/utils/general_utils.py", line 73, in inner
    res = func(*arg, **kwargs)
  File "/data/ht/rag/qanything_kernel/core/local_file.py", line 169, in split_file_to_docs
    docs = loader.load_and_split(texts_splitter)
  File "/home/ubuntu/.conda/envs/rag-XY/lib/python3.10/site-packages/langchain_core/document_loaders/base.py", line 63, in load_and_split
    docs = self.load()
  File "/home/ubuntu/.conda/envs/rag-XY/lib/python3.10/site-packages/langchain_core/document_loaders/base.py", line 29, in load
    return list(self.lazy_load())
  File "/home/ubuntu/.conda/envs/rag-XY/lib/python3.10/site-packages/langchain_community/document_loaders/unstructured.py", line 88, in lazy_load
    elements = self._get_elements()
  File "/data/ht/rag/qanything_kernel/utils/loader/pdf_loader.py", line 57, in _get_elements
    return partition_text(filename=txt_file_path, **self.unstructured_kwargs)
  File "/home/ubuntu/.conda/envs/rag-XY/lib/python3.10/site-packages/unstructured/partition/text.py", line 93, in partition_text
    return _partition_text(
  File "/home/ubuntu/.conda/envs/rag-XY/lib/python3.10/site-packages/unstructured/documents/elements.py", line 526, in wrapper
    elements = func(*args, **kwargs)
  File "/home/ubuntu/.conda/envs/rag-XY/lib/python3.10/site-packages/unstructured/file_utils/filetype.py", line 619, in wrapper
    elements = func(*args, **kwargs)
  File "/home/ubuntu/.conda/envs/rag-XY/lib/python3.10/site-packages/unstructured/file_utils/filetype.py", line 574, in wrapper
    elements = func(*args, **kwargs)
  File "/home/ubuntu/.conda/envs/rag-XY/lib/python3.10/site-packages/unstructured/chunking/__init__.py", line 69, in wrapper
    elements = func(*args, **kwargs)
  File "/home/ubuntu/.conda/envs/rag-XY/lib/python3.10/site-packages/unstructured/partition/text.py", line 169, in _partition_text
    file_content = _split_by_paragraph(
  File "/home/ubuntu/.conda/envs/rag-XY/lib/python3.10/site-packages/unstructured/partition/text.py", line 301, in _split_by_paragraph
    _split_content_to_fit_max(
  File "/home/ubuntu/.conda/envs/rag-XY/lib/python3.10/site-packages/unstructured/partition/text.py", line 333, in _split_content_to_fit_max
    sentences = sent_tokenize(content)
  File "/home/ubuntu/.conda/envs/rag-XY/lib/python3.10/site-packages/unstructured/nlp/tokenize.py", line 30, in sent_tokenize
    return _sent_tokenize(text)
  File "/home/ubuntu/.conda/envs/rag-XY/lib/python3.10/site-packages/nltk/tokenize/__init__.py", line 119, in sent_tokenize
    tokenizer = _get_punkt_tokenizer(language)
  File "/home/ubuntu/.conda/envs/rag-XY/lib/python3.10/site-packages/nltk/tokenize/__init__.py", line 105, in _get_punkt_tokenizer
    return PunktTokenizer(language)
  File "/home/ubuntu/.conda/envs/rag-XY/lib/python3.10/site-packages/nltk/tokenize/punkt.py", line 1744, in __init__
    self.load_lang(lang)
  File "/home/ubuntu/.conda/envs/rag-XY/lib/python3.10/site-packages/nltk/tokenize/punkt.py", line 1749, in load_lang
    lang_dir = find(f"tokenizers/punkt_tab/{lang}/")
  File "/home/ubuntu/.conda/envs/rag-XY/lib/python3.10/site-packages/nltk/data.py", line 579, in find
    raise LookupError(resource_not_found)
LookupError:


Resource punkt_tab not found. Please use the NLTK Downloader to obtain the resource:

>>> import nltk
>>> nltk.download('punkt_tab')

For more information see: https://www.nltk.org/data.html

Attempted to load tokenizers/punkt_tab/english/

Searched in:

2024-08-22 12:00:02,809 insert_to_faiss: success num: 0, failed num: 1
2024-08-22 12:00:03,438 list_docs zzp
2024-08-22 12:00:03,439 kb_id: KB68e60de6f07d47daab54fd0bc673aa83
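
The LookupError in this log says that the NLTK data package punkt_tab is missing from the environment, and the message itself suggests nltk.download('punkt_tab'). A minimal sketch of checking for the resource and downloading it only when absent is shown here. It assumes the machine has network access and that it is run in the same conda environment as QAnything; the helper names ensure_resource and ensure_punkt_tab are hypothetical and are not part of QAnything or NLTK.

```python
from typing import Callable


def ensure_resource(
    resource_path: str,
    package: str,
    find: Callable[[str], object],
    download: Callable[[str], bool],
) -> bool:
    """Return True if the resource is available, downloading it if needed.

    `find` is expected to raise LookupError when the resource is missing,
    which is how nltk.data.find behaves in the traceback from this issue.
    """
    try:
        find(resource_path)
        return True  # already installed, nothing to do
    except LookupError:
        pass
    # Resource missing: fetch it and report whether the download succeeded.
    return bool(download(package))


def ensure_punkt_tab() -> bool:
    """Hypothetical wrapper: check for punkt_tab and download it when absent."""
    import nltk  # imported lazily so ensure_resource stays dependency-free

    return ensure_resource(
        "tokenizers/punkt_tab/english/",  # path taken from the error message
        "punkt_tab",
        nltk.data.find,
        nltk.download,
    )
```

Calling ensure_punkt_tab() once before starting the service would be one way to use it. On a machine without internet access the download step returns False, and the data would have to be copied manually into one of the directories NLTK searches.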

Expected Behavior

No response

Environment

- OS:
- NVIDIA Driver:
- CUDA:
- docker:
- docker-compose:
- NVIDIA GPU:
- NVIDIA GPU Memory:

QAnything logs

No response

Steps To Reproduce

No response

Anything else?

No response

RonaldJEN commented 2 months ago

It uses ragflow's parser. ragflow does not support scanned documents, so this does not support them either.

linKnowEasy commented 2 months ago

Change requirements.txt from

unstructured==0.12.4
unstructured[pptx]==0.12.4
unstructured[md]==0.12.4

to

unstructured==0.15.7
unstructured[pptx]==0.15.7
unstructured[md]==0.15.7

Then re-run pip install -r requirements.txt and PDFs can be parsed.
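
For anyone who prefers to script the pin change rather than edit the file by hand, here is a rough sketch. The function name bump_pins is hypothetical, and it only handles exact `==` pins like the three lines quoted in this comment. Whether 0.15.7 fixes the problem in other environments is only as reported here.

```python
import re


def bump_pins(requirements_text: str, package: str, new_version: str) -> str:
    """Rewrite `package==X` and `package[extra]==X` pins to `new_version`."""
    # Match the package name, an optional [extras] group, then an exact pin.
    # Names that merely start with `package` (e.g. with a "-suffix") are skipped
    # because "==" must follow the name or its extras directly.
    pattern = re.compile(
        rf"^(\s*{re.escape(package)}(?:\[[^\]]*\])?\s*==\s*)\S+",
        flags=re.MULTILINE,
    )
    # A function replacement avoids backreference ambiguity when the
    # new version string starts with a digit.
    return pattern.sub(lambda m: m.group(1) + new_version, requirements_text)


if __name__ == "__main__":
    with open("requirements.txt", encoding="utf-8") as fh:
        original = fh.read()
    with open("requirements.txt", "w", encoding="utf-8") as fh:
        fh.write(bump_pins(original, "unstructured", "0.15.7"))
```

After running it, pip install -r requirements.txt still has to be re-run as described above.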

xxlxms commented 1 month ago

It's all just copied from here and there.