h2oai / h2ogpt

Private chat with local GPT with document, images, video, etc. 100% private, Apache 2.0. Supports oLLaMa, Mixtral, llama.cpp, and more. Demo: https://gpt.h2o.ai/ https://gpt-docs.h2o.ai/
http://h2o.ai
Apache License 2.0

Problems uploading office and .epub docs #462

Open a3nima opened 1 year ago

a3nima commented 1 year ago

Under Windows 10, freshly installed h2ogpt. LibreOffice was already installed when I installed h2ogpt. Launched with:

python generate.py --base_model=h2oai/h2ogpt-gm-oasst1-en-2048-falcon-7b-v3 --langchain_mode=MyData --score_model=None --share=False --gradio_offline_level=1

| 0/1 [00:00<?, ?it/s]Failed to ingest C:\Users\xyz\AppData\Local\Temp\gradio\04d147c3a6328f3211f651f1022dff2737b7a6b0\zzz.doc due to Traceback (most recent call last):
  File "e:\text-ai\envs\h2ogpt\lib\site-packages\unstructured\partition\common.py", line 153, in convert_office_doc
    process = subprocess.Popen(
  File "e:\text-ai\envs\h2ogpt\lib\subprocess.py", line 971, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "e:\text-ai\envs\h2ogpt\lib\subprocess.py", line 1456, in _execute_child
    hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
FileNotFoundError: [WinError 2] Das System kann die angegebene Datei nicht finden (The system cannot find the file specified)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "E:\TEXT-AI\h2ogpt\src\gpt_langchain.py", line 1297, in path_to_doc1
    res = file_to_doc(file, base_path=None, verbose=verbose, fail_any_exception=fail_any_exception,
  File "E:\TEXT-AI\h2ogpt\src\gpt_langchain.py", line 1117, in file_to_doc
    docs1 = UnstructuredWordDocumentLoader(file_path=file).load()
  File "e:\text-ai\envs\h2ogpt\lib\site-packages\langchain\document_loaders\unstructured.py", line 71, in load
    elements = self._get_elements()
  File "e:\text-ai\envs\h2ogpt\lib\site-packages\langchain\document_loaders\word_document.py", line 98, in _get_elements
    return partition_doc(filename=self.file_path, **self.unstructured_kwargs)
  File "e:\text-ai\envs\h2ogpt\lib\site-packages\unstructured\file_utils\filetype.py", line 476, in wrapper
    elements = func(*args, **kwargs)
  File "e:\text-ai\envs\h2ogpt\lib\site-packages\unstructured\partition\doc.py", line 42, in partition_doc
    convert_office_doc(filename, tmpdir, target_format="docx")
  File "e:\text-ai\envs\h2ogpt\lib\site-packages\unstructured\partition\common.py", line 160, in convert_office_doc
    raise FileNotFoundError(
FileNotFoundError: soffice command was not found. Please install libreoffice on your system and try again.
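
A quick way to confirm the cause from the same Python environment (a minimal sketch; `soffice` is the executable name taken from the traceback above, and `shutil.which` performs the same PATH lookup that `subprocess.Popen` relies on):

```python
# Minimal check: is the `soffice` binary that unstructured shells out
# to discoverable on PATH? None means LibreOffice's program folder
# (typically C:\Program Files\LibreOffice\program) is not on PATH yet.
import shutil

print(shutil.which("soffice"))
```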

a3nima commented 1 year ago

Also, epub did not work for me thus far. At least with epub I got a clue about what needs to be installed: pypandoc. Maybe it is missing in the Windows installation routine? There are several methods to install it listed on the pypandoc page, but I don't know which would be the right one (https://pypi.org/project/pypandoc/).

The error for epub is as follows:

Failed to ingest C:\Users\mech\AppData\Local\Temp\gradio\f975ae57afe3e0d4fd1bc163feb1fa2f607e4e12\xyz.epub due to Traceback (most recent call last):
  File "E:\TEXT-AI\h2ogpt\src\gpt_langchain.py", line 1299, in path_to_doc1
    res = file_to_doc(file, base_path=None, verbose=verbose, fail_any_exception=fail_any_exception,
  File "E:\TEXT-AI\h2ogpt\src\gpt_langchain.py", line 1156, in file_to_doc
    docs1 = UnstructuredEPubLoader(file).load()
  File "e:\text-ai\envs\h2ogpt\lib\site-packages\langchain\document_loaders\unstructured.py", line 71, in load
    elements = self._get_elements()
  File "e:\text-ai\envs\h2ogpt\lib\site-packages\langchain\document_loaders\epub.py", line 22, in _get_elements
    return partition_epub(filename=self.file_path, **self.unstructured_kwargs)
  File "e:\text-ai\envs\h2ogpt\lib\site-packages\unstructured\file_utils\filetype.py", line 476, in wrapper
    elements = func(*args, **kwargs)
  File "e:\text-ai\envs\h2ogpt\lib\site-packages\unstructured\partition\epub.py", line 26, in partition_epub
    return convert_and_partition_html(
  File "e:\text-ai\envs\h2ogpt\lib\site-packages\unstructured\partition\html.py", line 112, in convert_and_partition_html
    html_text = convert_file_to_html_text(source_format=source_format, filename=filename, file=file)
  File "e:\text-ai\envs\h2ogpt\lib\site-packages\unstructured\file_utils\file_conversion.py", line 44, in convert_file_to_html_text
    html_text = convert_file_to_text(
  File "e:\text-ai\envs\h2ogpt\lib\site-packages\unstructured\file_utils\file_conversion.py", line 12, in convert_file_to_text
    text = pypandoc.convert_file(filename, target_format, format=source_format)
  File "e:\text-ai\envs\h2ogpt\lib\site-packages\pypandoc\__init__.py", line 168, in convert_file
    return _convert_input(discovered_source_files, format, 'path', to, extra_args=extra_args,
  File "e:\text-ai\envs\h2ogpt\lib\site-packages\pypandoc\__init__.py", line 324, in _convert_input
    _ensure_pandoc_path()
  File "e:\text-ai\envs\h2ogpt\lib\site-packages\pypandoc\__init__.py", line 750, in _ensure_pandoc_path
    raise OSError("No pandoc was found: either install pandoc and add it\n"
OSError: No pandoc was found: either install pandoc and add it to your PATH or or call pypandoc.download_pandoc(...) or install pypandoc wheels with included pandoc.
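
One way to resolve this without touching the system PATH, following the hint in the OSError message itself (a minimal sketch; it assumes only that the pypandoc package is importable):

```python
# If no pandoc binary is found, download one via pypandoc, as the
# OSError above suggests with pypandoc.download_pandoc(...).
import pypandoc

try:
    print("pandoc found:", pypandoc.get_pandoc_version())
except OSError:
    pypandoc.download_pandoc()  # fetches a pandoc binary for this platform
    print("pandoc found:", pypandoc.get_pandoc_version())
```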

a3nima commented 1 year ago

The Windows installer of pandoc (https://github.com/jgm/pandoc/releases) did the job, but then an error occurred:

100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00,  3.54s/it]
0it [00:00, ?it/s]
load INSTRUCTOR_Transformer
max_seq_length  512
Traceback (most recent call last):
  File "E:\TEXT-AI\h2ogpt\src\gradio_runner.py", line 2346, in update_user_db
    return _update_user_db(file, db1=db1, chunk=chunk, chunk_size=chunk_size,
  File "E:\TEXT-AI\h2ogpt\src\gradio_runner.py", line 2477, in _update_user_db
    db = get_db(sources, use_openai_embedding=use_openai_embedding,
  File "E:\TEXT-AI\h2ogpt\src\gpt_langchain.py", line 99, in get_db
    db = Chroma.from_documents(documents=sources,
  File "e:\TEXT-AI\envs\h2ogpt\lib\site-packages\langchain\vectorstores\chroma.py", line 446, in from_documents
    return cls.from_texts(
  File "e:\TEXT-AI\envs\h2ogpt\lib\site-packages\langchain\vectorstores\chroma.py", line 414, in from_texts
    chroma_collection.add_texts(texts=texts, metadatas=metadatas, ids=ids)
  File "e:\TEXT-AI\envs\h2ogpt\lib\site-packages\langchain\vectorstores\chroma.py", line 159, in add_texts
    embeddings = self._embedding_function.embed_documents(list(texts))
  File "e:\TEXT-AI\envs\h2ogpt\lib\site-packages\langchain\embeddings\huggingface.py", line 158, in embed_documents
    embeddings = self.client.encode(instruction_pairs, **self.encode_kwargs)
  File "e:\TEXT-AI\envs\h2ogpt\lib\site-packages\InstructorEmbedding\instructor.py", line 539, in encode
    out_features = self.forward(features)
  File "e:\TEXT-AI\envs\h2ogpt\lib\site-packages\torch\nn\modules\container.py", line 217, in forward
    input = module(input)
  File "e:\TEXT-AI\envs\h2ogpt\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "e:\TEXT-AI\envs\h2ogpt\lib\site-packages\InstructorEmbedding\instructor.py", line 278, in forward
    assert torch.sum(attention_mask[local_idx]).item() >= context_masks[local_idx].item(),\
RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

a3nima commented 1 year ago

Adding LibreOffice to PATH in Windows fixed the problem with LibreOffice not being found.

But now I get the CUDA error (black screen, PC freeze included) there too:

To create a public link, set `share=True` in `launch()`.
100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00,  2.73s/it]
0it [00:00, ?it/s]
load INSTRUCTOR_Transformer
max_seq_length  512
Run time of job "clear_torch_cache (trigger: interval[0:00:20], next run at: 2023-07-16 20:57:57 CEST)" was missed by 0:00:07.446917
Traceback (most recent call last):
  File "E:\TEXT-AI\h2ogpt\src\gradio_runner.py", line 2346, in update_user_db
    return _update_user_db(file, db1=db1, chunk=chunk, chunk_size=chunk_size,
  File "E:\TEXT-AI\h2ogpt\src\gradio_runner.py", line 2477, in _update_user_db
    db = get_db(sources, use_openai_embedding=use_openai_embedding,
  File "E:\TEXT-AI\h2ogpt\src\gpt_langchain.py", line 99, in get_db
    db = Chroma.from_documents(documents=sources,
  File "e:\TEXT-AI\envs\h2ogpt\lib\site-packages\langchain\vectorstores\chroma.py", line 446, in from_documents
    return cls.from_texts(
  File "e:\TEXT-AI\envs\h2ogpt\lib\site-packages\langchain\vectorstores\chroma.py", line 414, in from_texts
    chroma_collection.add_texts(texts=texts, metadatas=metadatas, ids=ids)
  File "e:\TEXT-AI\envs\h2ogpt\lib\site-packages\langchain\vectorstores\chroma.py", line 159, in add_texts
    embeddings = self._embedding_function.embed_documents(list(texts))
  File "e:\TEXT-AI\envs\h2ogpt\lib\site-packages\langchain\embeddings\huggingface.py", line 158, in embed_documents
    embeddings = self.client.encode(instruction_pairs, **self.encode_kwargs)
  File "e:\TEXT-AI\envs\h2ogpt\lib\site-packages\InstructorEmbedding\instructor.py", line 539, in encode
    out_features = self.forward(features)
  File "e:\TEXT-AI\envs\h2ogpt\lib\site-packages\torch\nn\modules\container.py", line 217, in forward
    input = module(input)
  File "e:\TEXT-AI\envs\h2ogpt\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "e:\TEXT-AI\envs\h2ogpt\lib\site-packages\InstructorEmbedding\instructor.py", line 278, in forward
    assert torch.sum(attention_mask[local_idx]).item() >= context_masks[local_idx].item(),\
RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
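
As the error text itself notes, asynchronous CUDA failures can produce a misleading stack trace. A minimal sketch for getting a trustworthy trace (the variable name comes from the message above; it must be set before torch initializes CUDA, e.g. before launching generate.py):

```python
# Force synchronous CUDA kernel launches so the Python traceback points
# at the op that actually failed. Equivalent to running
# `set CUDA_LAUNCH_BLOCKING=1` in the Windows shell before launch.
import os

os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
```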

pseudotensor commented 1 year ago

Thanks. Looks like something is using CUDA in the parent process when it reaches that point. It should be avoided, but maybe there is still a way.

I'll try on Windows as soon as I can. Please provide any other repro context.

pseudotensor commented 1 year ago

On the pandoc issue, I think it's because the requirements were modified and pypandoc_binary is only installed for x86_64.

@Mathanraj-Sharma I know you added those requirements.txt conditions for mac, but why isn't "Windows" allowed for pypandoc_binary?

```
# pandoc==2.3
pypandoc==1.11; sys_platform == "darwin" and platform_machine == "arm64"
pypandoc_binary==1.11; platform_machine == "x86_64"
```

Mathanraj-Sharma commented 1 year ago

@pseudotensor I can only see a win32-supported wheel for pypandoc_binary. https://pypi.org/project/pypandoc-binary/#files

It could be the problem. I think we need to have `pypandoc_binary==1.11; platform_machine == "x86_64" and platform_machine == "win32"`.

WDYT?

a3nima commented 1 year ago

> Thanks. Looks like something is using CUDA in the parent process when it reaches that point. It should be avoided, but maybe there is still a way.
>
> I'll try on Windows as soon as I can. Please provide any other repro context.

What context should I post?

When I uploaded a small PDF (3 pages), it loaded, but there is an error and a warning (see below). If I then try to chat about the document, it works for a short time, but in the end it BSODs my system :(

To create a public link, set `share=True` in `launch()`.
100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  8.20it/s]
0it [00:00, ?it/s]
load INSTRUCTOR_Transformer
max_seq_length  512
A matching Triton is not available, some optimizations will not be enabled. Error caught was: No module named 'triton'
The model 'RWForCausalLM' is not supported for . Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'CodeGenForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'LlamaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MvpForCausalLM', 'OpenLlamaForCausalLM', 'OpenAIGPTLMHeadModel', 'OPTForCausalLM', 'PegasusForCausalLM', 'PLBartForCausalLM', 'ProphetNetForCausalLM', 'QDQBertLMHeadModel', 'ReformerModelWithLMHead', 'RemBertForCausalLM', 'RobertaForCausalLM', 'RobertaPreLayerNormForCausalLM', 'RoCBertForCausalLM', 'RoFormerForCausalLM', 'RwkvForCausalLM', 'Speech2Text2ForCausalLM', 'TransfoXLLMHeadModel', 'TrOCRForCausalLM', 'XGLMForCausalLM', 'XLMWithLMHeadModel', 'XLMProphetNetForCausalLM', 'XLMRobertaForCausalLM', 'XLMRobertaXLForCausalLM', 'XLNetLMHeadModel', 'XmodForCausalLM'].
Number of requested results 1000 is greater than number of elements in index 9, updating n_results = 9
e:\TEXT-AI\envs\h2ogpt\lib\site-packages\transformers\generation\utils.py:1259: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation)
  warnings.warn(
Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.
prompt: <|prompt|>Pay attention and remember information below, which will help to answer the question or imperative after the context ends.

pseudotensor commented 1 year ago

For the CUDA issue, a work-around is to disable parallel ingest by passing `--n_jobs=1` to generate.py.
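
For example, combined with the launch command from the original report (all other flags unchanged): `python generate.py --base_model=h2oai/h2ogpt-gm-oasst1-en-2048-falcon-7b-v3 --langchain_mode=MyData --score_model=None --n_jobs=1`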

pseudotensor commented 1 year ago

> @pseudotensor I can only see a win32-supported wheel for pypandoc_binary. https://pypi.org/project/pypandoc-binary/#files
>
> It could be the problem. I think we need to have `pypandoc_binary==1.11; platform_machine == "x86_64" and platform_machine == "win32"`.
>
> WDYT?

Sounds like a good solution. Did you mean "or" there? I'm not familiar with the conditions.
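
For reference, and only as a sketch (not verified against pip): in PEP 508 environment markers, `platform_machine` cannot equal both "x86_64" and "win32" at once, and Windows is normally matched with `sys_platform == "win32"` (64-bit Windows Python reports `platform_machine` as "AMD64"). So an "or" form such as `pypandoc_binary==1.11; platform_machine == "x86_64" or sys_platform == "win32"` would be the likely fix.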

a3nima commented 1 year ago

> For the CUDA issue, a work-around is to disable parallel ingest by passing `--n_jobs=1` to generate.py.

This didn't help; I forgot to report back.

a3nima commented 1 year ago

Tried a 159-page PDF. Is it too big? Fresh install (3rd time :( ).

I can download and run different model types, but loading documents and chatting only worked with very small .txt files.

Used this for the failed try to load a PDF:

python generate.py --base_model=h2oai/h2ogpt-gm-oasst1-en-2048-falcon-7b-v3 --langchain_mode=UserData --score_model=None

remote: Enumerating objects: 5, done.
remote: Counting objects: 100% (5/5), done.
remote: Total 5 (delta 4), reused 5 (delta 4), pack-reused 0
Unpacking objects: 100% (5/5), 590 bytes | 65.00 KiB/s, done.
From https://github.com/h2oai/h2ogpt
   b0bfa0b..ea7dc1b  main       -> origin/main
Updating b0bfa0b..ea7dc1b
Fast-forward
 src/gen.py           | 5 ++++-
 src/gradio_runner.py | 8 +++++---
 2 files changed, 9 insertions(+), 4 deletions(-)
Using Model h2oai/h2ogpt-gm-oasst1-en-2048-falcon-7b-v3
Prep: persist_directory=db_dir_UserData exists, using
Starting get_model: h2oai/h2ogpt-gm-oasst1-en-2048-falcon-7b-v3
Could not determine --max_seq_len, setting to 2048.  Pass if not correct
Could not determine --max_seq_len, setting to 2048.  Pass if not correct
Could not determine --max_seq_len, setting to 2048.  Pass if not correct
device_map: {'': 0}
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████| 2/2 [00:10<00:00,  5.04s/it]
Model {'base_model': 'h2oai/h2ogpt-gm-oasst1-en-2048-falcon-7b-v3', 'tokenizer_base_model': '', 'lora_weights': '', 'inference_server': '', 'prompt_type': 'prompt_answer', 'prompt_dict': {'promptA': '', 'promptB': '', 'PreInstruct': '<|prompt|>', 'PreInput': None, 'PreResponse': '<|answer|>', 'terminate_response': ['<|prompt|>', '<|answer|>', '<|endoftext|>'], 'chat_sep': '<|endoftext|>', 'chat_turn_sep': '<|endoftext|>', 'humanstr': '<|prompt|>', 'botstr': '<|answer|>', 'generates_leading_space': False}}
Running on local URL:  http://0.0.0.0:7860

To create a public link, set `share=True` in `launch()`.
100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  8.70it/s]
0it [00:00, ?it/s]
load INSTRUCTOR_Transformer
max_seq_length  512
Traceback (most recent call last):
  File "E:\TEXT-AI\h2ogpt\src\gradio_runner.py", line 2693, in update_user_db
    return _update_user_db(file, db1s=db1s, chunk=chunk, chunk_size=chunk_size,
  File "E:\TEXT-AI\h2ogpt\src\gradio_runner.py", line 2869, in _update_user_db
    db = get_db(sources, use_openai_embedding=use_openai_embedding,
  File "E:\TEXT-AI\h2ogpt\src\gpt_langchain.py", line 113, in get_db
    db = Chroma.from_documents(documents=sources,
  File "e:\TEXT-AI\venv\h2ogpt\lib\site-packages\langchain\vectorstores\chroma.py", line 564, in from_documents
    return cls.from_texts(
  File "e:\TEXT-AI\venv\h2ogpt\lib\site-packages\langchain\vectorstores\chroma.py", line 528, in from_texts
    chroma_collection.add_texts(texts=texts, metadatas=metadatas, ids=ids)
  File "e:\TEXT-AI\venv\h2ogpt\lib\site-packages\langchain\vectorstores\chroma.py", line 166, in add_texts
    embeddings = self._embedding_function.embed_documents(list(texts))
  File "e:\TEXT-AI\venv\h2ogpt\lib\site-packages\langchain\embeddings\huggingface.py", line 158, in embed_documents
    embeddings = self.client.encode(instruction_pairs, **self.encode_kwargs)
  File "e:\TEXT-AI\venv\h2ogpt\lib\site-packages\InstructorEmbedding\instructor.py", line 539, in encode
    out_features = self.forward(features)
  File "e:\TEXT-AI\venv\h2ogpt\lib\site-packages\torch\nn\modules\container.py", line 217, in forward
    input = module(input)
  File "e:\TEXT-AI\venv\h2ogpt\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "e:\TEXT-AI\venv\h2ogpt\lib\site-packages\InstructorEmbedding\instructor.py", line 278, in forward
    assert torch.sum(attention_mask[local_idx]).item() >= context_masks[local_idx].item(),\
RuntimeError: CUDA error: invalid program counter
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.