katanaml / sparrow

Data processing with ML and LLM
https://katanaml.io
GNU General Public License v3.0
3.45k stars, 359 forks

When running Unstructured, { ModuleNotFoundError: No module named 'backoff._typing' } #52

Closed pitbuk101 closed 4 months ago

pitbuk101 commented 4 months ago

(.env_unstructured) root@testvm:/home/testvmadmin/main/sparrow/sparrow-ml/llm# pip install backoff==1.11.1
Collecting backoff==1.11.1
  Using cached backoff-1.11.1-py2.py3-none-any.whl (13 kB)
Installing collected packages: backoff
  Attempting uninstall: backoff
    Found existing installation: backoff 2.2.1
    Uninstalling backoff-2.2.1:
      Successfully uninstalled backoff-2.2.1
Successfully installed backoff-1.11.1
WARNING: You are using pip version 22.0.4; however, version 24.0 is available.
You should consider upgrading via the '/home/testvmadmin/main/sparrow/sparrow-ml/llm/.env_unstructured/bin/python -m pip install --upgrade pip' command.
(.env_unstructured) root@testvm:/home/testvmadmin/main/sparrow/sparrow-ml/llm# ./sparrow.sh "invoice_number, invoice_date, total_gross_worth" "int, str, str" --agent unstructured --file-path ./data/invoice_1.pdf
Detected Python version: Python 3.10.4

Running pipeline with unstructured

⠸ Processing file with unstructured...
Traceback (most recent call last):
  File "/home/testvmadmin/main/sparrow/sparrow-ml/llm/.env_unstructured/bin/unstructured-ingest", line 5, in <module>
    from unstructured.ingest.main import main
  File "/home/testvmadmin/main/sparrow/sparrow-ml/llm/.env_unstructured/lib/python3.10/site-packages/unstructured/ingest/main.py", line 2, in <module>
    from unstructured.ingest.cli.cli import get_cmd
  File "/home/testvmadmin/main/sparrow/sparrow-ml/llm/.env_unstructured/lib/python3.10/site-packages/unstructured/ingest/cli/__init__.py", line 5, in <module>
    from unstructured.ingest.cli.cmds import base_dest_cmd_fns, base_src_cmd_fns
  File "/home/testvmadmin/main/sparrow/sparrow-ml/llm/.env_unstructured/lib/python3.10/site-packages/unstructured/ingest/cli/cmds/__init__.py", line 6, in <module>
    from unstructured.ingest.cli.base.src import BaseSrcCmd
  File "/home/testvmadmin/main/sparrow/sparrow-ml/llm/.env_unstructured/lib/python3.10/site-packages/unstructured/ingest/cli/base/src.py", line 13, in <module>
    from unstructured.ingest.runner import runner_map
  File "/home/testvmadmin/main/sparrow/sparrow-ml/llm/.env_unstructured/lib/python3.10/site-packages/unstructured/ingest/runner/__init__.py", line 4, in <module>
    from .airtable import AirtableRunner
  File "/home/testvmadmin/main/sparrow/sparrow-ml/llm/.env_unstructured/lib/python3.10/site-packages/unstructured/ingest/runner/airtable.py", line 7, in <module>
    from unstructured.ingest.runner.base_runner import Runner
  File "/home/testvmadmin/main/sparrow/sparrow-ml/llm/.env_unstructured/lib/python3.10/site-packages/unstructured/ingest/runner/base_runner.py", line 20, in <module>
    from unstructured.ingest.processor import process_documents
  File "/home/testvmadmin/main/sparrow/sparrow-ml/llm/.env_unstructured/lib/python3.10/site-packages/unstructured/ingest/processor.py", line 15, in <module>
    from unstructured.ingest.pipeline import (
  File "/home/testvmadmin/main/sparrow/sparrow-ml/llm/.env_unstructured/lib/python3.10/site-packages/unstructured/ingest/pipeline/__init__.py", line 1, in <module>
    from .doc_factory import DocFactory
  File "/home/testvmadmin/main/sparrow/sparrow-ml/llm/.env_unstructured/lib/python3.10/site-packages/unstructured/ingest/pipeline/doc_factory.py", line 4, in <module>
    from unstructured.ingest.pipeline.interfaces import DocFactoryNode
  File "/home/testvmadmin/main/sparrow/sparrow-ml/llm/.env_unstructured/lib/python3.10/site-packages/unstructured/ingest/pipeline/interfaces.py", line 15, in <module>
    from unstructured.ingest.ingest_backoff import RetryHandler
  File "/home/testvmadmin/main/sparrow/sparrow-ml/llm/.env_unstructured/lib/python3.10/site-packages/unstructured/ingest/ingest_backoff/__init__.py", line 1, in <module>
    from ._wrapper import RetryHandler
  File "/home/testvmadmin/main/sparrow/sparrow-ml/llm/.env_unstructured/lib/python3.10/site-packages/unstructured/ingest/ingest_backoff/_wrapper.py", line 9, in <module>
    from backoff._typing import (
ModuleNotFoundError: No module named 'backoff._typing'

Command failed. Error:
⠴ Processing file with unstructured...
Traceback (most recent call last)

/home/testvmadmin/main/sparrow/sparrow-ml/llm/engine.py:31 in run

    28
    29     try:
    30         rag = get_pipeline(user_selected_agent)
  ❱ 31         rag.run_pipeline(user_selected_agent, query_inputs_arr, query_t
    32                         debug)
    33     except ValueError as e:
    34         print(f"Caught an exception: {e}")

  locals:
    agent = 'unstructured'
    debug = False
    file_path = './data/invoice_1.pdf'
    index_name = None
    inputs = 'invoice_number, invoice_date, total_gross_worth'
    options = None
    query = 'retrieve invoice_number, invoice_date, total_gross_worth'
    query_inputs_arr = [
        'invoice_number',
        'invoice_date',
        'total_gross_worth'
    ]
    query_types = 'int, str, str'
    query_types_arr = ['int', 'str', 'str']
    rag = <rag.agents.unstructured.unstructured.Unstructure… object at 0x7f1d83b82f80>
    types = 'int, str, str'
    user_selected_agent = 'unstructured'

/home/testvmadmin/main/sparrow/sparrow-ml/llm/rag/agents/unstructured/unstructured.py:71 in run_pipeline

    68
    69             os.makedirs(temp_output_dir, exist_ok=True)
    70
  ❱ 71             files = self.invoke_pipeline_step(
    72                 lambda: self.process_files(temp_output_dir, temp_input
    73                 "Processing file with unstructured...",
    74                 local

  locals:
    debug = False
    device = 'cpu'
    embedding_model_name = 'all-MiniLM-L6-v2'
    file_path = './data/invoice_1.pdf'
    index_name = None
    input_dir = 'data/pdf'
    local = True
    options = None
    output_dir = 'data/json'
    payload = 'unstructured'
    query = 'retrieve invoice_number, invoice_date, total_gross_worth'
    query_inputs = [
        'invoice_number',
        'invoice_date',
        'total_gross_worth'
    ]
    query_types = ['int', 'str', 'str']
    self = <rag.agents.unstructured.unstructured.Unstructur… object at 0x7f1d83b82f80>
    start = 6444.234209002
    temp_dir = '/tmp/tmpf7ym66qi'
    temp_input_dir = '/tmp/tmpf7ym66qi/data/pdf'
    temp_output_dir = '/tmp/tmpf7ym66qi/data/json'
    weaviate_url = 'http://localhost:8080'

/home/testvmadmin/main/sparrow/sparrow-ml/llm/rag/agents/unstructured/unstructured.py:364 in invoke_pipeline_step

    361                     transient=False,
    362             ) as progress:
    363                 progress.add_task(description=task_description, total=
  ❱ 364                 ret = task_call()
    365         else:
    366             print(task_description)
    367             ret = task_call()

  locals:
    local = True
    progress = <rich.progress.Progress object at 0x7f1ca9cb5c00>
    self = <rag.agents.unstructured.unstructured.UnstructuredPi… object at 0x7f1d83b82f80>
    task_call = <function UnstructuredPipeline.run_pipeline.<locals>.<lambda> at 0x7f1d83b8cee0>
    task_description = 'Processing file with unstructured...'

/home/testvmadmin/main/sparrow/sparrow-ml/llm/rag/agents/unstructured/unstructured.py:72 in <lambda>

    69             os.makedirs(temp_output_dir, exist_ok=True)
    70
    71             files = self.invoke_pipeline_step(
  ❱ 72                 lambda: self.process_files(temp_output_dir, temp_input
    73                 "Processing file with unstructured...",
    74                 local
    75             )

  locals:
    self = <rag.agents.unstructured.unstructured.UnstructuredPip… object at 0x7f1d83b82f80>
    temp_input_dir = '/tmp/tmpf7ym66qi/data/pdf'
    temp_output_dir = '/tmp/tmpf7ym66qi/data/json'

/home/testvmadmin/main/sparrow/sparrow-ml/llm/rag/agents/unstructured/unstructured.py:123 in process_files

    120         return answer
    121
    122     def process_files(self, temp_output_dir, temp_input_dir):
  ❱ 123         self.process_local(output_dir=temp_output_dir, num_processes=2
    124         files = self.get_result_files(temp_output_dir)
    125         return files
    126

  locals:
    self = <rag.agents.unstructured.unstructured.UnstructuredPip… object at 0x7f1d83b82f80>
    temp_input_dir = '/tmp/tmpf7ym66qi/data/pdf'
    temp_output_dir = '/tmp/tmpf7ym66qi/data/json'

/home/testvmadmin/main/sparrow/sparrow-ml/llm/rag/agents/unstructured/unstructured.py:171 in process_local

    168             print(output.decode())
    169         else:
    170             print('Command failed. Error:')
  ❱ 171             print(error.decode())
    172
    173     def get_result_files(self, folder_path) -> List[Dict]:
    174         file_list = []

  locals:
    command = [
        'unstructured-ingest',
        'local',
        '--input-path',
        '/tmp/tmpf7ym66qi/data/pdf',
        '--output-dir',
        '/tmp/tmpf7ym66qi/data/json',
        '--num-processes',
        '2',
        '--recursive',
        '--verbose'
    ]
    error = None
    input_path = '/tmp/tmpf7ym66qi/data/pdf'
    num_processes = 2
    output = b''
    output_dir = '/tmp/tmpf7ym66qi/data/json'
    process = <Popen: returncode: 1 args: ['unstructured-ingest', 'local', '--input-path',...>
    self = <rag.agents.unstructured.unstructured.UnstructuredPipel… object at 0x7f1d83b82f80>

AttributeError: 'NoneType' object has no attribute 'decode'
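
A side note on the final AttributeError: the locals show error = None and output = b'' at the point where error.decode() is called. That is consistent with the subprocess being started without stderr=subprocess.PIPE, in which case communicate() returns None for stderr. How process_local actually launches the command is not visible in the traceback, so the following is only a minimal sketch under that assumption; run_command is a hypothetical helper, not a function from sparrow.

```python
import subprocess
import sys


def run_command(command):
    """Run a command and return (returncode, stdout_text, stderr_text).

    Hypothetical helper for illustration only. Both streams are piped
    explicitly, and a None stream falls back to empty bytes before decoding.
    """
    process = subprocess.Popen(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,  # without this, communicate() yields None for stderr
    )
    output, error = process.communicate()
    return process.returncode, (output or b"").decode(), (error or b"").decode()


if __name__ == "__main__":
    # A child process that writes to stderr and exits with a non-zero status.
    child = [sys.executable, "-c", "import sys; sys.stderr.write('boom'); sys.exit(1)"]
    code, out, err = run_command(child)
    if code == 0:
        print(out)
    else:
        print("Command failed. Error:")
        print(err)
```

With stderr captured this way, the real ModuleNotFoundError would be printed by the error branch instead of being hidden behind a second exception.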

abaranovskis-redsamurai commented 4 months ago

Hey, I haven't seen this error, so I can't advise.

apariciojuan commented 2 months ago

I solved it by installing backoff 2.0, because backoff 1.11.1 doesn't have the _typing module.
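
To check whether a given environment is affected before running the pipeline, something like the sketch below may help. It only uses the standard library; has_submodule and installed_version are hypothetical helper names introduced here for illustration, not part of sparrow, unstructured, or backoff.

```python
import importlib.util
from importlib import metadata


def has_submodule(package, submodule):
    """Return True if `package.submodule` can be located in this environment."""
    try:
        return importlib.util.find_spec(f"{package}.{submodule}") is not None
    except ModuleNotFoundError:
        # Raised when the parent package itself is not installed.
        return False


def installed_version(distribution):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("backoff")
    if has_submodule("backoff", "_typing"):
        print(f"backoff {version} provides backoff._typing")
    else:
        print(f"backoff {version} lacks backoff._typing; a 2.x release may be needed")
```

In the log at the top of this issue, pip replaced backoff 2.2.1 with 1.11.1 just before the failure, so reinstalling a 2.x release (for example with pip install "backoff>=2.0") should presumably restore the import.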