shaunthecomputerscientist / EDA-GPT

Automated Data Analysis leveraging llms
MIT License
147 stars, 30 forks

Error: Could Not Determine CSV Delimiter During PDF File Processing #5

Open Yasiyass opened 2 weeks ago

Yasiyass commented 2 weeks ago

Hello! I'm encountering an error with the unstructured_Analyzer class in the EDA_GPT.py script when uploading a PDF file. The script attempts to detect and process any structured data (e.g., tables) within the PDF, but it fails during the delimiter detection phase, throwing the following error:

EDA-GPT/lib/python3.10/csv.py", line 187, in sniff
    raise Error("Could not determine delimiter")

The script fails to detect the delimiter for the extracted data, resulting in an error that halts further processing.

../miniforge3/envs/EDA-GPT/lib/python3.10/site-packages/streamlit/runtime/scriptrunner/exec_code.py", line 85, in exec_func_with_error_handling
    result = func()
../miniforge3/envs/EDA-GPT/lib/python3.10/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 576, in code_to_exec
    exec(code, module.__dict__)
../EDA-GPT-main/pages/EDA_GPT.py", line 40, in <module>
    st.session_state.unstructured_analyzer.run()
../EDA-GPT-main/pages/src/unstructured_data.py", line 534, in run
    self.workflow()
../EDA-GPT-main/pages/src/unstructured_data.py", line 484, in workflow
    st.session_state.vectorstoreretriever=self._vstore_embeddings(uploaded_files=files)
..miniforge3/envs/EDA-GPT/lib/python3.10/site-packages/streamlit/runtime/caching/cache_utils.py", line 168, in wrapper
    return cached_func(*args, **kwargs)
../miniforge3/envs/EDA-GPT/lib/python3.10/site-packages/streamlit/runtime/caching/cache_utils.py", line 197, in __call__
    return self._get_or_create_cached_value(args, kwargs)
../miniforge3/envs/EDA-GPT/lib/python3.10/site-packages/streamlit/runtime/caching/cache_utils.py", line 224, in _get_or_create_cached_value
    return self._handle_cache_miss(cache, value_key, func_args, func_kwargs)
../miniforge3/envs/EDA-GPT/lib/python3.10/site-packages/streamlit/runtime/caching/cache_utils.py", line 280, in _handle_cache_miss
    computed_value = self._info.func(*func_args, **func_kwargs)
../EDA-GPT-main/pages/src/unstructured_data.py", line 161, in _vstore_embeddings
    st.session_state.vectorstoreretriever=st.session_state.vector_store.makevectorembeddings(embedding_num=st.session_state.embeddings)
../EDA-GPT-main/pages/src/vstore.py", line 69, in makevectorembeddings
    self.data = merged_data_loader.load()
../EDA-GPT/lib/python3.10/site-packages/langchain_core/document_loaders/base.py", line 29, in load
    return list(self.lazy_load())
../miniforge3/envs/EDA-GPT/lib/python3.10/site-packages/langchain_community/document_loaders/merge.py", line 23, in lazy_load
    for document in data:
../miniforge3/envs/EDA-GPT/lib/python3.10/site-packages/langchain_community/document_loaders/directory.py", line 182, in lazy_load
    yield from self._lazy_load_file(i, p, pbar)
../envs/EDA-GPT/lib/python3.10/site-packages/langchain_community/document_loaders/directory.py", line 220, in _lazy_load_file
    raise e
../miniforge3/envs/EDA-GPT/lib/python3.10/site-packages/langchain_community/document_loaders/directory.py", line 210, in _lazy_load_file
    for subdoc in loader.lazy_load():
../miniforge3/envs/EDA-GPT/lib/python3.10/site-packages/langchain_community/document_loaders/unstructured.py", line 88, in lazy_load
    elements = self._get_elements()
../miniforge3/envs/EDA-GPT/lib/python3.10/site-packages/langchain_community/document_loaders/unstructured.py", line 180, in _get_elements
    return partition(filename=self.file_path, **self.unstructured_kwargs)
../miniforge3/envs/EDA-GPT/lib/python3.10/site-packages/unstructured/partition/auto.py", line 524, in partition
    elements = _partition_csv(
../miniforge3/envs/EDA-GPT/lib/python3.10/site-packages/unstructured/documents/elements.py", line 587, in wrapper
    elements = func(*args, **kwargs)
../miniforge3/envs/EDA-GPT/lib/python3.10/site-packages/unstructured/file_utils/filetype.py", line 618, in wrapper
    elements = func(*args, **kwargs)
../miniforge3/envs/EDA-GPT/lib/python3.10/site-packages/unstructured/file_utils/filetype.py", line 582, in wrapper
    elements = func(*args, **kwargs)
../miniforge3/envs/EDA-GPT/lib/python3.10/site-packages/unstructured/chunking/dispatch.py", line 74, in wrapper
    elements = func(*args, **kwargs)
../miniforge3/envs/EDA-GPT/lib/python3.10/site-packages/unstructured/partition/csv.py", line 80, in partition_csv
    delimiter = get_delimiter(file_path=filename)
../miniforge3/envs/EDA-GPT/lib/python3.10/site-packages/unstructured/partition/csv.py", line 133, in get_delimiter
    return sniffer.sniff(data, delimiters=",;").delimiter
../miniforge3/envs/EDA-GPT/lib/python3.10/csv.py", line 187, in sniff
    raise Error("Could not determine delimiter")

I'm very interested to see how the other features for unstructured data work. Apart from that, your work is great! I'm trying my best to learn from it!

Thank You!

shaunthecomputerscientist commented 2 weeks ago

Yeah, it does that for a few PDFs. It tries to detect whether there are any tables in the data and, if so, builds a CSV table from them, so it needs a delimiter to split the columns. Sometimes it can't find one. It's a limitation of these libraries, but I'll try to fix it. Let me know if it works for other PDFs. I'm currently busy with another project, but I will start working on this again. Post all your complaints and I will slowly try to fix them.
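For illustration, the failing call in the traceback is `sniffer.sniff(data, delimiters=",;")`, which raises `csv.Error` when neither candidate separates the sample consistently. A minimal sketch of a tolerant wrapper follows, assuming that falling back to a comma is acceptable for the extracted tables; the function name `detect_delimiter` and the `default` parameter are hypothetical and are not part of EDA-GPT or `unstructured`.

```python
import csv


def detect_delimiter(sample: str, candidates: str = ",;", default: str = ",") -> str:
    """Return the sniffed delimiter, or `default` when sniffing fails.

    Hypothetical helper: it mirrors the sniff call shown in the traceback
    but catches csv.Error instead of letting it halt processing.
    """
    try:
        return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
    except csv.Error:
        # Raised with "Could not determine delimiter" when no candidate fits.
        return default
```

Whether a silent fallback is the right behaviour depends on the pipeline; skipping the file and logging a warning would be an equally reasonable choice.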


Yasiyass commented 2 weeks ago

Thank you!

FYI, I used your own PDF file from your test.data folder. It didn't give the error, but the table it extracted contained only empty commas.

Have you heard of Unstructured IO for extracting tables and pictures from PDFs? Maybe it would be easier?
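As a sketch of how the "only empty commas" output mentioned above could be detected before it is passed on, the check below treats a table as empty when every cell is blank. The helper name `is_empty_table` is hypothetical, and it assumes the extracted table is available as CSV text.

```python
import csv
import io


def is_empty_table(csv_text: str, delimiter: str = ",") -> bool:
    """Return True when the CSV text has no non-blank cell.

    Hypothetical helper: a table that is all delimiters (for example
    ",,,") carries no data and can be skipped.
    """
    rows = csv.reader(io.StringIO(csv_text), delimiter=delimiter)
    return not any(cell.strip() for row in rows for cell in row)
```

A caller could use this to drop empty extractions instead of embedding them.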

shaunthecomputerscientist commented 2 weeks ago

Yeah, I experimented with tabula, unstructured, and other tools such as pdfplumber. I once saw pdfplumber and pikepdf performing well. But yeah, tell me your observations. I think some of my PDFs contain tables and others don't. Can you check which ones did not produce tables?
