langgenius / dify

Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production.
https://dify.ai

Doc Extractor only extracts first sheet from XLS #9661

Open infinitnet opened 1 week ago

infinitnet commented 1 week ago

Dify version

0.10.0

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

Upload an XLS file that contains multiple sheet tabs and run the Doc Extractor.

✔️ Expected Behavior

Doc Extractor should extract all sheets (tabs) within the XLS.

❌ Actual Behavior

Doc Extractor only extracts the first sheet (tab) in the XLS file.

dosubot[bot] commented 1 week ago

The Doc Extractor reads XLS files using the pandas library with the xlrd engine. The current implementation iterates over each sheet in the file, so in theory it should handle multiple sheets [1]. If only the first sheet is being extracted, the cause may lie in how that extraction step is implemented.

However, if only the first sheet is being extracted, it could be due to a bug or a specific condition in your setup. There is error handling in place that raises a TextExtractionError if an exception occurs during the extraction process, which might provide insights if there are any underlying issues [2].

To investigate further, check the logs for any errors or exceptions raised during extraction. It may also help to review the code and confirm that the loop over sheets is implemented correctly and is not being skipped.
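For comparison while reviewing, here is a minimal sketch of reading every sheet with pandas. It is not Dify's actual code: the function names `read_all_sheets` and `render_sheets` and the `## sheet name` labelling are assumptions made for illustration. It relies on `pd.read_excel(..., sheet_name=None)`, which returns a dict with one DataFrame per sheet, whereas the default `sheet_name=0` returns only the first sheet.

```python
from typing import Dict, List


def read_all_sheets(path: str) -> Dict[str, List[list]]:
    """Return {sheet_name: rows} for every sheet in the workbook at `path`."""
    # Imported lazily so render_sheets can be used without pandas installed.
    # Reading .xls files additionally requires the xlrd package.
    import pandas as pd

    # sheet_name=None yields a dict of DataFrames, one per sheet.
    # The default (sheet_name=0) would silently return only the first sheet.
    frames = pd.read_excel(path, sheet_name=None, header=None)
    return {
        name: df.dropna(how="all").fillna("").values.tolist()
        for name, df in frames.items()
    }


def render_sheets(sheets: Dict[str, List[list]]) -> str:
    """Join all sheets into one text blob, labelling each with its name."""
    parts = []
    for name, rows in sheets.items():
        lines = [" | ".join(str(cell) for cell in row) for row in rows]
        parts.append(f"## {name}\n" + "\n".join(lines))
    return "\n\n".join(parts)
```

Calling `render_sheets(read_all_sheets("workbook.xls"))` on a multi-sheet file (the file name is a placeholder) should produce one labelled section per tab. If the extractor's output lacks the later tabs, that suggests the sheet loop is not being reached.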
