Open infinitnet opened 1 week ago
The issue you're experiencing with the Doc Extractor only extracting the first sheet from an XLS file might be related to how the extraction process is implemented. The current implementation of the Doc Extractor uses the `pandas` library with the `xlrd` engine to read XLS files. It iterates over each sheet in the file, so it should theoretically handle multiple sheets [1].
However, if only the first sheet is being extracted, it could be due to a bug or a specific condition in your setup. There is error handling in place that raises a `TextExtractionError` if an exception occurs during the extraction process, which might provide insights if there are any underlying issues [2].
To further investigate, you might want to check the logs for any errors or exceptions that occur during the extraction process. Additionally, reviewing the code to ensure that the iteration over sheets is correctly implemented and not inadvertently skipped could be helpful.
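One thing worth ruling out while reviewing the code: `pandas.read_excel` defaults to `sheet_name=0`, which returns only the first sheet, whereas `sheet_name=None` returns every sheet as a dict keyed by sheet name. The sketch below is not Dify's actual code. It is a minimal standalone illustration of reading every sheet and joining them into one text output, and the function names `read_all_sheets` and `render_sheets` are hypothetical.

```python
from typing import Dict, List


def read_all_sheets(path: str) -> Dict[str, List[list]]:
    """Return {sheet name: rows} for every sheet in an XLS workbook."""
    import pandas as pd  # imported lazily; reading .xls also requires xlrd

    # sheet_name=None asks pandas for every sheet as {name: DataFrame}.
    # The default, sheet_name=0, returns only the first sheet.
    frames = pd.read_excel(path, sheet_name=None, engine="xlrd")
    return {
        name: [df.columns.tolist()] + df.values.tolist()
        for name, df in frames.items()
    }


def render_sheets(sheets: Dict[str, List[list]]) -> str:
    """Join all sheets into one text blob, one labelled section per sheet."""
    sections = []
    for name, rows in sheets.items():
        lines = [" | ".join(str(cell) for cell in row) for row in rows]
        sections.append(f"## {name}\n" + "\n".join(lines))
    return "\n\n".join(sections)
```

Running `render_sheets(read_all_sheets("workbook.xls"))` on the affected file (the path is a placeholder) and checking whether every tab name appears in the output would show whether the file itself reads correctly outside the Doc Extractor.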
Self Checks
Dify version
0.10.0
Cloud or Self Hosted
Self Hosted (Docker)
Steps to reproduce
Upload an XLS file that contains multiple sheets (tabs) and run the Doc Extractor.
✔️ Expected Behavior
Doc Extractor should extract all sheets (tabs) within the XLS.
❌ Actual Behavior
Doc Extractor only extracts the first sheet (tab) in the XLS file.