run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Bug]: load pptx with error: tuple index out of range. Skipping... use mps device #17009

Open wuhongsheng opened 6 days ago

wuhongsheng commented 6 days ago

Bug Description

image

Version

0.12.0

Steps to Reproduce

from llama_index.core import SimpleDirectoryReader


def test_ppt():
    # Load a single PowerPoint file and print the resulting documents
    reader = SimpleDirectoryReader(input_files=["./temp/test.pptx"])
    documents = reader.load_data()
    print(documents)

Relevant Logs/Tracebacks

"n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "pad_token_id": 50256,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_version": "4.46.0",
  "use_cache": true,
  "vocab_size": 50257
}

/Users/whs/miniconda3/envs/agent/lib/python3.11/site-packages/transformers/models/vit/feature_extraction_vit.py:28: FutureWarning: The class ViTFeatureExtractor is deprecated and will be removed in version 5 of Transformers. Please use ViTImageProcessor instead.
  warnings.warn(
Failed to load file temp/test.pptx with error: tuple index out of range. Skipping...

dosubot[bot] commented 6 days ago

The "tuple index out of range" error raised when SimpleDirectoryReader loads a .pptx file in version 0.12.0 may stem from a bug or limitation in how the library handles PowerPoint files. I could not find any documentation or known issue covering this error for version 0.12.0 [1].

To troubleshoot this issue, you might want to check the following:

  1. Dependencies: Ensure that all required dependencies for handling .pptx files are correctly installed. The SimpleDirectoryReader class relies on libraries such as torch, transformers, python-pptx, and Pillow for processing PowerPoint files [2].

  2. File Integrity: Verify that the .pptx file is not corrupted and can be opened with other software like Microsoft PowerPoint or a compatible viewer.

  3. Library Version: Check whether a newer release of the library includes a fix for this issue.

  4. Error Handling: Implement additional error handling in your code to capture more detailed error messages or stack traces that might provide further insights into the problem.
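
A minimal sketch of steps 1 and 4, using only the standard library. The helper names `missing_modules` and `run_with_traceback` are hypothetical and not part of LlamaIndex, and `failing_loader` is a stand-in that fails with the same message as the report. The module names checked are the import names of the packages listed in step 1 (`pptx` for python-pptx, `PIL` for Pillow).

```python
import importlib.util
import traceback
from typing import Callable, List, Optional, Tuple, TypeVar

T = TypeVar("T")


def missing_modules(names: List[str]) -> List[str]:
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def run_with_traceback(loader: Callable[[], T]) -> Tuple[Optional[T], Optional[str]]:
    """Call loader(); return (result, None) on success or (None, traceback text) on failure."""
    try:
        return loader(), None
    except Exception:
        # format_exc() keeps the full stack, not just the one-line message
        return None, traceback.format_exc()


def failing_loader():
    # Stand-in for a reader call; indexing an empty tuple raises
    # "IndexError: tuple index out of range"
    return ()[0]


# Step 1: report which of the expected dependencies are not importable
print("Missing modules:", missing_modules(["torch", "transformers", "pptx", "PIL"]))

# Step 4: capture the full traceback instead of a one-line "Skipping..." message
result, tb = run_with_traceback(failing_loader)
if tb is not None:
    print(tb)
```

To apply this to the reported case, one option is to pass a callable that invokes the PowerPoint reader directly, for example `lambda: PptxReader().load_data(Path("./temp/test.pptx"))` with `PptxReader` imported from `llama_index.readers.file`. Calling the reader directly should let the underlying exception propagate with its stack trace, assuming the installed version exposes that class.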

If these steps do not resolve the issue, you may need to explore alternative methods for processing .pptx files or consult the library's documentation for any updates or community discussions related to this error.
