run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Bug]: BB loader in llama hub has 3 critical issues that make it unusable in its current form. #15158

Closed pitchdarkdata closed 3 months ago

pitchdarkdata commented 3 months ago

Bug Description

The Bitbucket (BB) loader in LlamaHub has three critical bugs; I feel it cannot be used in its current form.

- Issue 1: The `content_url` has an extra `/` in the path. The resulting error is captured in the traceback section. This typo makes the loader unusable.
- Issue 2: Files without extensions are not handled. The BB loader exits when it parses a file such as a Dockerfile that has no extension.
- Issue 3: Files with no content also cause the pipeline to exit.

Version

llama-index==0.10.59

Steps to Reproduce

  1. Fill in the values in the code below and run it with the current BB loader.
  2. You will hit the three issues described above.

```python
import os
import logging
import sys

from llama_index.llms.azure_openai import AzureOpenAI
from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core import ServiceContext, set_global_service_context
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings
from llama_index.core import download_loader
from llama_index.readers.bitbucket import BitbucketReader

os.environ["OPENAI_API_KEY"] = "fill in here"
os.environ["AZURE_OPENAI_ENDPOINT"] = "https://ai-foundation-api.app/ai-foundation/chat-ai/gpt4"
api_key = ""
azure_endpoint = "https://ai-foundation-api.app/ai-foundation/chat-ai/gpt4"
api_version = "2023-05-15"
os.environ["BITBUCKET_USERNAME"] = "fill in here"
os.environ["BITBUCKET_API_KEY"] = "fill in here"
base_url = "fill in here"
project_key = "fill in here"
repo = "fill in here"

llm = AzureOpenAI(
    model="gpt-4",
    deployment_name="my-custom-llm",
    api_key=api_key,
    azure_endpoint=azure_endpoint,
    api_version="2023-05-15",
)
service_context = ServiceContext.from_defaults(llm=llm, chunk_size=800, chunk_overlap=20)
embed_model_bge = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
text_embeddings = embed_model_bge.get_text_embedding("AI is awesome!")

Settings.llm = llm
Settings.embed_model = embed_model_bge

loader = BitbucketReader(
    base_url=base_url,
    project_key=project_key,
    branch="refs/heads/master",
    repository=repo,
    extensions_to_skip=["json"],
)
documents = loader.load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
```

Relevant Logs/Tracebacks

Issue 1: `content_url` is defined incorrectly on line 73 of `base.py`, while it is defined correctly on line 105. It is unclear how this ever worked. Traceback:
Traceback (most recent call last):
  File "/home/pn/ragbb.py", line 62, in <module>
    documents = loader.load_data()
                ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/llama_index/readers/bitbucket/base.py", line 139, in load_data
    self.load_all_file_paths(
  File "/usr/local/lib/python3.12/site-packages/llama_index/readers/bitbucket/base.py", line 94, in load_all_file_paths
    self.load_all_file_paths(
  File "/usr/local/lib/python3.12/site-packages/llama_index/readers/bitbucket/base.py", line 82, in load_all_file_paths
    raise ValueError(response["errors"])
ValueError: [{'context': None, 'message': 'The path "/bin" does not exist at revision "refs/heads/master"', 'exceptionName': 'com.atlassian.bitbucket.content.NoSuchPathException'}]
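The traceback above shows Bitbucket rejecting the path `"/bin"` because of the leading slash introduced by the extra `/` in `content_url`. A minimal sketch of the kind of fix (the function name and the exact REST path here are illustrative, not the loader's actual code):

```python
# Hypothetical sketch: the broken content_url joins the browse endpoint and
# the file path with a doubled "/", producing paths like ".../browse//bin"
# that Bitbucket reports as "The path \"/bin\" does not exist".
def build_content_url(base_url: str, project_key: str, repository: str, path: str) -> str:
    # Corrected form: exactly one slash before the file path.
    return (
        f"{base_url}/rest/api/latest/projects/{project_key}"
        f"/repos/{repository}/browse/{path.lstrip('/')}"
    )

url = build_content_url("https://bb.example.com", "PROJ", "repo", "/bin/run.sh")
```

Stripping any leading slash from the file path before joining avoids the doubled separator regardless of how the path is reported by the API.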

Issue 2: For files with no extension, such as Dockerfile, the implementation fails because it assumes every file entry has an `extension` key. Suggested fix: change the logic so that files without extensions are handled instead of raising a KeyError. Traceback:
Traceback (most recent call last):
  File "/home/pn/ragbb.py", line 62, in <module>
    documents = loader.load_data()
                ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/llama_index/readers/bitbucket/base.py", line 141, in load_data
    self.load_all_file_paths(
  File "/usr/local/lib/python3.12/site-packages/llama_index/readers/bitbucket/base.py", line 96, in load_all_file_paths
    self.load_all_file_paths(
  File "/usr/local/lib/python3.12/site-packages/llama_index/readers/bitbucket/base.py", line 88, in load_all_file_paths
    if value["path"]["extension"] not in self.extensions_to_skip and value["size"] > 0:
       ~~~~~~~~~~~~~^^^^^^^^^^^^^
KeyError: 'extension'
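The KeyError comes from indexing `value["path"]["extension"]` directly. A hedged sketch of the guard (the `should_load` helper and the sample entries are hypothetical, mimicking the shape of the Bitbucket directory-listing response):

```python
# Hypothetical sketch: use .get() so extensionless files (e.g. Dockerfile)
# do not raise KeyError; they simply have no extension to match against
# extensions_to_skip.
def should_load(value: dict, extensions_to_skip: list) -> bool:
    extension = value["path"].get("extension")  # None for extensionless files
    if extension is None:
        # Extensionless files cannot match the skip list; keep them if non-empty.
        return value["size"] > 0
    return extension not in extensions_to_skip and value["size"] > 0

dockerfile_entry = {"path": {"name": "Dockerfile"}, "size": 120}
json_entry = {"path": {"name": "data.json", "extension": "json"}, "size": 50}
```

With this guard, `dockerfile_entry` is loaded and `json_entry` is skipped, instead of the loader crashing on the first extensionless file.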

Issue 3: Empty files are not handled by the BB loader. Suggested fix: add a condition to gracefully skip files with no content instead of exiting. Traceback:
Traceback (most recent call last):
  File "/home/pn/ragbb.py", line 63, in <module>
    index = VectorStoreIndex.from_documents(documents)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/llama_index/core/indices/base.py", line 145, in from_documents
    return cls(
           ^^^^
  File "/usr/local/lib/python3.12/site-packages/llama_index/core/indices/vector_store/base.py", line 78, in __init__
    super().__init__(
  File "/usr/local/lib/python3.12/site-packages/llama_index/core/indices/base.py", line 94, in __init__
    index_struct = self.build_index_from_nodes(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/llama_index/core/indices/vector_store/base.py", line 309, in build_index_from_nodes
    raise ValueError(
ValueError: Cannot build index from nodes with no content. Please ensure all nodes have content.
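One way to avoid the failure above is to drop empty documents before building the index. A minimal sketch under that assumption (the `Document` dataclass here is a stand-in for llama_index's own `Document`, kept local so the snippet is self-contained):

```python
# Hypothetical sketch: filter out documents whose text is empty or
# whitespace-only, so VectorStoreIndex.from_documents is never handed
# a node with no content.
from dataclasses import dataclass


@dataclass
class Document:
    text: str


docs = [Document("real content"), Document(""), Document("   ")]
non_empty = [d for d in docs if d.text.strip()]
```

The same filter could equally live inside the loader itself, skipping zero-length file contents at fetch time rather than after `load_data()` returns.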
Output of `pip freeze`:
accelerate==0.33.0
aiohappyeyeballs==2.3.4
aiohttp==3.10.1
aiosignal==1.3.1
annotated-types==0.7.0
anyio==4.4.0
atlassian-python-api==3.41.14
attrs==24.1.0
azure-core==1.30.2
azure-identity==1.17.1
beautifulsoup4==4.12.3
certifi==2024.2.2
cffi==1.16.0
chardet==5.2.0
charset-normalizer==3.3.2
click==8.1.7
coloredlogs==15.0.1
cryptography==43.0.0
cssselect2==0.7.0
dataclasses-json==0.6.7
datasets==2.20.0
Deprecated==1.2.14
dill==0.3.8
dirtyjson==1.0.8
distlib==0.3.8
distro==1.9.0
docx2txt==0.8
filelock==3.14.0
flatbuffers==24.3.25
frozenlist==1.4.1
fsspec==2024.5.0
greenlet==3.0.3
h11==0.14.0
html2text==2020.1.16
httpcore==1.0.5
httpx==0.27.0
huggingface-hub==0.24.5
humanfriendly==10.0
idna==3.7
InstructorEmbedding==1.0.1
Jinja2==3.1.4
jmespath==1.0.1
joblib==1.4.2
llama-cloud==0.0.11
llama-index==0.10.59
llama-index-agent-openai==0.2.9
llama-index-cli==0.1.13
llama-index-core==0.10.59
llama-index-embeddings-azure-openai==0.1.11
llama-index-embeddings-huggingface==0.2.2
llama-index-embeddings-openai==0.1.11
llama-index-indices-managed-llama-cloud==0.2.7
llama-index-legacy==0.9.48
llama-index-llms-azure-openai==0.1.10
llama-index-llms-openai==0.1.27
llama-index-multi-modal-llms-openai==0.1.8
llama-index-program-openai==0.1.7
llama-index-question-gen-openai==0.1.3
llama-index-readers-bitbucket==0.1.3
llama-index-readers-confluence==0.1.7
llama-index-readers-file==0.1.32
llama-index-readers-llama-parse==0.1.6
llama-parse==0.4.9
lxml==5.2.2
MarkupSafe==2.1.5
marshmallow==3.21.3
minijinja==2.0.1
mpmath==1.3.0
msal==1.30.0
msal-extensions==1.2.0
multidict==6.0.5
multiprocess==0.70.16
mypy-extensions==1.0.0
nest-asyncio==1.6.0
networkx==3.3
nltk==3.8.1
numpy==1.26.4
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.6.20
nvidia-nvtx-cu12==12.1.105
oauthlib==3.2.2
onnx==1.16.2
onnxruntime==1.18.1
openai==1.39.0
optimum==1.21.2
packaging==24.1
pandas==2.2.2
pdf2image==1.17.0
pillow==10.4.0
pipenv==2023.12.1
platformdirs==4.2.2
portalocker==2.10.1
protobuf==5.27.3
psutil==6.0.0
pyarrow==17.0.0
pyarrow-hotfix==0.6
pycparser==2.22
pydantic==2.8.2
pydantic_core==2.20.1
PyJWT==2.9.0
pypdf==4.3.1
pytesseract==0.3.10
python-dateutil==2.9.0.post0
pytz==2024.1
PyYAML==6.0.1
regex==2024.7.24
reportlab==4.2.2
requests==2.32.3
requests-oauthlib==2.0.0
retrying==1.3.4
safetensors==0.4.4
scikit-learn==1.5.1
scipy==1.14.0
sentence-transformers==3.0.1
sentencepiece==0.2.0
setuptools==69.5.1
six==1.16.0
sniffio==1.3.1
soupsieve==2.5
SQLAlchemy==2.0.32
striprtf==0.0.26
svglib==1.5.1
sympy==1.13.1
tenacity==8.5.0
threadpoolctl==3.5.0
tiktoken==0.7.0
timm==1.0.8
tinycss2==1.3.0
tokenizers==0.19.1
torch==2.4.0
torchvision==0.19.0
tqdm==4.66.5
transformers==4.42.4
triton==3.0.0
typing-inspect==0.9.0
typing_extensions==4.12.2
tzdata==2024.1
urllib3==2.2.2
uv==0.2.4
virtualenv==20.26.2
webencodings==0.5.1
wheel==0.43.0
wrapt==1.16.0
xlrd==2.0.1
xxhash==3.4.1
yarl==1.9.4
logan-markewich commented 3 months ago

Feel free to open a PR 👍🏻

pitchdarkdata commented 3 months ago

Hi @logan-markewich, thanks for the invitation. I have the fix and want to share it. Can you share the procedure for raising a PR?

pitchdarkdata commented 3 months ago

Feel free to open a PR 👍🏻

I have raised a PR for this issue (#15158). Please review and provide feedback, if any.