run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Bug]: Keep getting 'Current user not permitted to use Confluence' but able to connect with the Confluence SDK #14836

Closed r13i closed 1 month ago

r13i commented 1 month ago

Bug Description

I am trying to fetch Confluence documents using API tokens, and the fetch works fine using the Atlassian Confluence SDK atlassian-python-api:

from atlassian import Confluence

confluence = Confluence(
    url='https://example.atlassian.net/',
    username='user@example.com',
    password=CONFLUENCE_API_TOKEN,
)

pages = confluence.get_all_pages_from_space(
    'MY_SPACE',
    start=0,
    limit=50,
    expand="body.export_view.value",
    content_type="page",
)

print(len(pages))
# 50

However, when I pass the same credentials to the ConfluenceReader from LlamaIndex, which maps them behind the scenes to the exact same parameters in the Atlassian SDK:

from llama_index.readers.confluence import ConfluenceReader

reader = ConfluenceReader(
    base_url='https://example.atlassian.net/wiki',
    user_name='user@example.com',
    password=CONFLUENCE_API_TOKEN,
)

documents = reader.load_data(
    space_key='MY_SPACE',
    include_attachments=False,
)

I get HTTPError: Current user not permitted to use Confluence.

Version:

pip freeze | grep llama-index
llama-index==0.10.55
llama-index-agent-openai==0.2.8
llama-index-cli==0.1.12
llama-index-core==0.10.55
llama-index-embeddings-openai==0.1.10
llama-index-indices-managed-llama-cloud==0.2.5
llama-index-legacy==0.9.48
llama-index-llms-openai==0.1.25
llama-index-multi-modal-llms-openai==0.1.7
llama-index-program-openai==0.1.6
llama-index-question-gen-openai==0.1.3
llama-index-readers-confluence==0.1.6
llama-index-readers-file==0.1.30
llama-index-readers-llama-parse==0.1.6

Version

0.10.55

Steps to Reproduce

See description above

Relevant Logs/Tracebacks

---------------------------------------------------------------------------
HTTPError                                 Traceback (most recent call last)
Cell In[10], line 1
----> 1 documents = reader.load_data(
      2     space_key='MY_SPACE',
      3     # cql='type=page and space=MY_SPACE',
      4     include_attachments=False,
      5     max_num_results=1
      6 )

File ~/playground/llm-gettingstarted/venv/lib/python3.11/site-packages/llama_index/readers/confluence/base.py:174, in ConfluenceReader.load_data(self, space_key, page_ids, page_status, label, cql, include_attachments, include_children, start, cursor, limit, max_num_results)
    171 pages: List = []
    172 if space_key:
    173     pages.extend(
--> 174         self._get_data_with_paging(
    175             self.confluence.get_all_pages_from_space,
    176             start=start,
    177             max_num_results=max_num_results,
    178             space=space_key,
    179             status=page_status,
    180             expand="body.export_view.value",
    181             content_type="page",
    182         )
    183     )
    184 elif label:
    185     pages.extend(
    186         self._get_cql_data_with_paging(
    187             start=start,
   (...)
    192         )
    193     )

File ~/playground/llm-gettingstarted/venv/lib/python3.11/site-packages/llama_index/readers/confluence/base.py:265, in ConfluenceReader._get_data_with_paging(self, paged_function, start, max_num_results, **kwargs)
    263 ret = []
    264 while True:
--> 265     results = self._get_data_with_retry(
    266         paged_function, start=start, limit=max_num_remaining, **kwargs
    267     )
    268     ret.extend(results)
    269     if (
    270         len(results) == 0
    271         or max_num_results is not None
    272         and len(results) >= max_num_remaining
    273     ):

File ~/playground/llm-gettingstarted/venv/lib/python3.11/site-packages/retrying.py:56, in retry.<locals>.wrap.<locals>.wrapped_f(*args, **kw)
     54 @six.wraps(f)
     55 def wrapped_f(*args, **kw):
---> 56     return Retrying(*dargs, **dkw).call(f, *args, **kw)

File ~/playground/llm-gettingstarted/venv/lib/python3.11/site-packages/retrying.py:266, in Retrying.call(self, fn, *args, **kwargs)
    263 if self.stop(attempt_number, delay_since_first_attempt_ms):
    264     if not self._wrap_exception and attempt.has_exception:
    265         # get() on an attempt with an exception should cause it to be raised, but raise just in case
--> 266         raise attempt.get()
    267     else:
    268         raise RetryError(attempt)

File ~/playground/llm-gettingstarted/venv/lib/python3.11/site-packages/retrying.py:301, in Attempt.get(self, wrap_exception)
    299         raise RetryError(self)
    300     else:
--> 301         six.reraise(self.value[0], self.value[1], self.value[2])
    302 else:
    303     return self.value

File ~/playground/llm-gettingstarted/venv/lib/python3.11/site-packages/six.py:719, in reraise(tp, value, tb)
    717     if value.__traceback__ is not tb:
    718         raise value.with_traceback(tb)
--> 719     raise value
    720 finally:
    721     value = None

File ~/playground/llm-gettingstarted/venv/lib/python3.11/site-packages/retrying.py:251, in Retrying.call(self, fn, *args, **kwargs)
    248     self._before_attempts(attempt_number)
    250 try:
--> 251     attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
    252 except:
    253     tb = sys.exc_info()

File ~/playground/llm-gettingstarted/venv/lib/python3.11/site-packages/llama_index/readers/confluence/base.py:332, in ConfluenceReader._get_data_with_retry(self, function, **kwargs)
    330 @retry(stop_max_attempt_number=1, wait_fixed=4)
    331 def _get_data_with_retry(self, function, **kwargs):
--> 332     return function(**kwargs)

File ~/playground/llm-gettingstarted/venv/lib/python3.11/site-packages/atlassian/confluence.py:630, in Confluence.get_all_pages_from_space(self, space, start, limit, status, expand, content_type)
    605 def get_all_pages_from_space(
    606     self,
    607     space,
   (...)
    612     content_type="page",
    613 ):
    614     """
    615     Get all pages from space
    616 
   (...)
    628     :return:
    629     """
--> 630     return self.get_all_pages_from_space_raw(
    631         space=space, start=start, limit=limit, status=status, expand=expand, content_type=content_type
    632     ).get("results")

File ~/playground/llm-gettingstarted/venv/lib/python3.11/site-packages/atlassian/confluence.py:593, in Confluence.get_all_pages_from_space_raw(self, space, start, limit, status, expand, content_type)
    590     params["type"] = content_type
    592 try:
--> 593     response = self.get(url, params=params)
    594 except HTTPError as e:
    595     if e.response.status_code == 404:

File ~/playground/llm-gettingstarted/venv/lib/python3.11/site-packages/atlassian/rest_client.py:341, in AtlassianRestAPI.get(self, path, data, flags, params, headers, not_json_response, trailing, absolute, advanced_mode)
    316 def get(
    317     self,
    318     path,
   (...)
    326     advanced_mode=False,
    327 ):
    328     """
    329     Get request based on the python-requests module. You can override headers, and also, get not json response
    330     :param path:
   (...)
    339     :return:
    340     """
--> 341     response = self.request(
    342         "GET",
    343         path=path,
    344         flags=flags,
    345         params=params,
    346         data=data,
    347         headers=headers,
    348         trailing=trailing,
    349         absolute=absolute,
    350         advanced_mode=advanced_mode,
    351     )
    352     if self.advanced_mode or advanced_mode:
    353         return response

File ~/playground/llm-gettingstarted/venv/lib/python3.11/site-packages/atlassian/rest_client.py:313, in AtlassianRestAPI.request(self, method, path, data, json, flags, params, headers, files, trailing, absolute, advanced_mode)
    310 if self.advanced_mode or advanced_mode:
    311     return response
--> 313 self.raise_for_status(response)
    314 return response

File ~/playground/llm-gettingstarted/venv/lib/python3.11/site-packages/atlassian/confluence.py:3336, in Confluence.raise_for_status(self, response)
   3334     response.raise_for_status()
   3335 else:
-> 3336     raise HTTPError(error_msg, response=response)

HTTPError: Current user not permitted to use Confluence

dosubot[bot] commented 1 month ago

Hey there, @r13i! I'm here to help you with any bugs, questions, or contributions. Let's tackle this together while we wait for a human maintainer to join the conversation.

The ConfluenceReader from LlamaIndex raises "HTTPError: Current user not permitted to use Confluence" while the Atlassian Confluence SDK works with the same parameters. This could be due to differences in how the two implementations handle authentication and permissions.

In the ConfluenceReader class, the authentication can be done using OAuth2, API token, or basic authentication (username and password). If the parameters provided do not match the expected format or if there are missing permissions for the user, the ConfluenceReader might raise an HTTPError.

Here is a snippet from the ConfluenceReader class that shows the initialization and authentication process:

def __init__(
    self,
    base_url: str = None,
    oauth2: Optional[Dict] = None,
    cloud: bool = True,
    api_token: Optional[str] = None,
    user_name: Optional[str] = None,
    password: Optional[str] = None,
) -> None:
    if base_url is None:
        raise ValueError("Must provide `base_url`")

    self.base_url = base_url

    try:
        from atlassian import Confluence
    except ImportError:
        raise ImportError(
            "`atlassian` package not found, please run `pip install"
            " atlassian-python-api`"
        )
    self.confluence: Confluence = None
    if oauth2:
        self.confluence = Confluence(url=base_url, oauth2=oauth2, cloud=cloud)
    else:
        api_token = api_token or os.getenv(CONFLUENCE_API_TOKEN)
        if api_token is not None:
            self.confluence = Confluence(url=base_url, token=api_token, cloud=cloud)
        else:
            user_name = user_name or os.getenv(CONFLUENCE_USERNAME)
            if user_name is None:
                raise ValueError(
                    "Must set environment variable `CONFLUENCE_USERNAME` if oauth,"
                    " oauth2, or `CONFLUENCE_API_TOKEN` are not provided."
                )
            password = password or os.getenv(CONFLUENCE_PASSWORD)
            if password is None:
                raise ValueError(
                    "Must set environment variable `CONFLUENCE_PASSWORD` if oauth,"
                    " oauth2, or `CONFLUENCE_API_TOKEN` are not provided."
                )
            self.confluence = Confluence(
                url=base_url, username=user_name, password=password, cloud=cloud
            )

Ensure that the parameters provided for authentication (OAuth2, API token, username, and password) are correct and that the user has the necessary permissions to access the Confluence instance. Additionally, verify that the base_url is correctly formatted and includes the /wiki suffix if required.

If the Atlassian Confluence SDK works fine with the same parameters, it might be worth checking if there are any differences in the permissions or scopes granted to the user when using the SDK versus the ConfluenceReader.

According to the Llama Index documentation, the base_url must end with /wiki, and the CONFLUENCE_PASSWORD should be an API token obtained from the Atlassian profile security settings. Ensure that these details are correctly configured:

The user needs to specify the base URL for a Confluence instance to initialize the ConfluenceReader - base URL needs to end with `/wiki`. The user can optionally specify OAuth 2.0 credentials to authenticate with the Confluence instance. If no credentials are specified, the loader will look for `CONFLUENCE_API_TOKEN` or `CONFLUENCE_USERNAME`/`CONFLUENCE_PASSWORD` environment variables to proceed with basic authentication. Keep in mind `CONFLUENCE_PASSWORD` is not your actual password, but an API Token obtained here: https://id.atlassian.com/manage-profile/security/api-tokens.

Make sure to follow these guidelines to avoid the HTTPError [1][2].


skvrd commented 1 month ago

Checked locally, the code

from llama_index.readers.confluence import ConfluenceReader

reader = ConfluenceReader(
    base_url='https://example.atlassian.net/wiki',
    user_name='user@example.com',
    password=CONFLUENCE_API_TOKEN,
)

documents = reader.load_data(
    space_key='MY_SPACE',
    include_attachments=False,
)

works just fine.

Are you sure you are using the right token in the second example?

r13i commented 1 month ago

@skvrd thanks for checking. Yes, I'm using the exact same values (same context in a Jupyter notebook) as I shared in the bug description.

r13i commented 1 month ago

@skvrd, also the same credentials work fine with LangChain's Confluence loader:

from langchain_community.document_loaders import ConfluenceLoader

loader = ConfluenceLoader(
    url='https://example.atlassian.net/wiki',
    username='user@example.com',
    api_key=CONFLUENCE_API_TOKEN,
)

documents = loader.load(
    space_key='MY_SPACE',
    include_attachments=False,
    max_pages=50
)

print(len(documents))
# 50

skvrd commented 1 month ago

Very strange. I just checked the code again on my end, and it works fine. Here is the entire code:

from llama_index.readers.confluence import ConfluenceReader

token = "token goes here"

reader = ConfluenceReader(
    base_url="https://site-goes-here.atlassian.net/wiki",
    user_name="myemail@example.com",
    password=token,
)

documents = reader.load_data(
    space_key='SPACE_KEY_GOES_HERE',
    include_attachments=False,
)

print(documents)
# Prints fetched documents

What is the output of pip freeze | grep atlassian?

skvrd commented 1 month ago

Ok, I think I found the issue:

inside ConfluenceReader:

            api_token = api_token or os.getenv(CONFLUENCE_API_TOKEN)
            if api_token is not None:
                self.confluence = Confluence(url=base_url, token=api_token, cloud=cloud)

So basically, if the CONFLUENCE_API_TOKEN environment variable is set, the reader uses token auth, Confluence(url=base_url, token=api_token, cloud=cloud), and ignores the provided user_name and password.
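
To make the branch order easier to see, here is a minimal sketch of that selection logic with the network calls removed. The function name pick_auth_mode and the returned labels are hypothetical and exist only for illustration; the OAuth2 branch and the CONFLUENCE_USERNAME/CONFLUENCE_PASSWORD fallbacks are left out for brevity.

```python
import os
from typing import Optional

ENV_TOKEN = "CONFLUENCE_API_TOKEN"  # name of the environment variable the reader consults


def pick_auth_mode(
    api_token: Optional[str] = None,
    user_name: Optional[str] = None,
    password: Optional[str] = None,
) -> str:
    """Hypothetical stand-in for the reader's branch order (no network calls)."""
    # The environment variable is consulted before user_name/password are looked at.
    api_token = api_token or os.getenv(ENV_TOKEN)
    if api_token is not None:
        return "token"
    if user_name is not None and password is not None:
        return "basic"
    raise ValueError("no credentials provided")


# With the variable set, explicit user_name/password are never reached.
os.environ[ENV_TOKEN] = "dummy-token"
print(pick_auth_mode(user_name="user@example.com", password="dummy-token"))  # token

# Without it, the same call falls through to basic auth.
del os.environ[ENV_TOKEN]
print(pick_auth_mode(user_name="user@example.com", password="dummy-token"))  # basic
```

This would explain why the same code behaves differently on two machines: the result depends on whether the variable happens to be set in the calling process.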

r13i commented 1 month ago

Great find, @skvrd. Unsetting the environment variable actually solves it:

del os.environ['CONFLUENCE_API_TOKEN']

... but is this behaviour desirable? I would expect priority to be given to ALL the arguments provided by the user, with the logic falling back to the environment variables only when they are not provided.
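
As a rough sketch of the precedence I have in mind (not necessarily what any eventual fix looks like): explicit arguments are checked first, and the environment is consulted only afterwards. The helper name resolve_credentials and its dictionary return value are assumptions made for this example; the keys mirror the token, username, and password keyword arguments passed to Confluence(...) in the snippet above.

```python
import os
from typing import Dict, Optional


def resolve_credentials(
    api_token: Optional[str] = None,
    user_name: Optional[str] = None,
    password: Optional[str] = None,
) -> Dict[str, str]:
    """Hypothetical helper: prefer explicit arguments over the environment."""
    # 1. Anything passed explicitly wins.
    if api_token is not None:
        return {"token": api_token}
    if user_name is not None and password is not None:
        return {"username": user_name, "password": password}

    # 2. Only then fall back to environment variables.
    env_token = os.getenv("CONFLUENCE_API_TOKEN")
    if env_token is not None:
        return {"token": env_token}
    env_user = user_name or os.getenv("CONFLUENCE_USERNAME")
    env_password = password or os.getenv("CONFLUENCE_PASSWORD")
    if env_user is not None and env_password is not None:
        return {"username": env_user, "password": env_password}
    raise ValueError("No Confluence credentials provided via arguments or environment.")


os.environ["CONFLUENCE_API_TOKEN"] = "env-token"

# Explicit user_name/password are honoured even though the variable is set.
print(resolve_credentials(user_name="user@example.com", password="explicit-token"))
# {'username': 'user@example.com', 'password': 'explicit-token'}

# With no arguments, the environment variable is used as a fallback.
print(resolve_credentials())
# {'token': 'env-token'}
```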

r13i commented 1 month ago

Referring specifically to this snippet in the README:

If no credentials are specified, the loader will look for CONFLUENCE_API_TOKEN or CONFLUENCE_USERNAME/CONFLUENCE_PASSWORD environment variables to proceed with basic authentication.

r13i commented 1 month ago

@skvrd I'm proposing the following PR as a fix: https://github.com/run-llama/llama_index/pull/14905 cc @nerdai

r13i commented 1 month ago

Issue resolved in the release version 0.1.7 (https://github.com/run-llama/llama_index/pull/14905).