Ancestor lookup fails on Confluence Cloud

twoodhouse commented 1 year ago

Setting debug logs for my request, I can see that lookup by the ancestor query param does not work correctly with confluence cloud.

DEBUG:urllib3.connectionpool:https://hostname:443 "GET /wiki/rest/api/content/search?cql=ancestor%3D227314662&limit=1000 HTTP/1.1" 200 None

No pages are returned. This means that this confluence builder cannot determine what pages actually exist under the parent page. Hence, this app treats all pages as new, and fails to delete pages in the cloud which have been removed on the source side.

Seems like there's a ticket about this in with Atlassian and they haven't done anything about it so far: https://jira.atlassian.com/browse/CONFSERVER-56163

This project should consider finding an alternative way to get pages under a parent page. Perhaps this: https://developer.atlassian.com/cloud/confluence/rest/v1/api-group-content---children-and-descendants/#api-wiki-rest-api-content-id-descendant-get

twoodhouse commented 1 year ago

Replacing the function "get_descendants" in publisher.py with the following seems to be working. If someone gets to implement this before me, feel free.

    def get_descendants(self, page_id):
        """
        generate a list of descendants

        Queries the configured Confluence instance for a set of descendants for
        the provided `page_id` or (if set to `None`) the configured space.

        Args:
            page_id: the ancestor to search on (if not `None`)

        Returns:
            the descendants
        """
        descendants = set()
        path = ''

        if page_id:
            path = f'content/{page_id}/descendant/page'
            search_fields = {}
        else:
            path = 'content/search'
            search_fields = {'cql': 'space="' + self.space_key +
                '" and type=page'}

        # Configure a larger limit value than the default (no provided
        # limit defaults to 25). This should reduce the number of queries
        # needed to fetch a complete descendants set (for larger sets).
        search_fields['limit'] = 1000

        rsp = self.rest_client.get(path, search_fields)
        idx = 0
        while rsp['size'] > 0:
            for result in rsp['results']:
                descendants.add(result['id'])
                self._name_cache[result['id']] = result['title']

            if rsp['size'] != rsp['limit']:
                break

            idx += int(rsp['limit'])
            sub_search_fields = dict(search_fields)
            sub_search_fields['start'] = idx
            rsp = self.rest_client.get(path, sub_search_fields)

        return descendants

jdknight commented 1 year ago

Querying for descendants in Confluence, both Cloud and non-Cloud variants, has not always provided consistent results. This includes both through the content/search and content/{id}/descendant[/page] calls where pages are either missing or deleted pages still exist.

I believe the original development for purging used content/{id}/descendant[/page], but after various testing, it appeared using content/search gave more consistent results -- granted, this was five years ago. I would be hesitant to switch the API without completely understanding why there are inconsistencies.

We do have a non-documented configuration confluence_adv_aggressive_search which has shown to provide more consistent results, by using additional API calls to check for descendants per page. If you do not mind, could you try it out in your environment:

confluence_adv_aggressive_search = True

Maybe we should just enable this capability by default to provide a more reliable user experience. And users which may have extremely large data sets that do not want want to perform additional API calls, we provide a documented configuration option to opt out.

jdknight commented 1 year ago

@twoodhouse, I forgot to mention, thanks for reporting issues with finding descendants with your environment. I took some time to test with various documentation sets with Confluence to see how "descendant finding" compared with using the page-specific content-descendants call (content/{id}/descendant[/page]) versus the search variant (content/search). It appears that the descendant call seemed to be working as expected, which restored some confidence in the API call for me. Although trying both variants, they both seemed to give expected results for finding descendants. IMHO, I feel both API calls have a certain risk to them, depending on how a Confluence instance managing tracking/caching descendants. It is hard to know the best approach to take for all environments. However, we for sure do not want to be in a state where a user cannot properly cleanup descendants without a workaround.

To deal with all this, this extension has now introduced a confluence_cleanup_search_mode option (to be available v2.1). This should allow users to override how this extension searches for descendants, in the odd case where Confluence is not playing as expected. In addition, this change also changes the default way we search for descendants. We have moved back to using the direct descendant search (content/{id}/descendant[/page]), since it should be the "right" way to look for descendants. Hopefully next release, users using cleanup operations do not exhibit issues. If they do, they will have the option to tweak their configurations to use the search mode instead. And if it appears to be more common to have missing descendants in the next release, this extension will most likely search the default from direct to search for a following release.

jdknight commented 1 year ago

sphinx-contrib / confluencebuilder

Ancestor lookup fails on Confluence Cloud #758