Confluence scanning: for large sites not all spaces/pages are scanned without error

tallandtree commented 1 month ago

I use a fork of your n0s1 code to scan our (large) confluence cloud instance. Thanks for that, it is very useful.

However, I found that not all spaces are scanned, even though no error message or timeout is reported. I only noticed because a test space I had added was missing from the report. The full scan took about 5 hours. My guess is that the connection gets closed at some point during the run and the client object ends up empty. I saw that you recently added error handling and did some refactoring. The strange thing is that we never got any errors, but I will adopt the error handling in any case. For now, I fixed the missing spaces by calling self._connect() in the method 'get_data' before every batch of spaces is fetched. There may be a better way, but this works for now.

    def set_config(self, config):
        # Keep the settings on the instance so _connect() can rebuild the client later.
        SERVER = config.get("server", "")
        EMAIL = config.get("email", "")
        TOKEN = config.get("token", "")
        LABEL_FALSE_POSITIVE = config.get("label_false_positive", "cict-no-secrets-confirmed")
        self._url = SERVER
        self._user = EMAIL
        self._password = TOKEN
        self.label_false_positive = LABEL_FALSE_POSITIVE
        self._connect()
        return self.is_connected()

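    def _connect(self):
        from atlassian import Confluence
        # Basic auth (email + API token) when a user is configured, token auth otherwise.
        if self._user:
            self._client = Confluence(url=self._url, username=self._user, password=self._password)
        else:
            # Use the stored settings; SERVER and TOKEN only exist inside set_config.
            self._client = Confluence(url=self._url, token=self._password)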

And in get_data:

    def get_data(self, include_comments=False, test=""):
        if not self._client:
            return None, None, None, None, None, None
        start = 0
        limit = 50

        finished = False
        while not finished:
            logging.info(f"Spaces batch: {start} - {start+limit}")
            # reconnect for every batch
            self._connect()
            if not test:
                res = self._client.get_all_spaces(
                    start=start, limit=limit, expand="history"
                )
                start += limit
                spaces = res.get("results", [])
            else:
                key = test
                res = self._client.get_space(key, expand="history")
                finished = True
                spaces = [res]

Because the full scan takes so long, I also added a 'test' parameter that restricts the scan to a single space.
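The reconnect-per-batch idea can be sketched on its own, without a real Confluence instance. This is only a minimal illustration: `iter_space_batches`, `connect`, and `FakeConfluence` are hypothetical names, and stopping on an empty or short batch is an assumption about how the last page looks, not something taken from n0s1.

```python
from typing import Any, Callable, Dict, Iterator, List


def iter_space_batches(connect: Callable[[], Any], limit: int = 50) -> Iterator[List[Dict[str, Any]]]:
    """Yield batches of spaces, building a fresh client before every request."""
    start = 0
    while True:
        client = connect()  # reconnect for every batch
        res = client.get_all_spaces(start=start, limit=limit, expand="history")
        spaces = res.get("results", [])
        if not spaces:
            break
        yield spaces
        if len(spaces) < limit:
            # Assumption: a short batch means this was the last page.
            break
        start += limit


class FakeConfluence:
    """Hypothetical stand-in for the real client, used only for this sketch."""

    def __init__(self, total: int) -> None:
        self._keys = [f"SPACE{i}" for i in range(total)]

    def get_all_spaces(self, start: int = 0, limit: int = 50, expand: Any = None) -> Dict[str, Any]:
        return {"results": [{"key": k} for k in self._keys[start:start + limit]]}


connections: List[FakeConfluence] = []


def connect() -> FakeConfluence:
    client = FakeConfluence(total=120)
    connections.append(client)
    return client


batches = list(iter_space_batches(connect, limit=50))
batch_sizes = [len(b) for b in batches]
```

With 120 fake spaces and a limit of 50, the generator makes three requests and creates one client per request, which is the behaviour the workaround relies on.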

In case it is of interest: another improvement for our use case is a change to the generic-api-key rule in config.yaml. We got tons of false positives because this regex matched the Confluence user macro and link macro in combination with 'key'. The regex below therefore starts with a negative lookbehind for those cases.

  - id: generic-api-key
    description: Generic API Key
    regex: >-
      (?i)(?<!ri:user|CDATA\[\<add )(?:key|api|token|secret|client|passwd|password|auth|access)(?:[0-9a-z\-_\t
      .]{0,20})(?:[\s|']|[\s|"]){0,3}(?:=|>|:{1,3}=|\|\|:|<=|=>|:|\?=)(?:'|\"|\s|=|\x60){0,5}([0-9a-z\-_.=]{10,150})(?:['|\"|\n|\r|\s|\x60|;]|$)
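The effect of the lookbehind can be tried out with a much smaller pattern. The sketch below is a simplified stand-in, not the rule above: the keyword list is shortened, the sample strings are made up, and the two exclusions are written as two chained lookbehinds because Python's built-in `re` module only accepts fixed-width lookbehinds. `PATTERN` and `find_secret` are hypothetical names.

```python
import re

# Simplified stand-in for the generic-api-key rule (NOT the full pattern).
# Each lookbehind has a fixed width, which Python's `re` module requires.
PATTERN = re.compile(
    r"""(?i)(?<!ri:user)(?<!CDATA\[<add )"""
    r"""(?:key|token|secret)[\w.\-]{0,20}\s*[=:]\s*["']?"""
    r"""([0-9a-z\-_.=]{10,150})"""
)


def find_secret(text):
    """Return the captured value, or None when nothing matches."""
    match = PATTERN.search(text)
    return match.group(1) if match else None


# Made-up samples: a user macro, a CDATA "add key" snippet, and a real-looking key.
user_macro = '<ri:user ri:userkey="8a7f808a5c1d4e2f9b3a" />'
cdata_add = '<![CDATA[<add key="0123456789abcdef" />'
real_key = 'api_key = "abcd1234efgh5678"'

results = [find_secret(s) for s in (user_macro, cdata_add, real_key)]
```

Only the last sample is reported; the first two are skipped because 'key' is directly preceded by one of the excluded prefixes.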

We also added a method that skips a page when it carries a label marking it as a false positive, for example because the secret found is only meant as an example. Users can add that specific label to the page themselves.

    def is_false_positive(self, page_id):
        # A page counts as a false positive when it carries the configured label.
        labels_json = self._client.get_page_labels(page_id)
        labels = labels_json.get("results", [])
        for label in labels:
            if label["name"] == self.label_false_positive:
                logging.info(f"Page {page_id} is a false positive due to label {label['name']}")
                return True
        return False

And in the method get_data:

                        for p in pages:
                            comments = []
                            title = p.get("title", "")
                            page_id = p.get("id", "")
                            if self.is_false_positive(page_id):
                                continue
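To check the label logic without hitting Confluence, the same test can be run against a stub. This is a sketch under assumptions: `has_label` and `StubClient` are hypothetical names, and the response shape (a "results" list of dicts with a "name" field) is taken from the method above.

```python
from typing import Any, Dict, List


def has_label(client: Any, page_id: str, wanted: str) -> bool:
    """Return True when the page carries the given label."""
    labels = client.get_page_labels(page_id).get("results", [])
    return any(label.get("name") == wanted for label in labels)


class StubClient:
    """Hypothetical stand-in that serves labels from a dict instead of the API."""

    def __init__(self, labels_by_page: Dict[str, List[str]]) -> None:
        self._labels_by_page = labels_by_page

    def get_page_labels(self, page_id: str) -> Dict[str, Any]:
        names = self._labels_by_page.get(page_id, [])
        return {"results": [{"prefix": "global", "name": n} for n in names]}


stub = StubClient({
    "1001": ["cict-no-secrets-confirmed", "howto"],
    "1002": ["howto"],
})

# Pages that would still be scanned after the label filter.
to_scan = [p for p in ("1001", "1002", "1003")
           if not has_label(stub, p, "cict-no-secrets-confirmed")]
```

Note that this check costs one extra request per page, which may matter on a scan that already takes hours.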

In any case, thanks for your code. Hope my comments are useful. Kind regards, Mariska

blupants commented 3 weeks ago

Thank you for reporting the issue and for the proposed enhancement. I will add it to the next release. Did you have the chance to test your enhancements on top of the latest main branch? Does it fix your bug, or are you still having issues?

Apologies for the late response. I am back to business now, and I should be way more responsive from now on.

tallandtree commented 3 weeks ago

Hi, no problem. I have not had time yet to test your latest version; I have planned that for the first week of September. With the reconnect I implemented it works in any case, but I will let you know the results once I have tested with your latest version.

spark1security / n0s1

Confluence scanning: for large sites not all spaces/pages are scanned without error #26