sphinx-contrib / confluencebuilder

Confluence Markup Builder Plugin for Sphinx
BSD 2-Clause "Simplified" License

Replay HTTP requests against an air-gapped Confluence server #823

Open oxbqkwwxfrqccwtg opened 1 year ago

oxbqkwwxfrqccwtg commented 1 year ago

Hi all,

I am currently facing a situation where I need to publish to an air-gapped Confluence server inside a virtualised, private environment (over Windows VDI). In addition, company policy forbids me from using or installing Python on the VDI. I can freeload a Perl executable that came bundled with Git for Windows, but there are no other scripting means besides Windows PowerShell.

So... I'm either thinking about hacking something in PowerShell or Perl.

Quick-n-Nasty Approach

  1. Figure out what data confluencebuilder pulls from the target server and what data ends up in the confluencebuilder output
  2. Write a small pass-through wrapper for all HTTP calls made against the bundled requests package (I'd use unittest.mock or something similar)
  3. Index all state-changing HTTP request objects (POST, PUT, etc.) and write a generic replay sequence
  4. Replay all HTTP requests against the air-gapped Confluence server
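
The steps above can be sketched roughly as follows. This is only an illustration under assumptions, not how confluencebuilder works internally: the `send(method, url, body)` transport signature, the `RequestRecorder` class and the `replay` helper are all hypothetical names, and a real wrapper would sit in front of whatever the extension actually uses to issue requests.

```python
import json
from typing import Any, Callable, Dict, List

# Hypothetical transport signature: send(method, url, body) -> response.
Send = Callable[[str, str, Any], Any]

# Only requests that change server state are worth replaying.
MUTATING = {"POST", "PUT", "DELETE", "PATCH"}


class RequestRecorder:
    """Pass-through wrapper that indexes state-changing requests."""

    def __init__(self, send: Send) -> None:
        self._send = send
        self.log: List[Dict[str, Any]] = []

    def __call__(self, method: str, url: str, body: Any = None) -> Any:
        if method.upper() in MUTATING:
            self.log.append(
                {"method": method.upper(), "url": url, "body": body}
            )
        # Always forward the call so the caller sees a normal response.
        return self._send(method, url, body)

    def dump(self) -> str:
        """Serialise the recorded sequence, in call order, as JSON."""
        return json.dumps(self.log, indent=2)


def replay(index: str, send: Send, base_url: str, new_base_url: str) -> int:
    """Re-issue every indexed request against another server."""
    entries = json.loads(index)
    for entry in entries:
        url = entry["url"].replace(base_url, new_base_url, 1)
        send(entry["method"], url, entry["body"])
    return len(entries)
```

A plain replay like this only swaps the base URL; values assigned by the server (such as page ids embedded in later requests) would still need to be patched at replay time.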

Quick-n-Dirty Approach

Basically just like the Quick-n-Nasty solution, except that I will rewrite the ConfluencePublisher class.

I just require this as an interim workaround, since I'll be getting access to proper CI/CD facilities soon.

My question

Was a scenario like the one I describe ever considered during development? I'd be interested in your thoughts.

It's probably not worth adding as a feature, but publishing replays could become one if the publishing task were split into building and publishing: the build task would output an index of HTTP request metadata and request bodies, and the publisher would execute that index.

jdknight commented 1 year ago

This topic was discussed in https://github.com/sphinx-contrib/confluencebuilder/issues/627 last year, if you would like to read up about the opinions/issues at the time.

Note that part of the hesitation outlined in those comments was due to a prospective implementation supporting Confluence's ADF format (in part to support the v2 editor style). That implementation would have required a multi-pass publishing design, otherwise Confluence would reject uploads. Since this extension was able to support the v2 editor while still using the storage format, publishing can stay as simple as it is in the current implementation. So, when considering the other issue's comments, it may now be somewhat easier to implement a separation between building and publishing.

While my first impression is that replay functionality should not be implemented directly in this extension, I do think it's a great idea for your use case. Isolated request data should be easily translatable to PowerShell (or whatever) scripts. While the suggested approach of wrapping the requests library should work, I wonder if it would be simpler if this extension supported an advanced configuration option or introduced extension hooks which allow a user to override either the ConfluencePublisher or Rest classes without needing a mock implementation (but I guess it doesn't really matter what is used; whatever works for this use case). This Confluence builder extension is always willing to create hooks or support overrides from the configuration for existing features, if power users want an easier way to perform custom tweaks.

The biggest thing to consider when looking at the separation between building and publishing is not so much the generated documentation set, but the dynamic values needed in some of the data requests. We could initially ignore some features like purging old pages or publishing only updated pages/attachments, and just focus on publishing an entire documentation/attachment set. One example to consider is the version number, which on new pages is set to one (or not provided). However, if a publish attempt is updating an existing document, this value must be set to the existing version number of the page plus one. This can make it hard if you are trying to do a dumb-replay publish: you may have to add some smarts to know when a request is an update and tweak the version value accordingly. Alternatively, you could also just crudely wipe the entire space and publish a new set with all versions being 1.
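
To make the version rule concrete, here is a minimal sketch of the "smarts" a replay script would need. The helper names `next_version` and `prepare_body` are hypothetical, and the payload shape (a `version` object holding a `number`) is an assumption about how a recorded page body might look, so treat it as an illustration only.

```python
import copy
from typing import Any, Dict, Optional


def next_version(existing: Optional[int]) -> int:
    """Return the version number to send for a page.

    ``existing`` is the version currently on the server, or ``None``
    when the page does not exist yet.
    """
    return 1 if existing is None else existing + 1


def prepare_body(
    recorded: Dict[str, Any], existing: Optional[int]
) -> Dict[str, Any]:
    """Copy a recorded page body and set its version for this publish."""
    body = copy.deepcopy(recorded)
    # Assumed payload shape: {"version": {"number": N}, ...}
    body.setdefault("version", {})["number"] = next_version(existing)
    return body
```

With the crude "wipe the space first" approach, `existing` is always `None` and every page is published as version 1.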

As for the Confluence Builder extension, my current opinion on supporting the separation of building and publishing is as follows. I am in favor of adding support for such a feature. I would not say it is a priority feature to get out by the next release, but it could be the next big feature we aim to bring into this extension. The following overview is how I would imagine such a design:

flowchart LR
    P --> S
    Tpp --> S
    E --> A
    A -.-> Tpp
    A -.-> I
    subgraph Sphinx/Confluence Builder
    C[Initialization] --> B
    C --> I
    B[Builder] --> P
    B --> E
    E[Export]
    I[Import] --> P
    P[Publisher]
    end
    A[Archive]
    subgraph Third Party
    Tpp[Publisher]
    end
    subgraph "Confluence"
    S[Space]
    end
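
The Export and Import boxes in the diagram could, for example, exchange a single archive file containing page bodies and a manifest. The following is a hedged sketch of that idea only; the archive layout (`manifest.json` plus `content/<n>.xml` entries) and the function names are assumptions, not a format defined by this extension.

```python
import io
import json
import zipfile
from typing import Dict


def export_archive(pages: Dict[str, str]) -> bytes:
    """Pack page bodies plus a manifest into an in-memory zip archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        manifest = {}
        for index, (title, content) in enumerate(pages.items()):
            ref = f"content/{index}.xml"
            archive.writestr(ref, content)
            # The manifest maps each page title to its archive member.
            manifest[title] = ref
        archive.writestr("manifest.json", json.dumps(manifest))
    return buffer.getvalue()


def import_archive(raw: bytes) -> Dict[str, str]:
    """Read an archive back into a title -> content mapping."""
    with zipfile.ZipFile(io.BytesIO(raw)) as archive:
        manifest = json.loads(archive.read("manifest.json"))
        return {
            title: archive.read(ref).decode("utf-8")
            for title, ref in manifest.items()
        }
```

Either the built-in publisher or a third-party publisher could then consume the same archive, which matches the dotted arrows leaving the Archive node.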

I welcome other design considerations as well.

Finally, I would imagine the timelines for OP's requirements may not align with the timelines of this extension providing full support for build/publish separation. It should certainly be possible to tweak this extension to be flexible enough for third-party replay functionality (if tweaks to this extension are needed), but a full-fledged solution may not be looked into (by this maintainer) until possibly Fall, if not next year.

oxbqkwwxfrqccwtg commented 1 year ago

Marry me! 🥹 Thank you so much for such a quick and precise reply!

"Alternatively, you could also just crudely wipe the entire space and publish a new set with all versions being 1."

That's exactly what I had in mind. 😄 Quick-n-Nasty!

Should this become a feature, and should it be implemented similarly to what @jdknight proposes, I'd volunteer to provide a reference implementation of a third-party publisher. Having a plan B is always nice, so I don't see a problem in dedicating time to this, probably Q4 '23.

oxbqkwwxfrqccwtg commented 1 year ago

UPDATE: I've upgraded the boilerplate since the interim solution will be in place longer than I would like. Now, the boilerplate suppresses any connection to a Confluence instance and dumps contents and attachments to separate files (had issues with Anti-Virus when normalized). The schema is now redundant.

Attached you'll find my quick-n-dirty boilerplate implementation of the wrapper. Fixing up a custom REST client in PowerShell/Perl/whatever can be achieved in less than 50 lines, hence I don't see the point in sharing that part.

Maybe this will be useful for someone else.

@jdknight if I can support the project with any feature implementations, let me know. I feel comfortable with the code now.

In addition, I've appended a schema for the output.

By default the dump will go to sphinx.config.Config.outdir / 'confluence.out'.
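
For orientation, a `conf.py` using the wrapper might look roughly like the fragment below. The module name `xconfluence_wrapper` is hypothetical (it stands for wherever the script is saved on the Python path), and the usual `confluence_*` options are omitted; only `confluence_publish_dry_run` and `x_confluence_outdir` are taken from the wrapper's own docstring.

```python
# conf.py (sketch); "xconfluence_wrapper" is a hypothetical module name for
# wherever the wrapper script below is saved on the Python path.
extensions = ["xconfluence_wrapper"]

# Required according to the wrapper's docstring so that no HTTP
# connection is attempted.
confluence_publish_dry_run = True

# Optional: override the dump location (the default is described above).
x_confluence_outdir = "_build/confluence.out"
```

The build would then presumably be run with the wrapper's builder name, for example `sphinx-build -b x_confluence <sourcedir> <outdir>`.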

One can use the schema in conjunction with e.g. the Test-Json PowerShell cmdlet.

Attachment data is decoded as ISO-8859-1.
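
A short illustration of why ISO-8859-1 is a workable choice for carrying binary attachment data as text: every byte value maps to exactly one code point, so the decode/encode round trip is lossless. The helper names are hypothetical, and whether a consumer needs to re-encode at all depends on how the dump is read.

```python
def to_text(data: bytes) -> str:
    """Decode raw attachment bytes as ISO-8859-1 (one char per byte)."""
    return data.decode("iso-8859-1")


def to_bytes(text: str) -> bytes:
    """Recover the original bytes from ISO-8859-1 decoded text."""
    return text.encode("iso-8859-1")


# Every possible byte value survives the round trip unchanged.
payload = bytes(range(256))
assert to_bytes(to_text(payload)) == payload
```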

#!/usr/bin/env python3
"""Publish-delay wrapper for sphinxcontrib.confluencebuilder

This is a lightweight pass-through wrapper for
``sphinxcontrib.confluencebuilder``, which intercepts all ``store_*`` calls on
a ``ConfluencePublisher`` instance and dumps all data into interchange files.
The index and dumps can be used in conjunction with the PowerShell
helper to delay/replay the publishing of pages and attachments against a
Confluence instance other than the programmatic target.

The builder name is ``x_confluence``

``Publisher``, ``Builder``, as well as ``Rest`` instances are mocked and
suppress any HTTP connectivity, provided ``confluence_publish_dry_run`` is
set to ``True``.

.. warning::

    ``confluence_publish_dry_run`` MUST be set to ``True``

Content (pages) and attachments are dumped into separate files and indexed. 

The output directory can be set through ``x_confluence_outdir``.

The use-case for this implementation is as follows:

    I am currently facing the situation where i need to publish to an 
    air-gapped Confluence server inside a virtualised and privatised 
    environment (over Windows VDI). In addition, company policy forbids me from 
    using, or installing Python 
    on the VDI. I can freeload a perl executable that came bundled with Git for 
    Windows, but there are no other scripting means besides Windows PowerShell.

"""
__author__ = 'tiara.rodney@adesso.de'
__copyright__ = 'adesso SE'
__license__ = 'DL-DE-BY-2.0'

from dataclasses import dataclass, asdict
import json
from mimetypes import guess_extension
from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple
from unittest.mock import patch
from uuid import uuid4

from sphinx.application import Sphinx
from sphinx.util import logging

from sphinxcontrib.confluencebuilder import setup as _setup
from sphinxcontrib.confluencebuilder.builder import (
    ConfluenceBuilder as _ConfluenceBuilder
)
from sphinxcontrib.confluencebuilder.publisher import (
    ConfluencePublisher as _ConfluencePublisher
)
from sphinxcontrib.confluencebuilder.rest import Rest as _Rest

logger = logging.getLogger(__name__)

@dataclass
class ConfluenceContentMeta:
    """
    see
    `https://docs.atlassian.com/ConfluenceServer/rest/8.4.0/#api/content-createContent`_,
    for more information
    """
    #:
    title: str

    #: 
    ancestor_id: str

@dataclass
class ConfluenceChildAttachmentMeta:
    """

    """
    #: 
    container_id: str

    #: 
    name: str

    #: 
    mimetype: str

@dataclass
class ConfluencePublisherDump:
    """
    """
    #: 
    pages: Dict[str, ConfluenceContentMeta]

    #: 
    attachments: Dict[str, ConfluenceChildAttachmentMeta]

class Rest(_Rest):
    """
    """
    def __init__(self, config):
        """
        """

        super().__init__(config)

    def __getattr__(self, name: str) -> Any:
        """fall back to the default attribute lookup
        """
        return super().__getattribute__(name)

    def __setattr__(self, name: str, value: Any) -> None:
        """
        """
        return super().__setattr__(name, value)

    def get(self, key, params=None):
        """return a canned single-space result instead of querying a server
        """
        return {'results': [
            {
                'id': 776536065,
                'key': self.config.confluence_space_key,
                'name': 'Testitest',
                'type': 'personal'
            }
        ], 'size': 1, 'limit': 1, 'start': 0}

class ConfluencePublisher(_ConfluencePublisher):
    """
    """
    def __init__(self):
        """
        """
        super().__init__()

        self.dump = ConfluencePublisherDump(
            pages = {},
            attachments = {}
        )

    def __getattr__(self, name: str) -> Any:
        """
        """
        return super().__getattribute__(name)

    def __setattr__(self, name: str, value: Any) -> None:
        """
        """
        return super().__setattr__(name, value)

    def connect(self):
        """initialize a REST client and probe the target Confluence instance

        .. note::

            Actually, i don't want the extension to initialize a connection, 
            but there is too much entanglement, so we're mocking the absolute
            minimum for the publisher object to assume everything is fine
        """
        with patch('sphinxcontrib.confluencebuilder.publisher.Rest', Rest):

            return super().connect()

    def get_page_by_id(self, page_id, expand = 'version') -> Tuple[None, List]:
        """get page information with the provided page id

        :param page_id: the page identifier
        :param expand: data to expand on

        :returns: page id and page object
        """
        return (None, [])

    def store_attachment(
        self,
        page_id: str,
        name: str,
        data: Any,
        mimetype: Any,
        hash_: str,
        force: bool = False
    ) -> str:
        """request to store an attachment on a provided page

        :returns: the attachment identifier
        """
        logger.info('pass-through intercept: store_attachment')

        # keep the id a string so it works as a path segment and JSON key
        attachment_id = str(uuid4())

        mime_extension = guess_extension(mimetype, False)

        if mime_extension:

            attachment_id = f'{attachment_id}{mime_extension}'

        file = (Path(getattr(self.config, 'x_confluence_outdir')) / 
                    'attachments' / attachment_id)

        file.parent.mkdir(parents=True, exist_ok=True)

        file.write_bytes(data)

        self.dump.attachments[attachment_id] = ConfluenceChildAttachmentMeta(
            container_id = page_id,
            name = name,
            mimetype= mimetype
        )

        return attachment_id

    def store_page(
        self,
        page_name: str,
        data: Any,
        parent_id: Optional[str] = None
    ) -> str:
        """request to store page information to a confluence instance

        :param page_name: the page title to use on the updated page
        :param data:  the page data to apply
        :param parent_id: the id of the ancestor to use

        :returns: id of uploaded page
        """
        logger.info('pass-through intercept: store_page')

        content_id = str(uuid4())

        file = (Path(getattr(self.config, 'x_confluence_outdir')) / 'content' /
                    f'{content_id}.xml')

        file.parent.mkdir(parents=True, exist_ok=True)

        file.write_bytes(data['content'].encode('utf-8'))

        self.dump.pages[content_id] = ConfluenceContentMeta(
            title = page_name,
            ancestor_id = parent_id
        )

        return content_id

    def store_page_by_id(
        self,
        page_name: str,
        page_id: str,
        data: Any
    ) -> str:
        """request to store page information on the page with a matching id

        :param page_name: the page title to use on the updated page
        :param page_id: the id of the page to update
        :param data:  the page data to apply

        :returns: id of uploaded page
        """
        logger.info('pass-through intercept: store_page_by_id')

        return 'NULL'

    def disconnect(self):
        """terminate the REST client

        .. note::

            Freeloading this method to dump the index.
        """
        file = Path(getattr(self.config, 'x_confluence_outdir')) / 'data.json'

        file.parent.mkdir(parents = True, exist_ok=True)

        raw = json.dumps(asdict(self.dump), indent=4)

        file.write_text(raw)

        logger.info(f'content dump count: {len(self.dump.pages)}')

        logger.info(f'attachments dump count: {len(self.dump.attachments)}')

        logger.info(f'dump index: {file}')

class ConfluenceBuilder(_ConfluenceBuilder):
    """
    """
    name = 'x_confluence'

    def __init__(self, app: Sphinx, env = None):
        """
        """
        patch_target = ('sphinxcontrib.confluencebuilder'
                        '.builder.ConfluencePublisher')

        with patch(patch_target, ConfluencePublisher):

            super().__init__(app, env)

    def __getattribute__(self, name: str) -> Any:
        """
        """
        return super().__getattribute__(name)

    def __setattr__(self, name: str, value: Any) -> None:
        """
        """
        return super().__setattr__(name, value)

def setup(app: Sphinx):
    """
    """
    patch_target = 'sphinxcontrib.confluencebuilder.ConfluenceBuilder'

    app.add_config_value(
        name = 'x_confluence_outdir',
        default = str(Path(app.outdir) / 'confluence.out'),
        rebuild = True
    )

    with patch(patch_target, ConfluenceBuilder):

        logger.info(f'patching: {patch_target}')

        return _setup(app)

{
    "$id": "https://github.com/tiara-adessi/confluencebuilder/schema/top",
    "x-authors": [
        "tiara.rodney@adesso.de"
    ],
    "type": "object",
    "properties": {
        "pages": {
            "type": "array",
            "items": {
                "$ref": "#/definitions/page"
            }
        },
        "attachments": {
            "type": "array",
            "items": {
                "$ref": "#/definitions/attachment"
            }
        }
    },
    "required": [
        "pages",
        "attachments"
    ],
    "definitions": {
        "page": {
            "type": "object",
            "properties": {
                "page_name": {
                    "type": "string"
                },
                "page_id": {
                    "type": "string"
                },
                "parent_id": {
                    "type": "string"
                },
                "data": {
                    "type": "object",
                    "properties": {
                        "content": {
                            "type": "string"
                        },
                        "labels": {
                            "type": "array",
                            "items": {
                                "type": "string"
                            }
                        }
                    },
                    "required": [
                        "content",
                        "labels"
                    ]
                }
            },
            "required": [
                "page_name",
                "page_id",
                "parent_id",
                "data"
            ]
        },
        "attachment": {
            "type": "object",
            "properties": {
                "page_id": {
                    "type": "string"
                },
                "name": {
                    "type": "string"
                },
                "data": {
                    "type": "string"
                },
                "mimetype": {
                    "type": "string"
                },
                "attachment_id": {
                    "type": "string"
                }
            },
            "required": [
                "page_id",
                "name",
                "data",
                "mimetype",
                "attachment_id"
            ]
        }
    }
}

jdknight commented 1 year ago

Thanks for the example; it provides some good insight into what is desired.

I'm curious about the schema definition, since I cannot say I've created a schema definition file for JSON before. I noticed the use of type, properties, required, etc., which looks like a way to describe the schema so that it can be understood programmatically. Is this based on a common practice (e.g. using these specific keys)? Was it created manually or generated with something else?

oxbqkwwxfrqccwtg commented 1 year ago

That was a very nice, subtle hint that I was missing the $schema keyword (though it's not mandatory). 😄

Yes, these are common practices as defined in the JSONSchema specification(s).

There are some pretty good validators out there (e.g., I use this one in Python, and this one in Node.js, and this one in Perl). The nice thing about Microsoft PowerShell is that it has a built-in Newtonsoft validator through the Test-Json cmdlet, which is available for PowerShell Core and Desktop. So Microsoft seems to be leading the way in making JSONSchema actually more practical (it's stuck in the forever-draft dimension).

Yes, the schema was created manually, and I tend to use older versions of the specification to author them, so that it stays compatible across multiple validators.

Besides the $ref reference resolution (the specification is vague about that, therefore there are multiple understandings and implementation approaches), it is pretty straightforward.
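
To show how the keywords fit together, here is a deliberately tiny validator sketch covering only `type`, `required`, `properties`, `items` and local `$ref` pointers of the form `#/definitions/...`. It is not a JSONSchema implementation and ignores most of the specification; the function names are hypothetical and a real project would use one of the established validators instead.

```python
from typing import Any, Dict, List

TYPES = {"object": dict, "array": list, "string": str}


def resolve(schema: Dict[str, Any], root: Dict[str, Any]) -> Dict[str, Any]:
    """Follow a local reference such as ``#/definitions/page``."""
    while "$ref" in schema:
        ref = schema["$ref"]
        if not ref.startswith("#/"):
            raise ValueError(f"only local references are supported: {ref}")
        node: Any = root
        for part in ref[2:].split("/"):
            node = node[part]
        schema = node
    return schema


def validate(
    value: Any, schema: Dict[str, Any], root: Dict[str, Any], path: str = "$"
) -> List[str]:
    """Return a list of error messages (empty if the value conforms)."""
    schema = resolve(schema, root)
    expected = schema.get("type")
    if expected in TYPES and not isinstance(value, TYPES[expected]):
        return [f"{path}: expected {expected}"]
    errors: List[str] = []
    if isinstance(value, dict):
        for key in schema.get("required", []):
            if key not in value:
                errors.append(f"{path}: missing required property {key!r}")
        for key, sub in schema.get("properties", {}).items():
            if key in value:
                errors.extend(validate(value[key], sub, root, f"{path}.{key}"))
    if isinstance(value, list) and "items" in schema:
        for index, item in enumerate(value):
            errors.extend(
                validate(item, schema["items"], root, f"{path}[{index}]")
            )
    return errors
```

Note that this sketch resolves `$ref` against the root document only, which is just one of the possible interpretations mentioned above.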

By the way, I refactored the code once more, as it turns out that my temporary solution will probably have to stay long-term because the promised CI/CD environment won't suffice. I'll share the repos some day next week.

oxbqkwwxfrqccwtg commented 1 year ago

Sorry for the delayed (promised) notice.

We have now published two programs and made this an open-source effort:

xconfluencebuilder generates the manifest (including referenced content) of archived pages/attachments and PSConfluencePublisher does the publishing (also works with PowerShell 5 [Desktop]). PSConfluencePublisher is now also able to do unidirectional synchronization and caches all metadata locally as to reduce the amount of network chatter.
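
As an illustration of how locally cached metadata can cut down network chatter, a publisher could compare a content hash against the cached one and skip unchanged entries. This is a sketch under assumptions, not PSConfluencePublisher's actual logic: it assumes each entry carries a SHA-512 hex digest (as the `Hash` field of the manifest schema suggests), and the function names are hypothetical.

```python
import hashlib
from typing import Dict, List


def sha512_hex(data: bytes) -> str:
    """Return the SHA-512 digest of ``data`` as a hexadecimal string."""
    return hashlib.sha512(data).hexdigest()


def changed_entries(
    local: Dict[str, bytes], cached_hashes: Dict[str, str]
) -> List[str]:
    """Return the titles whose content differs from the cached hash.

    Entries missing from the cache count as changed, so new pages are
    always published.
    """
    return sorted(
        title
        for title, data in local.items()
        if cached_hashes.get(title) != sha512_hex(data)
    )
```

Only the titles returned by `changed_entries` would need a request; everything else can be skipped without contacting the server.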

We hope the reference implementations suffice for you to give them a test drive.

The manifest schema (currently only part of the PSConfluencePublisher repo) has been adapted to:

{
    "$id": "https://spec.victory-k.it/psconfluencepublisher.json",
    "x-authors": [
        "theodor.rodweil@victory-k.it"
    ],
    "type": "object",
    "properties": {
        "Pages": {
            "type": "array",
            "items": {
                "$ref": "#/definitions/page"
            }
        },
        "Attachments": {
            "type": "array",
            "items": {
                "$ref": "#/definitions/attachment"
            }
        }
    },
    "required": [
        "Pages",
        "Attachments"
    ],
    "definitions": {
        "page": {
            "type": "object",
            "description": "Local Confluence page metadata",
            "properties": {
                "Title": {
                    "type": "string",
                    "description": "Title of page"
                },
                "Id": {
                    "type": "string",
                    "description": "Id of page defined by Confluence instance. The id is generated after the publishing of a page."
                },
                "Version": {
                    "type": "string"
                },
                "Hash": {
                    "type": "string",
                    "description": "SHA512 hexadecimal content hash value"
                },
                "Ref": {
                    "type": "string",
                    "description": "Local filesystem reference/path"
                },
                "AncestorTitle": {
                    "type": "string",
                    "description": "Title of Confluence page this page is a child of. The title must be a property key of the pages object."
                }
            },
            "required": [
                "Title",
                "Ref"
            ]
        },
        "attachment": {
            "type": "object",
            "description": "Local Confluence page/container attachment metadata",
            "properties": {
                "Name": {
                    "type": "string",
                    "description": "name of attachment, which must be unique within the container page"
                },
                "Id": {
                    "type": "string",
                    "description": "Id of attachment defined by Confluence instance. The id is generated after the publishing of an attachment."
                },
                "Hash": {
                    "type": "string",
                    "description": "SHA512 hexadecimal attachment content hash value"
                },
                "MimeType": {
                    "type": "string",
                    "description": "MIME type of attachment",
                    "default": "binary/octet-stream"
                },
                "ContainerPageTitle": {
                    "type": "string",
                    "description": "Title of Confluence page this attachment is contained in. The title must be a property key of the pages object."
                },
                "Ref": {
                    "type": "string",
                    "description": "Local filesystem reference/path"
                }
            },
            "required": [
                "Name",
                "Hash",
                "MimeType",
                "ContainerPageTitle",
                "Ref"
            ]
        }
    }
}
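
Because each page may name an `AncestorTitle`, a publisher has to create ancestors before their children. A minimal sketch of such an ordering is shown below; the function name and the plain title-to-ancestor mapping are assumptions made for illustration and are not taken from either published program.

```python
from typing import Dict, List, Optional, Set


def publish_order(ancestors: Dict[str, Optional[str]]) -> List[str]:
    """Order page titles so that each ancestor precedes its children.

    ``ancestors`` maps a page title to its ancestor title (``None`` for
    top-level pages). Ancestors that are not part of the mapping are
    assumed to exist on the server already.
    """
    ordered: List[str] = []
    visiting: Set[str] = set()
    done: Set[str] = set()

    def visit(title: str) -> None:
        if title in done:
            return
        if title in visiting:
            raise ValueError(f"cyclic ancestry at {title!r}")
        visiting.add(title)
        parent = ancestors.get(title)
        if parent is not None and parent in ancestors:
            visit(parent)
        visiting.discard(title)
        done.add(title)
        ordered.append(title)

    for title in ancestors:
        visit(title)
    return ordered
```

The recursion depth equals the depth of the page tree, which is fine for typical documentation sets.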