get_child_nodes("pages", ...) tries to load chapters as pages

Snaptraks commented 10 months ago

I am trying to use the exporter, but I ran into a problem: it tries to fetch a page by ID but actually uses a chapter ID, so the API returns a 404 and the exporter crashes. The method is https://github.com/homeylab/bookstack-file-exporter/blob/main/bookstack_file_exporter/exporter/exporter.py#L143

The book in question is not in a shelf, contains 4 Chapters, some without pages.

I modified NodeExporter._get_children in exporter.py to print the parent's children, and the offending one looks like

{'book_id': 1,
 'created_at': '2023-12-06T13:43:48.000000Z',
 'id': 3,
 'name': 'Embrace2',
 'pages': [{'book_id': 1,
            'chapter_id': 3,
            'created_at': '2023-12-08T20:10:06.000000Z',
            'draft': False,
            'id': 13,
            'name': 'Embrace 2',
            'priority': 1,
            'slug': 'embrace-2',
            'template': False,
            'updated_at': '2023-12-08T20:10:16.000000Z',
            'url': '<base_url>/books/devices/page/embrace-2'}],
 'priority': 1,
 'slug': 'embrace2',
 'type': 'chapter',
 'updated_at': '2023-12-06T13:44:36.000000Z',
 'url': '<base_url>/books/devices/chapter/embrace2'}

and the child_url is <base_url>/api/pages/3

Notice how it is using the chapter ID to get a page from the API, a page that was probably deleted when setting up the Bookstack instance.

The traceback I am getting is

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\p0129085\AppData\Local\miniconda3\envs\bookstack\Scripts\bookstack-file-exporter.exe\__main__.py", line 7, in <module>
  File "C:\Users\p0129085\AppData\Local\miniconda3\envs\bookstack\Lib\site-packages\bookstack_file_exporter\__main__.py", line 12, in main
    run.exporter(args)
  File "C:\Users\p0129085\AppData\Local\miniconda3\envs\bookstack\Lib\site-packages\bookstack_file_exporter\run.py", line 36, in exporter
    page_nodes: Dict[int, Node] = export_helper.get_all_pages(book_nodes)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\p0129085\AppData\Local\miniconda3\envs\bookstack\Lib\site-packages\bookstack_file_exporter\exporter\exporter.py", line 147, in get_all_pages
    page_nodes: Dict[int, Node] = self.get_child_nodes("pages", book_nodes)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\p0129085\AppData\Local\miniconda3\envs\bookstack\Lib\site-packages\bookstack_file_exporter\exporter\exporter.py", line 84, in get_child_nodes
    return self._get_children(base_url, parent_nodes, filter_empty)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\p0129085\AppData\Local\miniconda3\envs\bookstack\Lib\site-packages\bookstack_file_exporter\exporter\exporter.py", line 98, in _get_children
    child_node = Node(child_data, parent)
                 ^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\p0129085\AppData\Local\miniconda3\envs\bookstack\Lib\site-packages\bookstack_file_exporter\exporter\node.py", line 38, in __init__
    self.name: str = self.meta['slug']
                     ~~~~~~~~~^^^^^^^^
KeyError: 'slug'

pchang388 commented 10 months ago

First, thanks again for raising the issue and contributing @Snaptraks!

I'm still looking into this and I have an example created in my own instance (Bookstack latest: v23.10.4) as so:

Now I can see this in the /books API:

I then call the /chapters/4 API for a chapter in that specific book with no shelf:

JSON response

```json
{
  "id": 4,
  "book_id": 45,
  "slug": "test-chapter-no-shelve",
  "name": "TEST CHAPTER NO SHELVE",
  "description": "",
  "priority": 3,
  "created_at": "2023-12-15T10:16:44.000000Z",
  "updated_at": "2023-12-15T10:16:44.000000Z",
  "created_by": { "id": 3, "name": "user", "slug": "user" },
  "updated_by": { "id": 3, "name": "user", "slug": "user" },
  "owned_by": { "id": 3, "name": "user", "slug": "user" },
  "book_slug": "test-book-no-shelve",
  "tags": [],
  "pages": [
    {
      "id": 109,
      "book_id": 45,
      "chapter_id": 4,
      "name": "TEST PAGE IN CHAPTER NO SHELF",
      "slug": "test-page-in-chapter-no-shelf",
      "priority": 1,
      "created_at": "2023-12-15T10:16:45.000000Z",
      "updated_at": "2023-12-15T10:17:03.000000Z",
      "created_by": 3,
      "updated_by": 3,
      "draft": false,
      "revision_count": 1,
      "template": false,
      "owned_by": 3,
      "editor": "",
      "book_slug": "test-book-no-shelve"
    }
  ]
}
```

Your specific error says that the slug key does not exist in the Bookstack API response, but according to the response in your example, the slug key is present with the value embrace2:

  File "C:\Users\p0129085\AppData\Local\miniconda3\envs\bookstack\Lib\site-packages\bookstack_file_exporter\exporter\node.py", line 38, in __init__
    self.name: str = self.meta['slug']
                     ~~~~~~~~~^^^^^^^^
KeyError: 'slug'

That seems odd since, from my understanding, every resource (chapter/page/book/shelf) should have a slug key even if the value is empty. Either slug is not the best fit for this use case because it is missing in cases like yours, or the exporter is making an unexpected call to the Bookstack API and treating the response as a chapter/page/book/shelf resource.

Could you try adding a log/print for self.meta in the problem function in exporter\node.py, like so?

## if you want to add logger from `__main__.py`
# import logging
# log = logging.getLogger(__name__)

class Node():
    .....
    def __init__(self, meta: Dict[str, Union[str, int]],
                 parent: Union['Node', None] = None, path_prefix: str = ""):
        print(meta)
        # log.info(meta)
        self.meta = meta

I'm curious to see what the response looks like from the one that raises the KeyError: 'slug' error and if we need to switch to a different key for asset naming. So far this looks like an edge case I haven't encountered before.

Regardless, I'll mess around more and see if I can replicate your issue. I can also work on an updated version that adds better debug logging to make it easier to identify the problem.

pchang388 commented 10 months ago

Sorry, accidentally closed; reopened!

Snaptraks commented 10 months ago

Of course @pchang388 ! Thank you for taking the time to look into it, you did a wonderful job already with the project. My error occurs because the code looks at a Chapter with ID 3, and then creates a URL for a page with ID 3:

in exporter.py:

    def _get_children(
        self, base_url: str, parent_nodes: Dict[int, Node], filter_empty: bool
    ) -> Dict[int, Node]:
        child_nodes = {}
        for _, parent in parent_nodes.items():
            if parent.children:
                for child in parent.children:
                    # if child.get('type') == "chapter":
                    #     continue
                    child_id = child["id"]
                    child_url = f"{base_url}/{child_id}"
                    child_data = self._get_json_response(child_url)
                    print("exporter._get_children, child:")
                    pprint(child)
                    print(f"{child_url=}")
                    child_node = Node(child_data, parent)
                    if filter_empty:
                        if not child_node.empty:
                            child_nodes[child_id] = child_node
                    else:
                        child_nodes[child_id] = child_node
        return child_nodes

and in node.py:

    def __init__(
        self,
        meta: Dict[str, Union[str, int]],
        parent: Union["Node", None] = None,
        path_prefix: str = "",
    ):
        print("node.Node.__init__, meta:")
        pprint(meta)
        self.meta = meta
        self._parent = parent
        self._path_prefix = path_prefix
        # for convenience/usage for exporter
        self.name: str = self.meta["slug"]
        self.id_: int = self.meta["id"]
        self._display_name = self.meta["name"]
        # children
        self._children = self._get_children()
        # if parent
        self._file_path = self._get_file_path()

which prints this in the console, right before the exception:

exporter._get_children, child:
{'book_id': 1,
 'created_at': '2023-12-06T13:43:48.000000Z',
 'id': 3,
 'name': 'Embrace2',
 'pages': [],
 'priority': 1,
 'slug': 'embrace2',
 'type': 'chapter',
 'updated_at': '2023-12-06T13:44:36.000000Z',
 'url': 'http://10.54.56.121:6875/books/devices/chapter/embrace2'}

child_url='http://10.54.56.121:6875/api/pages/3'

node.Node.__init__, meta:
{'error': {'code': 404, 'message': 'Page not found'}}
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\p0129085\AppData\Local\miniconda3\envs\bookstack\Scripts\bookstack-file-exporter.exe\__main__.py", line 7, in <module>
  File "C:\Users\p0129085\Documents\bookstack-file-exporter\bookstack_file_exporter\__main__.py", line 12, in main
    run.exporter(args)
  File "C:\Users\p0129085\Documents\bookstack-file-exporter\bookstack_file_exporter\run.py", line 36, in exporter
    page_nodes: Dict[int, Node] = export_helper.get_all_pages(book_nodes)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\p0129085\Documents\bookstack-file-exporter\bookstack_file_exporter\exporter\exporter.py", line 164, in get_all_pages
    page_nodes: Dict[int, Node] = self.get_child_nodes("pages", book_nodes)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\p0129085\Documents\bookstack-file-exporter\bookstack_file_exporter\exporter\exporter.py", line 95, in get_child_nodes
    return self._get_children(base_url, parent_nodes, filter_empty)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\p0129085\Documents\bookstack-file-exporter\bookstack_file_exporter\exporter\exporter.py", line 112, in _get_children
    child_node = Node(child_data, parent)
                 ^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\p0129085\Documents\bookstack-file-exporter\bookstack_file_exporter\exporter\node.py", line 46, in __init__
    self.name: str = self.meta["slug"]
                     ~~~~~~~~~^^^^^^^^
KeyError: 'slug'

So the meta parameter is just a 404 error payload, since in my instance there is no page with ID 3, the ID of the chapter.
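
A minimal sketch of why the 404 payload surfaces as a KeyError rather than as an HTTP error. The payload is copied from the console output above; `require_slug` is a hypothetical helper used only for illustration and is not part of the exporter.

```python
from typing import Any, Dict

# Payload copied from the console output above.
ERROR_PAYLOAD: Dict[str, Any] = {"error": {"code": 404, "message": "Page not found"}}


def require_slug(meta: Dict[str, Any]) -> str:
    """Hypothetical helper: return meta['slug'], or raise a descriptive error."""
    if "error" in meta:
        err = meta["error"]
        raise ValueError(
            f"API returned error {err.get('code')}: {err.get('message')}"
        )
    return meta["slug"]


# A plain lookup on the error payload reproduces the reported exception.
try:
    ERROR_PAYLOAD["slug"]
except KeyError as exc:
    print(f"KeyError: {exc}")

# The guard reports the underlying cause instead.
try:
    require_slug(ERROR_PAYLOAD)
except ValueError as exc:
    print(exc)
```

Checking for the error key before reading the resource fields keeps the failure message tied to the API response instead of to a missing dictionary key.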

I'm happy to help further if more information is needed

pchang388 commented 10 months ago

Thanks again for the extra info! I took another look and, yes, it does appear that a chapter is being treated as a page, as you stated.

NodeExporter.get_all_pages

  def get_all_pages(self, book_nodes: Dict[int, Node]) -> Dict[int, Node]:
      if book_nodes:
          page_nodes: Dict[int, Node] = self.get_child_nodes("pages", book_nodes)

As you stated, there is no check in that function. I missed it because it didn't cause an issue on my instance. Since I get the chapter nodes later, I should filter for pages only here, since that is our intent. I adjusted that part and added a check, as you pointed out:

    def get_all_pages(self, book_nodes: Dict[int, Node]) -> Dict[int, Node]:
        """get all pages and their content"""
        ## pages
        page_nodes = {}
        if book_nodes:
            # add `page` flag, we only want pages
            # filter out chapters for now
            # chapters can have their own children/pages
            page_nodes: Dict[int, Node] = self.get_child_nodes("pages", book_nodes, node_type="page")
        ## chapters (if exists)
        # chapter nodes are treated a little differently
        # chapters are children under books
        chapter_nodes: Dict[int, Node] = self.get_chapter_nodes(book_nodes)
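
The body of the node_type filter is not shown in this thread. As a rough sketch of the idea only, assuming that each child entry of a book carries a 'type' key (the chapter entry above has 'type': 'chapter', and page entries are assumed to have 'type': 'page'), the filtering could look like this. `filter_children` is a hypothetical name, not the exporter's actual function.

```python
from typing import Any, Dict, List, Optional


def filter_children(
    children: List[Dict[str, Any]], node_type: Optional[str] = None
) -> List[Dict[str, Any]]:
    """Hypothetical helper: keep only children whose 'type' matches node_type.

    With node_type=None every child is kept, which mirrors the old behaviour.
    """
    if node_type is None:
        return list(children)
    return [child for child in children if child.get("type") == node_type]


# Trimmed-down book contents: one chapter and one page (IDs are illustrative).
book_children = [
    {"id": 3, "type": "chapter", "slug": "embrace2"},
    {"id": 13, "type": "page", "slug": "embrace-2"},
]

# Only the page remains, so no request is built from the chapter's ID.
page_ids = [child["id"] for child in filter_children(book_children, "page")]
print(page_ids)
```

Filtering before the URL is built means a chapter ID never reaches the pages endpoint, which is the failure mode described in this issue.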

In the requests method, I also added clearer error reporting for this kind of failure. We already retry and raise on 50X but did not account for 40X scenarios like this one.

    try:
        # raise_for_status() throws an exception on codes 400-599
        response.raise_for_status()
    except requests.exceptions.HTTPError as e:
        # this means it either exceeded 50X retries in `http_get_request` handler
        # or it returned a 40X which is not expected
        log.error("Bookstack request failed with status code: %d on url: %s",
                   response.status_code, url)
        raise e

Also, unless there are pages in the chapters, they will be ignored since we export at the page level. Thanks again for catching this and providing helpful information to pinpoint the issue!

I'll push a version 1.0.2 for you to try if you get a chance to validate!

pchang388 commented 10 months ago

@Snaptraks - the auto action from merge/commit closed this issue.

I reopened for now in case you get some time to validate, thanks!

Snaptraks commented 10 months ago

I tested v1.0.2 on my problematic instance and it worked flawlessly! Thank you again for your time, you made my life so much easier with your tool :)

homeylab / bookstack-file-exporter

get_child_nodes("pages", ...) tries to load chapters as pages #30