hunyadi / md2conf

Publish Markdown files to Confluence wiki
MIT License
56 stars 31 forks source link

Directories no longer being recursed into #62

Closed kedoodle closed 2 months ago

kedoodle commented 2 months ago

Hi there, it looks like a recent change may have introduced a regression with recursive syncing of a directory. Currently only finding files in the provided mdpath directory (not sub directories).

Set up:

mkdir -p tmp-reproduce-regression
cd tmp-reproduce-regression
mkdir -p sub-dir-1
mkdir -p sub-dir-2
echo '<!-- confluence-page-id: 1 -->' > root-file-1.md
echo '<!-- confluence-page-id: 2 -->' > root-file-2.md
echo '<!-- confluence-page-id: 3 -->' > root-file-3.md
echo '<!-- confluence-page-id: 4 -->' > sub-dir-1/nested-file-4.md
echo '<!-- confluence-page-id: 5 -->' > sub-dir-2/nested-file-5.md

Should produce something like:

$ tree
.
├── root-file-1.md
├── root-file-2.md
├── root-file-3.md
├── sub-dir-1
│   └── nested-file-4.md
└── sub-dir-2
    └── nested-file-5.md

3 directories, 5 files

Running md2conf:

$ docker run --rm -v "$(pwd):/workdir" leventehunyadi/md2conf:latest \
    -d foo.atlassian.net \
    -u USER_NAME \
    -a API_KEY \
    -s FOO \
    --ignore-invalid-url \
    --local \
    /workdir
WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
2024-10-02 02:00:47,595 - INFO - process_directory [43] - Synchronizing directory: /workdir
2024-10-02 02:00:47,596 - INFO - _index_directory [71] - Indexing directory: /workdir
2024-10-02 02:00:47,599 - INFO - process_directory [48] - indexed 3 page(s)

Expected to index 5 pages, not 3.

hunyadi commented 2 months ago

Strange. Do you see the same symptoms if you install the extension from PyPI and run the command md2conf directly (rather than via Docker)? I have a test case that synchronizes a directory tree and it appears to work nicely.

hunyadi commented 2 months ago

It appears the newly introduced Matcher class was eagerly filtering directory entries based on extension mismatch, which resulted in the processor never recursing into directories. Matcher was extended to handle directories separately, and unit tests have been added to verify that nested directories are properly visited. Improvements have been pushed to the branch master.

Thanks for reporting the issue!

kedoodle commented 2 months ago

Hi @hunyadi, thank you for the prompt fix! Yes, I was seeing the issue with running md2conf directly, although it was still inside a container (python:3.11-slim base) as it's in a pipeline.

Is it possible to get a container image with the fix built? Happy to help with release strategy e.g. GitHub release -> CI publishes md2conf PyPI and container image.

ojacques commented 1 month ago

@hunyadi I still have the issue. Can this be re-opened?

No folder structure created

docker run --rm -v "$(pwd)/test-md2conf:/workdir" leventehunyadi/md2conf:latest \
    -r $1 \
    --domain my-confluence \
    --path '/' \
    --space 'MYSPACE' \
    --apikey "$CONFLUENCE_API_TOKEN" \
    --ignore-invalid-url \
    /workdir

with test-md2conf (attached)

test-md2conf.zip

test-md2conf
|-- README.md
|-- folder-A
|   |-- folder-AA
|   |   |-- README.md
|   |   `-- another-page-aa.md
|   `-- folder-AB
|       |-- README.md
|       `-- another-page-ab.md
`-- folder-B
    |-- README.md
    `-- another-page-b.md

I get the logs :

2024-10-23 17:47:39,480 - INFO - synchronize_directory [54] - Synchronizing directory: /workdir
2024-10-23 17:47:39,480 - INFO - _index_directory [94] - Indexing directory: /workdir
2024-10-23 17:47:39,480 - INFO - page_exists [471] - Checking if page exists with title: README
2024-10-23 17:47:39,563 - INFO - create_page [438] - Creating page: README
2024-10-23 17:47:39,677 - INFO - _index_directory [94] - Indexing directory: /workdir/folder-B
2024-10-23 17:47:39,678 - INFO - page_exists [471] - Checking if page exists with title: another-page-b
2024-10-23 17:47:39,709 - INFO - create_page [438] - Creating page: another-page-b
2024-10-23 17:47:39,839 - INFO - page_exists [471] - Checking if page exists with title: README
2024-10-23 17:47:39,918 - INFO - _index_directory [94] - Indexing directory: /workdir/folder-A
2024-10-23 17:47:39,918 - INFO - _index_directory [94] - Indexing directory: /workdir/folder-A/folder-AA
2024-10-23 17:47:39,918 - INFO - page_exists [471] - Checking if page exists with title: another-page-aa
2024-10-23 17:47:39,950 - INFO - create_page [438] - Creating page: another-page-aa
2024-10-23 17:47:40,073 - INFO - page_exists [471] - Checking if page exists with title: README
2024-10-23 17:47:40,148 - INFO - _index_directory [94] - Indexing directory: /workdir/folder-A/folder-AB
2024-10-23 17:47:40,148 - INFO - page_exists [471] - Checking if page exists with title: README
2024-10-23 17:47:40,208 - INFO - page_exists [471] - Checking if page exists with title: another-page-ab
2024-10-23 17:47:40,243 - INFO - create_page [438] - Creating page: another-page-ab
2024-10-23 17:47:40,375 - INFO - synchronize_directory [64] - indexed 7 page(s)
2024-10-23 17:47:40,375 - INFO - _synchronize_page [77] - Synchronizing page: /workdir/README.md
2024-10-23 17:47:40,458 - INFO - update_page [418] - Updating page: 2009315992
2024-10-23 17:47:40,585 - INFO - _synchronize_page [77] - Synchronizing page: /workdir/folder-B/another-page-b.md
2024-10-23 17:47:40,636 - INFO - update_page [418] - Updating page: 2009315993
2024-10-23 17:47:40,764 - INFO - _synchronize_page [77] - Synchronizing page: /workdir/folder-B/README.md
2024-10-23 17:47:40,820 - INFO - update_page [418] - Updating page: 2009315992
2024-10-23 17:47:40,941 - INFO - _synchronize_page [77] - Synchronizing page: /workdir/folder-A/folder-AA/another-page-aa.md
2024-10-23 17:47:40,991 - INFO - update_page [418] - Updating page: 2009315994
2024-10-23 17:47:41,122 - INFO - _synchronize_page [77] - Synchronizing page: /workdir/folder-A/folder-AA/README.md
2024-10-23 17:47:41,175 - INFO - update_page [418] - Updating page: 2009315992
2024-10-23 17:47:41,296 - INFO - _synchronize_page [77] - Synchronizing page: /workdir/folder-A/folder-AB/README.md
2024-10-23 17:47:41,342 - INFO - update_page [418] - Updating page: 2009315992
2024-10-23 17:47:41,458 - INFO - _synchronize_page [77] - Synchronizing page: /workdir/folder-A/folder-AB/another-page-ab.md
2024-10-23 17:47:41,511 - INFO - update_page [418] - Updating page: 2009315995

But no folder structure:

image

Error when scanning directory, outside of Docker

Also, when I run from md2conf which is installed locally (pip install / no container),

python3 -m md2conf -r $1 \
    --domain my-confluence \
    --path '/' \
    --space 'MYSPACE' \
    --apikey "$CONFLUENCE_API_TOKEN" \
    --ignore-invalid-url \
    ./test-md2conf

I get

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/user/git/docs-md2conf/venv/lib/python3.11/site-packages/md2conf/__main__.py", line 211, in <module>
    main()
  File "/home/user/git/docs-md2conf/venv/lib/python3.11/site-packages/md2conf/__main__.py", line 196, in main
    ).synchronize(args.mdpath)
      ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/git/docs-md2conf/venv/lib/python3.11/site-packages/md2conf/application.py", line 40, in synchronize
    self.synchronize_directory(path)
  File "/home/user/git/docs-md2conf/venv/lib/python3.11/site-packages/md2conf/application.py", line 63, in synchronize_directory
    self._index_directory(local_dir, root_id, page_metadata)
  File "/home/user/git/docs-md2conf/venv/lib/python3.11/site-packages/md2conf/application.py", line 125, in _index_directory
    self._index_directory(Path(local_dir) / directory, parent_id, page_metadata)
  File "/home/user/git/docs-md2conf/venv/lib/python3.11/site-packages/md2conf/application.py", line 100, in _index_directory
    for entry in os.scandir(local_dir):
                 ^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'test-md2conf/test-md2conf/folder-B'

It looks like the recursive call is incorrect.

Thanks!

hunyadi commented 1 month ago

Indeed. When passing the argument to the recursive call, the current directory (as per the directory traversal) was prepended to the nested directory path to scan. Unfortunately, the nested directory path had already been an absolute path. Python's pathlib.Path allows you to combine two absolute paths, in which case it is a no-op, the second path prevails. Unit tests use absolute paths, which is why this issue went unnoticed. However, when you pass a relative directory as the directory to publish (e.g. from the command line), the concatenation is not a no-op but produces the wrong path as in your error report:

FileNotFoundError: [Errno 2] No such file or directory: 'test-md2conf/test-md2conf/folder-B'

This issue has been fixed in the latest commit to the branch master. Thanks for reporting the issue!

ojacques commented 1 month ago

Thanks @hunyadi! I checked with the latest on master, and the problem with double path is gone. But I still don't have the folder structure with the example I attached above. Any idea?

hunyadi commented 1 month ago

Did you delete the pages from Confluence and remove the Confluence page IDs from the Markdown files? Once a Confluence page has been created and a page ID has been assigned, md2conf doesn't adjust parent-child relationships. (This allows us to update pages that are e.g. in the same source directory but scattered across spaces or parent pages in Confluence. This is a feature we are extensively relying on.) md2conf has a function called api.create_page that passes the JSON field ancestors to Confluence, which is how the parent is set. Other functions don't set ancestors. The log above appears to have

update_page [418] - Updating page: 2009315992

which indicates a call to api.update_page. A future version of md2conf could set ancestors for existing pages but this is not implemented yet.

ojacques commented 1 month ago

Did you delete the pages from Confluence and remove the Confluence page IDs from the Markdown files?

Yes. I also deleted the root page and recreate a new one every time.

I am not sure I understand if there is a limitation, or something I should change to get this to work. 😬

hunyadi commented 1 month ago

Thanks for the prompt feedback! I have pushed more extensive changes as part of the commit Fix an issue with index page not assuming parent role. My own manual tests confirm that it's now working as expected (but had the same issue as what you reported prior to the changes). In the future, I will make these automated tests to avoid regressions.