leoncvlt / loconotion

📄 Python tool to turn Notion.so pages into lightweight, customizable static websites

Support for new Notion URL format #69

Open bhchiang opened 3 years ago

bhchiang commented 3 years ago

Not 100% sure, but I believe the URL format for Notion shared pages recently changed.

It's now notion.site instead of notion.so:

Editing view: https://www.notion.so/bryanchiang/Bryan-Chiang-fc01c67a1ed9402e83eb8efd5c99a216
Shared view: https://bryanchiang.notion.site/Bryan-Chiang-fc01c67a1ed9402e83eb8efd5c99a216

I get a parser error with the second one.

Ito-MacBook:loconotion bryanhpchiang$ python3 loconotion https://www.notion.so/bryanchiang/fc01c67a1ed9402e83eb8efd5c99a216
[23:09:54] INFO Initialising parser with simple page url
[23:09:54] INFO Setting output path to 'dist/bryanchiang/fc01c67a1ed9402e83eb8efd5c99a216'
[23:09:54] INFO Initialising chromedriver at /usr/local/lib/python3.9/site-packages/chromedriver_autoinstaller/91/chromedriver
[23:09:56] INFO Parsing page 'https://www.notion.so/bryanchiang/fc01c67a1ed9402e83eb8efd5c99a216'
[23:10:57] CRITICAL Timeout waiting for page content to load, or no content found. Are you sure the page is set to public?
Traceback (most recent call last):
  File "/usr/local/Cellar/python@3.9/3.9.4/Frameworks/Python.framework/Versions/3.9/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/Cellar/python@3.9/3.9.4/Frameworks/Python.framework/Versions/3.9/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/bryanhpchiang/Documents/Workspace/loconotion/loconotion/__main__.py", line 144, in <module>
    main()
  File "/Users/bryanhpchiang/Documents/Workspace/loconotion/loconotion/__main__.py", line 123, in main
    Parser(config=config, args=vars(args))
  File "/Users/bryanhpchiang/Documents/Workspace/loconotion/loconotion/notionparser.py", line 85, in __init__
    self.run(url)
  File "/Users/bryanhpchiang/Documents/Workspace/loconotion/loconotion/notionparser.py", line 667, in run
    f"Finished!\n\nProcessed {len(tot_processed_pages)} pages in {formatted_time}"
TypeError: object of type 'NoneType' has no len()

Will try modifying the check for a valid notion.so URL.
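The TypeError at the end of the traceback is a secondary failure: the page load timed out, and the final log line then called len() on a value that was None. A minimal sketch of a guard, assuming a hypothetical helper named format_summary (not part of loconotion) that receives whatever the parser returned:

```python
from typing import Optional, Sequence


def format_summary(processed_pages: Optional[Sequence[str]], formatted_time: str) -> str:
    """Build the final log message without assuming any page was processed.

    format_summary is a hypothetical helper for illustration only;
    it is not a function from loconotion.
    """
    # Treat None (for example, the first page timed out) like an empty list.
    count = len(processed_pages or [])
    return f"Finished!\n\nProcessed {count} pages in {formatted_time}"


print(format_summary(None, "00:01:03"))
```

This only hides the crash; the timeout itself still has to be fixed separately.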

tomreitz commented 3 years ago

@bryanhpchiang it looks like the page you linked isn't publicly shared... unless you recently un-shared it, that would explain why loconotion cannot load the content.

There's definitely an issue here though, since shared pages have the new Notion URL format of https://example.notion.site/Page-1F29BC48EA1A029FC481B but sub-pages still have the old format https://notion.so/example/Page-1F29BC48EA1A029FC481B which is a redirect page.

For me, loconotion correctly loads the primary public Notion page URL, but times out on any subpages. I think the logic at lines 582-584 of loconotion/notionparser.py needs to be updated to rewrite page URLs from the old format to the new one before attempting to fetch them.

leoncvlt commented 3 years ago

@tomreitz merged a pull request from @bryanhpchiang earlier today which should address this, want to pull it and check it's all good?

tomreitz commented 3 years ago

@leoncvlt thanks for the quick response (and an awesome project!). Subpages are still not working for me. See this public page, which converts fine, but the subpages in the table time out, per the logs below:

[21:47:02] INFO Initialising parser with configuration file
[21:47:02] INFO Setting output path to 'dist/wiwebsites.com'
[21:47:02] INFO Initialising chromedriver at /usr/bin/chromedriver
[21:47:03] INFO Parsing page 'https://tomreitz.notion.site/Wisconsin-Websites-ecdb3dc4cd1e40f280b7512a23ca2006'
[21:47:17] INFO Downloading 'https://www.notion.so/print.b31f28aa.css'
[21:47:17] INFO Downloading 'https://www.notion.so/app-7d82edb35207a8a8b776.css'
[21:47:18] INFO Downloading 'https://www.notion.so/lyon-text-regular-3be84b20b1d9ff1e3456b0a220ae449b.woff'
[21:47:18] INFO Downloading 'https://www.notion.so/lyon-text-regular-italic-437d32a42fc5b8268bb4a1e0cc8b363f.woff'
[21:47:18] INFO Downloading 'https://www.notion.so/lyon-text-semibold-acb7f110189034ff6a1afa4b730be0ed.woff'
[21:47:18] INFO Downloading 'https://www.notion.so/lyon-text-semibold-italic-1f81a2f93060f05edd7f078ac91f25e6.woff'
[21:47:18] INFO Downloading 'https://www.notion.so/iawriter-mono-regular-4b73d071988a4f1cd2283524716ad970.woff'
[21:47:18] INFO Downloading 'https://www.notion.so/iawriter-mono-italic-d5d3224c1377168e261efc6aa0ce89c6.woff'
[21:47:18] INFO Downloading 'https://www.notion.so/iawriter-mono-bold-eb96a5e539892d26cf8b0cb2367e3580.woff'
[21:47:18] INFO Downloading 'https://www.notion.so/iawriter-mono-bold-italic-743b231fa82483406c79a00fa1f12fe8.woff'
[21:47:18] INFO Downloading 'https://www.notion.so/inter-ui-regular-3ae6a7d3890c33d857fc00bd2e4c4820.woff'
[21:47:18] INFO Downloading 'https://www.notion.so/inter-ui-medium-95b8a98959d1af9ab432d7ffe295ef94.woff'
[21:47:18] INFO Downloading 'https://www.notion.so/inter-ui-semibold-19b57197b819695d334b9961ee41910e.woff'
[21:47:18] INFO Downloading 'https://www.notion.so/inter-ui-bold-001893789f7f342b520f29ac8af7d6ca.woff'
[21:47:18] INFO Downloading 'https://www.notion.so/permanent-marker-a6d62939e7c920a184ddddcf4149e62c.woff'
[21:47:18] INFO Downloading 'https://www.notion.so/katex/katex.88defe76.min.css'
[21:47:18] INFO Downloading 'https://www.notion.so/katex/fonts/KaTeX_AMS-Regular.342a61e0.ttf'
[21:47:18] INFO Downloading 'https://www.notion.so/katex/fonts/KaTeX_Caligraphic-Bold.b27e354b.ttf'
[21:47:18] INFO Downloading 'https://www.notion.so/katex/fonts/KaTeX_Caligraphic-Regular.bd18bae2.ttf'
[21:47:19] INFO Downloading 'https://www.notion.so/katex/fonts/KaTeX_Fraktur-Bold.359e1e97.ttf'
[21:47:19] INFO Downloading 'https://www.notion.so/katex/fonts/KaTeX_Fraktur-Regular.6b53a2db.ttf'
[21:47:19] INFO Downloading 'https://www.notion.so/katex/fonts/KaTeX_Main-Bold.ed829b5f.ttf'
[21:47:19] INFO Downloading 'https://www.notion.so/katex/fonts/KaTeX_Main-BoldItalic.ca23ba4b.ttf'
[21:47:19] INFO Downloading 'https://www.notion.so/katex/fonts/KaTeX_Main-Italic.14ff9c98.ttf'
[21:47:19] INFO Downloading 'https://www.notion.so/katex/fonts/KaTeX_Main-Regular.c89c6436.ttf'
[21:47:19] INFO Downloading 'https://www.notion.so/katex/fonts/KaTeX_Math-BoldItalic.7b481bb8.ttf'
[21:47:19] INFO Downloading 'https://www.notion.so/katex/fonts/KaTeX_Math-Italic.f677173e.ttf'
[21:47:19] INFO Downloading 'https://www.notion.so/katex/fonts/KaTeX_SansSerif-Bold.362d94c6.ttf'
[21:47:19] INFO Downloading 'https://www.notion.so/katex/fonts/KaTeX_SansSerif-Italic.2c742978.ttf'
[21:47:19] INFO Downloading 'https://www.notion.so/katex/fonts/KaTeX_SansSerif-Regular.6087fc04.ttf'
[21:47:19] INFO Downloading 'https://www.notion.so/katex/fonts/KaTeX_Script-Regular.781730b2.ttf'
[21:47:19] INFO Downloading 'https://www.notion.so/katex/fonts/KaTeX_Size1-Regular.54a80b37.ttf'
[21:47:19] INFO Downloading 'https://www.notion.so/katex/fonts/KaTeX_Size2-Regular.24cbe093.ttf'
[21:47:19] INFO Downloading 'https://www.notion.so/katex/fonts/KaTeX_Size3-Regular.ee3e5bf4.ttf'
[21:47:19] INFO Downloading 'https://www.notion.so/katex/fonts/KaTeX_Size4-Regular.b78c75bb.ttf'
[21:47:19] INFO Downloading 'https://www.notion.so/katex/fonts/KaTeX_Typewriter-Regular.90f78c10.ttf'
[21:47:19] INFO Exporting page 'https://tomreitz.notion.site/Wisconsin-Websites-ecdb3dc4cd1e40f280b7512a23ca2006' as 'index.html'
[21:47:19] INFO Parsing page 'https://www.notion.so/7514e88c4042418997665b5ecf11733b?v=703812ea01fe4ee6bc010fd72be278f8'
[21:48:20] CRITICAL Timeout waiting for page content to load, or no content found. Are you sure the page is set to public?
[21:48:20] INFO Parsing page 'https://www.notion.so/80f1c747841641e2a729fb0286390da2'
[21:49:21] CRITICAL Timeout waiting for page content to load, or no content found. Are you sure the page is set to public?
[21:49:21] INFO Parsing page 'https://www.notion.so/e861fdd6a0c247ca8bad342d2cdb05b6'
[21:50:22] CRITICAL Timeout waiting for page content to load, or no content found. Are you sure the page is set to public?
[21:50:22] INFO Parsing page 'https://www.notion.so/d5c95ef2e77349e98691b8925de7d119'
[21:51:23] CRITICAL Timeout waiting for page content to load, or no content found. Are you sure the page is set to public?
[21:51:23] INFO Finished!

Processed 1 pages in 00:04:19

If you go to a subpage directly, you'll see that it is public, but is a redirect page from Notion.

bhchiang commented 3 years ago

Thanks for pointing that out - my PR doesn't handle subpages. When parsing the subpages (sub_page_href), the www.notion.so part should be replaced with {site_name}.notion.site.

I tried a quick fix but there are a few edge cases in the code that I am probably missing, so not submitting a PR yet.
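The replacement described above could look roughly like this. It is only a sketch: to_site_url and site_name are hypothetical names, not identifiers from loconotion, and it does not try to cover the edge cases mentioned. It is meant for page links only, not for asset URLs such as the CSS and font files in the logs.

```python
from urllib.parse import urlsplit, urlunsplit


def to_site_url(sub_page_href: str, site_name: str) -> str:
    """Rewrite a notion.so page URL to the {site_name}.notion.site form.

    Hypothetical helper for illustration; URLs on other hosts are
    returned unchanged.
    """
    parts = urlsplit(sub_page_href)
    if parts.netloc not in ("www.notion.so", "notion.so"):
        return sub_page_href

    path = parts.path
    # Old-style links may carry the workspace name as the first path
    # segment (https://www.notion.so/<site_name>/<page>); drop it.
    prefix = f"/{site_name}/"
    if path.startswith(prefix):
        path = path[len(prefix) - 1:]

    new_host = f"{site_name}.notion.site"
    return urlunsplit((parts.scheme, new_host, path, parts.query, parts.fragment))


print(to_site_url("https://www.notion.so/80f1c747841641e2a729fb0286390da2", "tomreitz"))
```

Query strings such as the ?v=... view parameter seen in the logs are carried over untouched.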

EveraertJan commented 3 years ago

Hi,

I'm currently running into the same issue. Is there any fix available?

specbug commented 3 years ago

@bryanhpchiang can you post the partial fix here? Others can try it out and help in fixing the edge cases.

joshkmartinez commented 3 years ago

I'm having this issue as well. Would appreciate the partial fix if possible @bryanhpchiang

PiktCai commented 3 years ago

I am not a developer, but there is a quick way to make it work: simply edit the links at lines 582-584 of loconotion/notionparser.py. Before editing:

            if sub_page_href.startswith("/"):
                sub_page_href = "https://www.notion.so" + a["href"]
            if sub_page_href.startswith("https://www.notion.so/"):
                if parse_links or not len(a.find_parents("div", class_="notion-scroller")):

After editing:

            if sub_page_href.startswith("/"):
                sub_page_href = "https://xxxx.notion.site" + a["href"]
            if sub_page_href.startswith("https://xxxx.notion.site/"):
                if parse_links or not len(a.find_parents("div", class_="notion-scroller")):

When running the program, I used python loconotion https://xxxx.notion.site/xxxx/{page-id}. Hope this helps.
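Rather than hard-coding xxxx.notion.site, the host could be derived from the URL passed on the command line. A sketch under that assumption; site_base, resolve_sub_page and is_same_site are hypothetical helpers, not loconotion functions:

```python
from urllib.parse import urljoin, urlsplit


def site_base(start_url: str) -> str:
    """Return scheme and host of the page being parsed, e.g. https://xxxx.notion.site."""
    parts = urlsplit(start_url)
    return f"{parts.scheme}://{parts.netloc}"


def resolve_sub_page(start_url: str, href: str) -> str:
    """Resolve a link found on the page against the site being parsed."""
    # urljoin resolves hrefs starting with "/" against the host and
    # leaves absolute URLs unchanged.
    return urljoin(site_base(start_url) + "/", href)


def is_same_site(start_url: str, url: str) -> bool:
    """True when url lives on the same host as the start page."""
    return urlsplit(url).netloc == urlsplit(start_url).netloc


start = "https://xxxx.notion.site/Page-abc123"
print(resolve_sub_page(start, "/def456"))
```

With helpers like these, the two startswith checks in the snippet above would compare against the derived host instead of a fixed string.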

sunz1e commented 3 years ago

Created a PR to use the new custom Notion URL format https://xxxx.notion.site instead of the default one. I also saw an issue where a subfolder is expected for links of the format https://xxxx.notion.site/xxxx (I hit this while parsing my own website) and fixed that as well.

sunz1e commented 3 years ago

@bryanhpchiang could you please pull the PR and verify that it's working for you as well?

bhchiang commented 3 years ago

@meSunnySrivastava

Thanks for putting together this PR. Confirming that it did work for my website to parse subpages.

The only issue is that bullet points are now missing.

EDIT: I see that this was supposed to be fixed by https://github.com/leoncvlt/loconotion/pull/73, and that your PR merged those changes as well.

EDIT: Deleting my dist/ + regenerating fixed the issue. The PR looks good to me, thanks!

sunz1e commented 3 years ago

Sorry I had to close the old PR because I pushed to my master directly. :)

leoncvlt commented 3 years ago

PR has been merged, thanks all!

jamesdeluk commented 3 years ago

I'm still getting the timeout issue. Exact same as the original post above.

The page is set to public.

The link is https://jamesdeluk.notion.site/James-IT-Notes-9969909992c04b5ba3a734cdf0a74530

(The Copy Link button gives https://www.notion.so/jamesdeluk/James-IT-Notes-9969909992c04b5ba3a734cdf0a74530, which forwards to the above).

jamesdeluk commented 3 years ago

Thought I'd try this again with the new Notion update. A couple things:

Trying to access the .site page itself fails.

And webdrive.log loops this:

[1632288655.436][INFO]: Waiting for pending navigations...
[1632288655.437][INFO]: Done waiting for pending navigations. Status: ok
[1632288655.445][INFO]: Waiting for pending navigations...
[1632288655.447][INFO]: Done waiting for pending navigations. Status: ok
[1632288655.447][INFO]: [edc259a3fc220da0c2d6ba0789803d04] RESPONSE FindElements [  ]
[1632288655.957][INFO]: [edc259a3fc220da0c2d6ba0789803d04] COMMAND FindElements {
   "using": "css selector",
   "value": ".notion-presence-container"
}

leoncvlt commented 3 years ago

Well, that's not gonna work regardless because you're not logged in, so the script is unable to find the notion-presence-container div which is present on every notion page - it's gonna work with public pages only.

jamesdeluk commented 3 years ago

That's my confusion though. I am logged in, and the page is public.

leoncvlt commented 2 years ago

Just checking @leshchenko1979, is this fixed by #92?

2m commented 2 years ago

I am using current master version of loconotion with the new style URLs and it seems to work fine: https://github.com/2m/nemunasring/blob/main/nemunasring.toml#L2

sueszli commented 1 year ago

Since Notion updated all URLs for hosted pages (see: https://github.com/leoncvlt/loconotion/issues/134) this ticket is no longer an enhancement, but a permanent bug.

We resolved it in our fork here: https://github.com/sueszli/notionSnapshot/