hunyadi / md2conf

Publish Markdown files to Confluence wiki
MIT License

URL unquote #61

Closed pgsantos-pt closed 2 months ago

pgsantos-pt commented 2 months ago

Hello,

After you merged PR #58, I noticed that you removed the statement url = urllib.parse.unquote(anchor.attrib["href"]) from the first line of the md2conf.converter.ConfluenceStorageFormatConverter._transform_link method. As far as I can tell, the url variable is local and is only used to help convert links, correct? If so, the statement I added should be fairly harmless, but I might be wrong, so please let me know if I have misunderstood something.

Why did I add this statement? Confluence page titles must be unique, so to avoid overwriting pages that belong to other people in the company, we usually name our files like [ProjectID] Overview.md. Because of the special characters, the Markdown editor automatically writes the relative path in URL-encoded form, as [Overview](%5BProjectID%5D%20Overview.md). When you run the script and a log line such as INFO - synchronize_directory [61] - indexed 2 page(s) appears, one of the pages has been stored under its file name, in this case [ProjectID] Overview.md. When the script reaches the conversion step, however, it finds the relative path %5BProjectID%5D%20Overview.md, and the lookup fails because the page was stored as [ProjectID] Overview.md. That is why I need the unquote statement.
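The mismatch can be reproduced with the standard library alone. This is only a minimal sketch: the indexed_pages dictionary and its placeholder value are hypothetical stand-ins for whatever index md2conf builds internally, not its actual data structure.

```python
from urllib.parse import quote, unquote

# Hypothetical index keyed by file name (an assumption for illustration only).
indexed_pages = {"[ProjectID] Overview.md": "placeholder-page-id"}

# The link target as the Markdown editor writes it.
href = "%5BProjectID%5D%20Overview.md"

# The raw href does not match the stored file name ...
found_raw = href in indexed_pages

# ... but the URL-decoded href does.
found_decoded = unquote(href) in indexed_pages

# Encoding the file name gives back the form the editor produced.
encoded_name = quote("[ProjectID] Overview.md")

print(found_raw, found_decoded, encoded_name)
```

Here found_raw is False and found_decoded is True, which is the lookup failure described above.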

My question is: could you reintroduce that statement, or would it create potential problems? Maybe I was too eager in unquoting the whole href, but ideally I would need the unquote at least for the relative path lookup.

Thank you in advance.

hunyadi commented 2 months ago

As far as I am aware, if you enclose a URL in angle brackets, it can contain verbatim square brackets and spaces:

[Overview](<[ProjectID] Overview.md>)

When rendering Markdown as HTML, this turns into a link with the right value for href:

<a href="[ProjectID] Overview.md">

The browser is going to URL-encode the href value when the referred document is fetched from a remote server.

I am always a little hesitant to apply a transformation unconditionally. Does Markdown mandate that all references are URL-encoded? If so, it is the right approach to URL-decode references. Otherwise, md2conf might decode something that is not meant to be decoded, e.g. %30 off would turn into 0 off (%30 is the code for the character 0).
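To make the concern concrete, here is a small sketch of what unconditional decoding with Python's urllib.parse.unquote does to text that was never meant to be URL-encoded:

```python
from urllib.parse import unquote

# Text that merely happens to contain a percent sign followed by two hex digits.
plain_text = "%30 off"

# unquote treats "%30" as an escape sequence and replaces it with "0".
decoded = unquote(plain_text)

print(decoded)
```

The decoded value is "0 off", so the original meaning of the text is lost.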

We might be able to cut the Gordian knot if we URL-decode only those strings that consist entirely of URL-encoded characters, and leave everything else as is.
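One possible reading of that idea is sketched below. It is not what md2conf implements: the helper name unquote_if_encoded is hypothetical, and the character set is an assumption that interprets "URL-encoded characters only" as "every character is either allowed to appear unescaped in a URL or is part of a valid %XX escape".

```python
import re
from urllib.parse import unquote

# Assumption: a string counts as URL-encoded when it contains only characters
# from this URL-safe set plus well-formed %XX escapes. Spaces and square
# brackets are deliberately excluded, so text containing them is left alone.
_URL_ENCODED = re.compile(r"(?:[A-Za-z0-9\-._~:/?#@!$&'()*+,;=]|%[0-9A-Fa-f]{2})*")


def unquote_if_encoded(href: str) -> str:
    """Decode href only if it looks fully URL-encoded (hypothetical helper)."""
    if _URL_ENCODED.fullmatch(href):
        return unquote(href)
    return href


print(unquote_if_encoded("%5BProjectID%5D%20Overview.md"))  # decoded
print(unquote_if_encoded("%30 off"))                        # left unchanged
print(unquote_if_encoded("[ProjectID] Overview.md"))        # left unchanged
```

This heuristic is not airtight: a string such as %30off contains no space, so it would still be decoded. Whether that trade-off is acceptable is a design decision.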

pgsantos-pt commented 2 months ago

OK, let me try that first approach and I'll get back to you.

pgsantos-pt commented 2 months ago

That worked! Many thanks 🙏