smarinier / importer

Import from Text, HTML and EverNote export for NextCloud
GNU General Public License v3.0

<br/> not properly converted to markdown when inside <pre></pre> tags #2

Open olegme opened 7 months ago

olegme commented 7 months ago

Me again ;-)

After I got it running, I stumbled upon a few quirks, mainly related to handling things inside <pre></pre> tags.

The first thing I could isolate is the handling of <br/> inside <pre></pre> tags.

It is being translated to a <br></br> sequence, which might be right per the HTML spec, but the proper markdown should be something like <space><space><new line>, meaning two spaces followed by a newline.

Can you by any chance take a look or at least give some pointers as to where to start if I would like to try to fix it myself?

Thank you

smarinier commented 7 months ago

Hi again ;)

My notes were very basic, so I didn't run into much trouble. If you could attach a sample here, along with the MD you think it should generate, I can have a look.

Or, if you feel like it, you can have a look at the League converter code. You'll see that the HTML conversion done for Enex and HTML docs (EnexConverter and HtlmConverter in the code) would need to provide a dedicated League/Environment, with a getConverterByTag that reacts to br (a state flag saying we're inside pre may be necessary) and provides a "Converter" that replaces it with space, space, newline.
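To make the idea above concrete, here is a minimal sketch in Python rather than PHP. It is not the League API: the class `PreAwareConverter` and the function `convert` are hypothetical names used only for illustration. It shows a parser that dispatches on the tag name, keeps a "we are inside pre" state, and emits two spaces plus a newline for each br.

```python
import re
from html.parser import HTMLParser


class PreAwareConverter(HTMLParser):
    """Hypothetical illustration: react per tag and remember <pre> state."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.pre_depth = 0  # greater than 0 while inside <pre>
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.pre_depth += 1
        elif tag == "br":
            # Markdown hard line break: two spaces, then a newline.
            # <br/> also lands here, because the default
            # handle_startendtag calls handle_starttag.
            self.out.append("  \n")

    def handle_endtag(self, tag):
        # A closing </br> is ignored, so <br></br> yields one break only.
        if tag == "pre" and self.pre_depth > 0:
            self.pre_depth -= 1

    def handle_data(self, data):
        if self.pre_depth > 0:
            self.out.append(data)  # keep preformatted text verbatim
        else:
            self.out.append(re.sub(r"\s+", " ", data))  # collapse whitespace


def convert(html_text):
    parser = PreAwareConverter()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

With this sketch, `convert("<pre>a<br/>b</pre>")` returns `"a  \nb"`. A real fix would live in a converter registered with the library's environment; the sketch only demonstrates the state handling.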

olegme commented 7 months ago

I dug deeper into the code and it seems the League package is the culprit. Line #175 in their HtmlConverter.php file does a good job of not converting anything between <pre></pre> tags. And there is an issue on their GitHub for exactly my case - [https://github.com/thephpleague/html-to-markdown/issues/245]. No reaction from the author so far.

In terms of the HTML standard, this is a correct approach, and I think it's actually Evernote that doesn't export it properly. To be honest, I wouldn't know how to fix it, but if you want to experiment, I attached my export file here. Templates.enex.gz

When you uncompress and open it, the very first note, titled "OpenWrt Wordpress", has something like this at the very beginning: <pre><br/>mkdir -p /mnt/disk/var/www</pre>. This converts to <br></br>mkdir -p /mnt/disk/var/www, but according to the markdown specification it has to be two spaces and then a line break.

There are a few more issues I experienced with more complicated notes, but I'd better open separate issues as soon as I have time for further testing.

Thank you

smarinier commented 7 months ago

Hi @olegme, I just had a look at it. I'm not so sure this is the right HTML behaviour. In fact, all HTML tags must be escaped, whatever tag they are placed in. Any HTML code meant as literal text must be escaped (with something like &gt; and &lt;).
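As a small illustration of that escaping rule, here is a sketch using Python's standard `html` module (the variable names are only for illustration). Markup that is meant to be shown as text gets its angle brackets replaced by entities, so anything left unescaped inside pre is real markup.

```python
import html

# Markup that should be displayed literally, not interpreted.
snippet = "<pre><br/>mkdir -p /mnt/disk/var/www</pre>"

# quote=False leaves quotation marks alone and escapes only &, < and >.
escaped = html.escape(snippet, quote=False)
print(escaped)
# prints: &lt;pre&gt;&lt;br/&gt;mkdir -p /mnt/disk/var/www&lt;/pre&gt;

# Unescaping restores the original text.
restored = html.unescape(escaped)
```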

So I tried simply commenting out the three lines in HtmlConverter.php (around line 175), and the result looks much better to me.

If you try this, you'll see that the files from your sample are more readable. If this is OK for you, I can subclass the converter and/or try to propose the change to the library (with an option, I guess).

The modified HtmlConverter.php is attached here (but I'm sure you can do it yourself).

HtmlConverter.php.zip

Please send me your feedback on this.

olegme commented 6 months ago

Hi @smarinier,

thank you for the feedback. I'll test the attached class and let you know.

Regards

smarinier commented 6 months ago

Hi @olegme,

Since my previous message, I've been working on it. I've handled your needs as well as the samples given in the issue from the PHP library. I subclassed some objects from this library, because the changes it needs are substantial (in my opinion). As a major point, a comment in the code says "keep HTML code in <pre>": this is against HTML rules.
I'm finishing the new implementation soon and I will invite you to test it ;)

smarinier commented 1 month ago

Hi @olegme, it's been quite a long time, but we moved over the last few months and that was a huge job (before, during and after).

At last, I've pushed what I described previously. All my tests are based on your Templates.enex. Thanks for it.

Please let me know if the conversion looks OK to you now.