oleast / contentful-rich-text-html-parser

Convert just about any HTML document to the Contentful Rich Text format
https://www.npmjs.com/package/contentful-rich-text-html-parser
MIT License

Handling Spaces and Line Breaks in HTML Input #188

Open amirhbakan opened 3 weeks ago

amirhbakan commented 3 weeks ago

When I pass the following HTML:

<h2>Explore the Wonders of the Ocean</h2>

<p>The ocean is a vast and mysterious place, teeming with life and beauty. From the smallest plankton to the largest whales, the marine ecosystem is incredibly diverse.</p>

It is converted to an invalid Contentful rich text document: because of the \n characters between the tags, nodes with node type text end up at the root of the document. The Contentful Management API returns an error when you pass it this document.
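To make the failure concrete, here is a rough sketch of a check for text nodes sitting directly under the document node. The types are minimal stand-ins written for this example, the document is hand-written to illustrate the shape described above (it is not captured library output), and findRootTextNodes is a hypothetical helper.

```typescript
// Minimal stand-in types for illustration only; the real definitions are in
// @contentful/rich-text-types.
interface TextNode {
  nodeType: "text";
  value: string;
  marks: { type: string }[];
  data: Record<string, unknown>;
}
interface BlockNode {
  nodeType: string;
  data: Record<string, unknown>;
  content: RichTextNode[];
}
type RichTextNode = BlockNode | TextNode;
interface DocumentNode {
  nodeType: "document";
  data: Record<string, unknown>;
  content: RichTextNode[];
}

const text = (value: string): TextNode => ({
  nodeType: "text",
  value,
  marks: [],
  data: {},
});

function isText(node: RichTextNode): node is TextNode {
  return node.nodeType === "text";
}

// Hypothetical helper: returns the indices of text nodes that sit directly
// under the document node, which is the situation reported in this issue.
function findRootTextNodes(doc: DocumentNode): number[] {
  const indices: number[] = [];
  doc.content.forEach((node, i) => {
    if (isText(node)) indices.push(i);
  });
  return indices;
}

// Hand-written illustration: the "\n\n" between the two tags has become a
// text node at the root of the document.
const invalidDoc: DocumentNode = {
  nodeType: "document",
  data: {},
  content: [
    {
      nodeType: "heading-2",
      data: {},
      content: [text("Explore the Wonders of the Ocean")],
    },
    text("\n\n"),
    {
      nodeType: "paragraph",
      data: {},
      content: [text("The ocean is a vast and mysterious place.")],
    },
  ],
};

console.log(findRootTextNodes(invalidDoc)); // [1]
```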

[Screenshot: Contentful line break error]

I solved it by removing all the spaces and line breaks between HTML tags, and at the start and end of my input. I believe the library should have an option to choose between ignoring spaces and line breaks (by minifying the HTML) and converting them to valid <br/> tags. Even if an extra option isn't warranted, the library should at least apply one of the two approaches above before passing the HTML to the converter.

Steps to Reproduce: Convert the HTML above with the library, then pass the result to the Contentful Management API.

Actual Behavior: The HTML is converted to an invalid Contentful rich text structure with text nodes at the root of the document, caused by the \n characters. When I pass it to the Contentful Management API, it returns an error saying that the block with node type text is invalid.

Suggested Solution: Add an option to either ignore spaces and line breaks or convert them into something valid, or minify the HTML before converting it to the Contentful rich text data structure.
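The workaround described in this issue (removing whitespace between tags and at both ends of the input before converting) might look roughly like the sketch below. stripInterTagWhitespace is a hypothetical helper, not part of the library, and the regex approach is an assumption about how the cleanup was done.

```typescript
// Hypothetical helper (not part of the library): remove whitespace that sits
// purely between a ">" and the next "<", and trim both ends of the input.
function stripInterTagWhitespace(html: string): string {
  return html.replace(/>\s+</g, "><").trim();
}

const input = `
<h2>Explore the Wonders of the Ocean</h2>

<p>The ocean is a vast and mysterious place.</p>
`;

const cleaned = stripInterTagWhitespace(input);
console.log(cleaned);
// <h2>Explore the Wonders of the Ocean</h2><p>The ocean is a vast and mysterious place.</p>
```

One caveat with this approach: it also removes whitespace that carries meaning, such as the space between two inline elements (`<b>a</b> <i>b</i>` becomes `<b>a</b><i>b</i>`) or whitespace between tags inside `<pre>`.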

oleast commented 2 weeks ago

Good catch, and thanks for the detailed issue, @amirhbakan!

If I understand the issue correctly, the problem is just the text nodes at the root of the document. Which means any text or whitespace at the root of the original HTML input would result in the same problem? E.g.:

<h2>Heading</h2>
Text outside an element
<p>Text inside an element</p>

My understanding is that we can't turn all \n characters into <br />, because newlines are a valid part of the text inside an HTML document and do not necessarily serve the same purpose as <br />. If a user of the library wants that behavior, it should be possible to do it with a custom converter function.

Since the problem we're dealing with is text and whitespace that is at the root of an HTML-document (outside any HTML-tag), we're actually dealing with invalid HTML? But since we're dealing with what might be partial HTML-snippets, we'll have to handle it anyway.

It would be possible to just add an option like ignoreWhitespaceAtDocumentRoot. But that seems like a very specific solution. We could also just do that by default, without an option, since the resulting document would be invalid anyway. That would, however, not solve the problem with other text nodes at the document root.

I'll have to think about a solution a bit, thanks for the input!

oleast commented 2 weeks ago

Ok, since we're specifically dealing with Contentful here, maybe we could just wrap all text nodes at the root level of the document in a paragraph block?

If we reverse the situation, the Contentful Rich Text editor would never let you write text nodes at the root level; they'll always be wrapped in a paragraph node.
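A minimal sketch of that idea, combined with dropping whitespace-only text at the root, could look like the following. The types are stand-ins written for this example and wrapRootText is a hypothetical post-processing step; it does not reflect how the library is (or will be) implemented.

```typescript
// Minimal stand-in types for illustration only; the real definitions are in
// @contentful/rich-text-types.
interface TextNode {
  nodeType: "text";
  value: string;
  marks: { type: string }[];
  data: Record<string, unknown>;
}
interface BlockNode {
  nodeType: string;
  data: Record<string, unknown>;
  content: RichTextNode[];
}
type RichTextNode = BlockNode | TextNode;
interface DocumentNode {
  nodeType: "document";
  data: Record<string, unknown>;
  content: RichTextNode[];
}

const text = (value: string): TextNode => ({
  nodeType: "text",
  value,
  marks: [],
  data: {},
});

function isText(node: RichTextNode): node is TextNode {
  return node.nodeType === "text";
}

// Hypothetical post-processing step: whitespace-only text nodes at the root
// are dropped, and any other root-level text node is wrapped in a paragraph.
function wrapRootText(doc: DocumentNode): DocumentNode {
  const content: RichTextNode[] = [];
  for (const node of doc.content) {
    if (!isText(node)) {
      content.push(node); // blocks are kept untouched
    } else if (node.value.trim() !== "") {
      content.push({ nodeType: "paragraph", data: {}, content: [node] });
    }
    // whitespace-only root text falls through and is dropped
  }
  return { ...doc, content };
}

// Hand-written input mirroring the heading / loose text / paragraph example.
const parsed: DocumentNode = {
  nodeType: "document",
  data: {},
  content: [
    { nodeType: "heading-2", data: {}, content: [text("Heading")] },
    text("\nText outside an element\n"),
    { nodeType: "paragraph", data: {}, content: [text("Text inside an element")] },
    text("\n"),
  ],
};

const fixed = wrapRootText(parsed);
console.log(fixed.content.map((node) => node.nodeType));
// [ 'heading-2', 'paragraph', 'paragraph' ]
```

This sketch leaves the wrapped text value as-is, including its leading and trailing newlines, and gives each root-level text node its own paragraph. Whether to trim the value or merge adjacent text nodes into one paragraph is an open design choice.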

amirhbakan commented 2 weeks ago

@oleast Thanks for your reply. As I mentioned in the issue, I solved it in my code by removing all the spaces and \n characters between HTML tags from my input. In HTML files, line breaks and spaces between closed tags are generally ignored, so when I pass HTML to a library I expect the same behaviour. If I really need a line break there, I add a <br/>, or a <p></p> with a line break inside it. To stay aligned with the general behaviour of HTML files, I recommend ignoring all these \n characters.