processwire / processwire-issues

ProcessWire issue reports.

TinyMCE Paste Filter doesn't carry over formatting from MS Word or Google Docs #1866

Open jlahijani opened 8 months ago

jlahijani commented 8 months ago

When using TinyMCE with the default paste filter settings, it doesn't include the formatting when pasting from MS Word or Google Docs.

However it does work correctly when copying and pasting from something like the rich text editor here (which I'm telling clients to paste to then copy from as a temporary work-around): https://html-cleaner.com/

ryancramerdesign commented 8 months ago

@jlahijani I don't have MS Word, but I did try with Google Docs and wasn't able to duplicate it. When I copy/paste a Google Docs document full of headlines, lists and links, they all come through properly in TinyMCE. Is it possible that the document you are copying/pasting uses [for example] font sizes rather than the editor's heading styles for things like headlines?

jlahijani commented 8 months ago

With the update to 3.0.235 (which updated TinyMCE to the newest version), it looks like the issues with MS Word (in Windows, didn't test macOS) are now fixed.

Google Docs is still imperfect. I will look into it further.

jlahijani commented 8 months ago

@ryancramerdesign I made a video demonstrating a bug with pastefilter and some other considerations when pasting from MS Word:

https://www.youtube.com/watch?v=wMmmzeVi4cY

jlahijani commented 8 months ago

I made a couple mistakes in my video:

  1. I kept saying "unordered" list when I meant to say "ordered" list. However the same applies to both.
  2. In Word, I have text that says "Headline 3", when it's been actually set to be formatted as H2.

jlahijani commented 7 months ago

Diving deeper into this... based on some quick research, there doesn't seem to be any open-source JS code that handles this stupid age-old problem. However, I experimented with ChatGPT to write the necessary JS and it looks promising.

This is what I got with two very basic prompts which were:

Can you write a javascript function that cleans up messy code when pasting from word and converts it to clean html? ... How about unordered lists?

Result...

function cleanWordHTML(input) {
    let doc = new DOMParser().parseFromString(input, 'text/html');

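    // Handle unordered lists first, while the "mso-list" styles still exist
    // Word often uses <p> or other tags with special styles or classes for lists
    const isListItem = el => !!el && el.tagName === 'P' &&
        (el.getAttribute('style') || '').indexOf('mso-list') !== -1;

    doc.querySelectorAll('p[style*="mso-list"]').forEach(first => {
        // Skip paragraphs already moved into a list by an earlier iteration
        if (!first.parentNode) return;

        // Create the <ul> in the parsed document, not the page's document
        let ul = doc.createElement('ul');
        first.parentNode.insertBefore(ul, first);

        // Move this paragraph and each consecutive list paragraph into the <ul>
        let p = first;
        while (isListItem(p)) {
            let li = doc.createElement('li');
            while (p.firstChild) {
                li.appendChild(p.firstChild);
            }

            ul.appendChild(li);
            let nextP = p.nextElementSibling;
            p.parentNode.removeChild(p);
            p = nextP;
        }
    });

    // Remove Word's "Mso" classes and style attributes
    // (done after list detection, which depends on the style attribute)
    doc.querySelectorAll('[class^="Mso"], [class^="mso"], [style]').forEach(el => {
        el.removeAttribute('class');
        el.removeAttribute('style');
    });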

    // Remove remaining <span> tags but keep their content
    doc.querySelectorAll('span').forEach(span => {
        let parent = span.parentNode;
        while (span.firstChild) parent.insertBefore(span.firstChild, span);
        parent.removeChild(span);
    });

    // Return the cleaned HTML
    return doc.body.innerHTML;
}

// Usage example
document.getElementById('yourContentEditableDiv').addEventListener('paste', (event) => {
    event.preventDefault();
    const text = (event.clipboardData || window.clipboardData).getData('text/html');
    const cleanHTML = cleanWordHTML(text);
    document.execCommand('insertHTML', false, cleanHTML);
});

Anyway, that is to say the tricky stuff with regard to detecting a list and wrapping it in a ul tag... GPT knows how to program that and probably all the other silliness with Word formatting, which may be helpful.

Remember it can simply be asked to convert it to jQuery style as well.

jlahijani commented 7 months ago

One other library that may be helpful is Summernote Cleaner, which is a 3rd party plugin for the Summernote rich text editor. I'm sure their cleaner is pretty advanced, although I haven't tested it. May be worth looking into:

https://github.com/DiemenDesign/summernote-cleaner

ryancramerdesign commented 7 months ago

Thanks @jlahijani That video was helpful. While I don't have MS Word to duplicate the issue, I was able to copy the Word markup out of your video and substitute it in pasteFilter to see how it would clean it up. I found that it cleaned it up reasonably well but left the conditional comments and <o:p> tags, and it didn't convert the bold or italic tags like you mentioned. I have made some updates which should fix all of that... at least it did in my testing here. Can you confirm that it also fixes it there?
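
For illustration, here is a minimal sketch of how conditional comments and `<o:p>` tags could be stripped when working on the raw HTML string rather than the DOM. The function name `stripWordDebris` and the sample markup are hypothetical; this is not ProcessWire's actual pasteFilter code, just one regex-based approach under the assumption that the Word markup arrives as a plain string.

```javascript
// Hypothetical helper; not part of ProcessWire's pasteFilter.
function stripWordDebris(html) {
    return html
        // Conditional comments such as <!--[if gte mso 9]> ... <![endif]-->
        .replace(/<!--\[if[\s\S]*?<!\[endif\]-->/gi, '')
        // Bare markers such as <![if !supportLists]> and <![endif]>
        .replace(/<!\[(?:if[^\]]*|endif)\]>/gi, '')
        // Office namespace tags such as <o:p> and </o:p>; inner text is kept
        .replace(/<\/?o:[a-z]+[^>]*>/gi, '');
}

// Made-up sample input for demonstration
const sample = '<p class="MsoNormal">Hello<o:p></o:p></p>' +
    '<!--[if gte mso 9]><xml>x</xml><![endif]-->';
console.log(stripWordDebris(sample));
```

Regex-based cleanup like this is brittle compared to DOM parsing, but it never places untrusted markup into a document.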

ryancramerdesign commented 7 months ago

Btw, I don't think we can do anything with the Word ordered/unordered lists, as it's MS Word that's converting them to <p> elements, and without any begin/end list tags present, we can't very easily convert them to a proper ul/ol list. But the latest pasteFilter update leaves them as just <p>List item</p> values, so it's a simple matter at that point to just select which items should be in the list and then click the UL or OL icon to convert them when needed.

ryancramerdesign commented 7 months ago

Regarding the other conversion methods, those rely on having the markup in the DOM. In our case, we are operating on the raw HTML/text, as that's what TinyMCE gives us, plus it's probably not safe to place into the DOM at this stage. Once TinyMCE inserts it into the editor, we could always go back and manipulate as DOM elements, which is possible, but probably outside the scope of the pasteFilter.

jlahijani commented 7 months ago

> Regarding the other conversion methods, those rely on having the markup in the DOM. In our case, we are operating on the raw HTML/text, as that's what TinyMCE gives us, plus it's probably not safe to place into the DOM at this stage. Once TinyMCE inserts it into the editor, we could always go back and manipulate as DOM elements, which is possible, but probably outside the scope of the pasteFilter.

That's a good point and one I didn't consider.

I will test the changes a bit further when time permits as well as Google Docs (and a little more Word). I will also provide the raw HTML that gets pasted so you don't have to rewrite that by hand.

jlahijani commented 7 months ago

I made two videos about Google Docs: https://www.youtube.com/watch?v=qDbRsOYGvBk https://www.youtube.com/watch?v=VO5SGquoXEc

Raw code video 1:

<meta charset="utf-8"><b id="docs-internal-guid-39578836-7fff-ffe8-df71-0199fecdd34e"><p dir="ltr"><span>This is </span><span>bold</span><span> text.</span></p><br /><p dir="ltr"><span>This is normal text but </span><span>this is italic</span><span>.</span></p><br /><p dir="ltr"><span>A line</span></p><p dir="ltr"><span>Another line without hitting enter twice.</span></p><br /><p dir="ltr"><span>What about </span><span>bold italic</span><span>?</span></p><h2 dir="ltr"><span>This is headline 2.</span></h2><br /><p dir="ltr"><span>This is a bullet list:</span></p><br /><ul><li dir="ltr" aria-level="1"><p dir="ltr" role="presentation"><span>one</span></p></li><li dir="ltr" aria-level="1"><p dir="ltr" role="presentation"><span>two is italic</span></p></li><li dir="ltr" aria-level="1"><p dir="ltr" role="presentation"><span>three</span></p></li></ul><br /><p dir="ltr"><span>Another line of text.</span></p></b>

Raw code video 2:

<meta charset="utf-8"><b id="docs-internal-guid-e372d8f2-7fff-6b68-3080-4c08a524fa8d"><p dir="ltr"><span>bla bla bla&nbsp;</span></p><br /><p dir="ltr"><span>this is a line of text, then the [enter] key is pressed</span></p><p dir="ltr"><span>here is the second line</span></p><br /><p dir="ltr"><span>this is a line of text, then the [shift+enter] keys are pressed</span><span><br /></span><span>here is the second line</span></p><br /><p dir="ltr"><span>bla bla bla</span></p></b>

ryancramerdesign commented 7 months ago

@jlahijani Thanks. Just looking at the first example to start. But here is the input markup from Google Docs. It's strange because it doesn't seem like there's any bold or italic retained in it, and instead the entire batch of markup is wrapped in a <b> tag. So it looks like any info about bold or italic was removed prior to pasteFilter even seeing it?

  1. Original input

    <meta charset="utf-8">
    <b id="docs-internal-guid-39578836-7fff-ffe8-df71-0199fecdd34e">
    <p dir="ltr"><span>This is </span><span>bold</span><span> text.</span></p>
    <br />
    <p dir="ltr"><span>This is normal text but </span><span>this is italic</span><span>.</span></p>
    <br />
    <p dir="ltr"><span>A line</span></p>
    <p dir="ltr"><span>Another line without hitting enter twice.</span></p>
    <br />
    <p dir="ltr"><span>What about </span><span>bold italic</span><span>?</span></p>
    <h2 dir="ltr"><span>This is headline 2.</span></h2>
    <br />
    <p dir="ltr"><span>This is a bullet list:</span></p>
    <br />
    <ul>
        <li dir="ltr" aria-level="1">
            <p dir="ltr" role="presentation"><span>one</span></p>
        </li>
        <li dir="ltr" aria-level="1">
            <p dir="ltr" role="presentation"><span>two is italic</span></p>
        </li>
        <li dir="ltr" aria-level="1">
            <p dir="ltr" role="presentation"><span>three</span></p>
        </li>
    </ul>
    <br />
    <p dir="ltr"><span>Another line of text.</span></p>
    </b>
  2. Here it is after pasteFilter has been applied:

    <strong>
    <p>This is bold text.</p>
    <br>
    <p>This is normal text but this is italic.</p>
    <br>
    <p>A line</p>
    <p>Another line without hitting enter twice.</p>
    <br>
    <p>What about bold italic?</p>
    <h2>This is headline 2.</h2>
    <br>
    <p>This is a bullet list:</p>
    <br>
    <ul>
        <li>
            <p>one</p>
        </li>
        <li>
            <p>two is italic</p>
        </li>
        <li>
            <p>three</p>
        </li>
    </ul>
    <br>
    <p>Another line of text.</p>
    </strong>

And here it is after TinyMCE inserts it into the editor. Meaning, it's gone through TinyMCE's content filtering rules, which disallow things like block-level elements wrapped in inline elements. That is why the outer <strong> is gone, though TinyMCE did apply it to the empty paragraphs:

<p>This is bold text.</p>
<p><strong>&nbsp;</strong></p>
<p>This is normal text but this is italic.</p>
<p><strong>&nbsp;</strong></p>
<p>A line</p>
<p>Another line without hitting enter twice.</p>
<p><strong>&nbsp;</strong></p>
<p>What about bold italic?</p>
<h2>This is headline 2.</h2>
<p><strong>&nbsp;</strong></p>
<p>This is a bullet list:</p>
<p><strong>&nbsp;</strong></p>
<ul>
    <li>
        <p>one</p>
    </li>
    <li>
        <p>two is italic</p>
    </li>
    <li>
        <p>three</p>
    </li>
</ul>
<p><strong>&nbsp;</strong></p>
<p>Another line of text.</p>

The part that we've got some control over is what converts the original input (1) to 2 above. But it looks to me like we might have a garbage-in-garbage-out scenario here, at least with regard to the bold and italic.

I'll have a look at the second bit of code next.
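
As a rough sketch of one way to deal with the wrapper seen in the input above, the outer `<b id="docs-internal-guid-...">` element could be removed from the raw string before it is converted to `<strong>`. The function name `unwrapGoogleDocs` is hypothetical, and the code assumes the wrapper's closing `</b>` is the last tag in the pasted fragment, as it is in both samples in this thread. It does not recover the missing bold/italic information.

```javascript
// Hypothetical helper; not part of ProcessWire's pasteFilter.
function unwrapGoogleDocs(html) {
    // Opening wrapper: <b ... id="docs-internal-guid-..." ...>
    const open = /<b\b[^>]*\bid="docs-internal-guid-[^"]*"[^>]*>/i;
    if (!open.test(html)) return html; // not Google Docs markup, leave untouched
    return html
        .replace(/<meta\b[^>]*>/i, '')  // drop the leading <meta charset> tag
        .replace(open, '')              // drop the wrapper's opening tag
        .replace(/<\/b>\s*$/i, '')      // assumed: wrapper's </b> ends the fragment
        .trim();
}

// Shortened version of the Google Docs input shown in this thread
const pasted = '<meta charset="utf-8"><b id="docs-internal-guid-39578836">' +
    '<p dir="ltr"><span>one</span></p></b>';
console.log(unwrapGoogleDocs(pasted));
```

With the wrapper gone there is no block-level content nested inside an inline element, so there should be no stray `<strong>` left for TinyMCE to redistribute.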

ryancramerdesign commented 7 months ago

@jlahijani Here's the same data for example 2:

  1. Original input
    <meta charset="utf-8">
    <b id="docs-internal-guid-e372d8f2-7fff-6b68-3080-4c08a524fa8d">
    <p dir="ltr"><span>bla bla bla&nbsp;</span></p><br />
    <p dir="ltr"><span>this is a line of text, then the [enter] key is pressed</span></p>
    <p dir="ltr"><span>here is the second line</span></p><br />
    <p dir="ltr">
    <span>this is a line of text, then the [shift+enter] keys are pressed</span>
    <span><br /></span><span>here is the second line</span>
    </p><br />
    <p dir="ltr"><span>bla bla bla</span></p>
    </b>
  2. After pasteFilter:
    <strong>
    <p>bla bla bla </p><br>
    <p>this is a line of text, then the [enter] key is pressed</p>
    <p>here is the second line</p><br>
    <p>
    this is a line of text, then the [shift+enter] keys are pressed<br>
    here is the second line
    </p><br>
    <p>bla bla bla</p>
    </strong>
  3. After TinyMCE applies its rules:
    <p>bla bla bla</p>
    <p><strong>&nbsp;</strong></p>
    <p>this is a line of text, then the [enter] key is pressed</p>
    <p>here is the second line</p>
    <p><strong>&nbsp;</strong></p>
    <p>
    this is a line of text, then the [shift+enter] keys are pressed<br>
    here is the second line
    </p>
    <p><strong>&nbsp;</strong></p>
    <p>bla bla bla</p>

    I'm thinking pasteFilter should replace </p><br> with </p>, which should hopefully prevent TinyMCE from inserting those empty paragraphs.
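
A minimal sketch of that replacement on the raw HTML string, assuming the `<br>` may also appear as `<br/>` or `<br />` and may be separated from `</p>` by whitespace. The function name `dropBreakAfterParagraph` is hypothetical and this is not the actual pasteFilter change.

```javascript
// Hypothetical helper; not part of ProcessWire's pasteFilter.
function dropBreakAfterParagraph(html) {
    // Matches </p>, optional whitespace, then <br>, <br/> or <br />
    return html.replace(/<\/p>\s*<br\s*\/?>/gi, '</p>');
}

// A <br> inside a paragraph (shift+enter) is left alone;
// only a <br> directly following </p> is removed.
const filtered = '<p>first line<br>second line</p><br>\n<p>bla bla bla</p>';
console.log(dropBreakAfterParagraph(filtered));
```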