Closed holgerbrandl closed 7 years ago
@holgerbrandl, pasting into a Markdown document checks whether the clipboard has a "text/html" content representation. Something in the HTML for this file triggers a bug in the HTML to Markdown parser.
I don't have MS Word 15 for OS X, so in the next release I will add text/html trace functionality that dumps the text to the log before the HTML is parsed. That way I can see what is being passed in and debug the HTML to Markdown converter.
I'll post an update here.
Just let me know when I should retest it. In general, pasting from MS Office products under macOS seems a bit off. For example, when pasting from Excel, my test table was copied and ended up totally wrong in MD, even though the IJ clipboard detected a correct plain-text version of my current clipboard.
But for sure broken formatting is still way better than a hard IDE crash. :-)
@holgerbrandl, the hang is caused by a bug that puts the parser into an infinite loop: it fails to skip an element that it does not recognize. I am building a version that allows pasting the HTML as-is, without converting it to Markdown, so that your paste will give me the HTML I need for debugging.
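The kind of loop bug described above can be illustrated with a minimal sketch. This is not the plugin's or flexmark-java's code; the node representation and the `convert_nodes` helper are hypothetical, and Python is used only for brevity. The point is that the cursor must advance even when an element is not recognized:

```python
def convert_nodes(nodes):
    """Convert a flat list of (tag, text) pairs to Markdown lines.

    Hypothetical data model for illustration only.
    """
    handlers = {
        "p": lambda text: text,
        "li": lambda text: "* " + text,
    }
    out = []
    i = 0
    while i < len(nodes):
        tag, text = nodes[i]
        handler = handlers.get(tag)
        if handler is not None:
            out.append(handler(text))
        # Advance unconditionally. If this increment only happened inside the
        # "recognized" branch, an unknown tag would leave i unchanged and the
        # loop would never terminate.
        i += 1
    return out
```

An unknown element such as Word's `o:p` is then simply dropped instead of stalling the conversion.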
It will be useful when you don't want to convert to Markdown right away. The HTML to Markdown intention can be used to make the conversion later, giving the user a chance to fix up some HTML quirks.
If you could provide the "HTML" that Office generates for your use cases, using the new version, I will be able to fix (I hope) the HTML parser to recognize it for proper Markdown conversion.
My last version of MS Office for Mac was from 2009. I did not upgrade it because I don't use MS products anymore.
Your crash file, when opened in LibreOffice, resulted in the following HTML. Note that the list item text is outside the <li></li> tags, so the HTML converter used to output just the bullet markers. I fixed the "bug", and the converter now treats p tags that are inside a list but not inside a list item as list items.
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
→<meta http-equiv="content-type" content="text/html; charset=utf-8"/>
→<title></title>
→<meta name="generator" content="LibreOffice 5.2.3.3 (MacOSX)"/>
→<style type="text/css">
→→@page { margin: 2cm }
→→p { margin-bottom: 0.25cm; direction: ltr; line-height: 120%; text-align: left; orphans: 2; widows: 2 }
→</style>
</head>
<body lang="de-DE" dir="ltr">
<ul>
→<li/>
<p style="margin-bottom: 0cm; line-height: 100%"><span lang="en-US">Equipment</span></p>
→<li/>
<p style="margin-bottom: 0cm; line-height: 100%"><span lang="en-US">Chemicals</span></p>
→<li/>
<p style="margin-bottom: 0cm; line-height: 100%"><span lang="en-US">Consumables</span></p>
→<li/>
<p style="margin-bottom: 0cm; line-height: 100%"><span lang="en-US">Enzymes</span></p>
→<li/>
<p style="margin-bottom: 0cm; line-height: 100%"><span lang="en-US">GMO</span></p>
→<li/>
<p style="margin-bottom: 0cm; line-height: 100%"><span lang="en-US">Antibodies</span></p>
→<li/>
<p style="margin-bottom: 0cm; line-height: 100%"><span lang="en-US">DNA
→Constructs</span></p>
→<li/>
<p style="margin-bottom: 0cm; line-height: 100%"><span lang="en-US">RNA
→Constructs</span></p>
→<li/>
<p style="margin-bottom: 0cm; line-height: 100%"><span lang="en-US">Vectors</span></p>
→<li/>
<p style="margin-bottom: 0cm; line-height: 100%"><span lang="en-US">Oligos</span></p>
</ul>
</body>
</html>
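A minimal sketch of the rule described above (a p element inside a list but outside any list item becomes a list item), using Python's standard `html.parser`. The class and function names are hypothetical, and this is not the flexmark-java implementation; it only handles this one case and ignores text inside regular list items:

```python
from html.parser import HTMLParser


class ListParagraphConverter(HTMLParser):
    """Collect Markdown bullets from <p> elements that sit directly in a list."""

    def __init__(self):
        super().__init__()
        self.list_depth = 0       # nesting level of <ul>/<ol>
        self.item_depth = 0       # number of currently open <li> elements
        self.in_orphan_p = False  # inside a <p> in a list but not in an <li>
        self.buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol"):
            self.list_depth += 1
        elif tag == "li":
            self.item_depth += 1
        elif tag == "p" and self.list_depth > 0 and self.item_depth == 0:
            self.in_orphan_p = True
            self.buffer = []

    def handle_endtag(self, tag):
        if tag in ("ul", "ol"):
            self.list_depth -= 1
        elif tag == "li":
            self.item_depth -= 1
        elif tag == "p" and self.in_orphan_p:
            # Collapse line breaks and tabs inside the paragraph text.
            text = " ".join("".join(self.buffer).split())
            self.items.append("* " + text)
            self.in_orphan_p = False

    def handle_data(self, data):
        if self.in_orphan_p:
            self.buffer.append(data)


def orphan_paragraphs_to_bullets(html):
    parser = ListParagraphConverter()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.items)


sample = """<ul>
\t<li/>
<p style="margin-bottom: 0cm"><span lang="en-US">Equipment</span></p>
\t<li/>
<p style="margin-bottom: 0cm"><span lang="en-US">DNA
\tConstructs</span></p>
</ul>"""

print(orphan_paragraphs_to_bullets(sample))
```

Python's parser reports `<li/>` as a start tag immediately followed by an end tag, so each list item is already closed when its paragraph begins, which is exactly the situation in the LibreOffice output.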
I'm not sure if you should even try to overcome issues in text/html representations. No parser in the world will cover all imaginable versions of broken HTML. The example above looks like one of those broken cases to me. Falling back to plain-text pasting in case the text/html representation is not well-formed (with respect to e.g. W3C markup validation) seems totally legit to me.
@holgerbrandl, I agree. Next version will have this disabled. However, it is very convenient to have it working, especially from the browser. So now it is a configurable option that can be enabled when needed. Detecting when the HTML is too malformed to use is also difficult.
An EAP is available with two new settings: one controls text/html detection on the clipboard, which is now disabled by default, and the other allows disabling the HTML to Markdown conversion. If you could please post the HTML content for the hang from Word and also for the table copy/paste from Excel, I will address these.
New Settings:
Add: Markdown application settings for:
Use clipboard text/html content when available: disabled by default, enabling it will allow pasting text/html when available
Convert HTML content to Markdown: enabled by default, disabling will paste text/html content without conversion to Markdown
@holgerbrandl, I was able to get this duplicated by installing my older version of MS Office. The EAP is updated; please let me know if your issues have been resolved.
flexmark-java library used for parsing updated with fixes for the hang and the incorrect table parsing from Excel:
Fix: #76, HTML to Markdown hangs if comments included in Text nodes
Add: MS-Word generated HTML list basic recognition: 1., 1), A., A), a., a), IV., IV), iv., iv)
Fix: MS-Excel generated HTML table parsing bug
Fix: replace the regex used for extracting HTML comments from HTML blocks with a manual search. The regex would go into an infinite loop on MS Word created HTML.
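The last fix above swaps a regex for a manual search. A rough sketch of what such a manual comment search can look like, assuming plain substring search with `str.find`; the function names are hypothetical and this is not the flexmark-java code (which is Java):

```python
def find_html_comments(text):
    """Return (start, end) index pairs of <!-- ... --> comments in text."""
    spans = []
    pos = 0
    while True:
        start = text.find("<!--", pos)
        if start == -1:
            break
        close = text.find("-->", start + 4)
        if close == -1:
            # Unterminated comment: stop instead of rescanning the input.
            break
        end = close + 3
        spans.append((start, end))
        pos = end  # always move past the comment, so every pass makes progress
    return spans


def strip_html_comments(text):
    """Remove all terminated HTML comments from text."""
    out = []
    pos = 0
    for start, end in find_html_comments(text):
        out.append(text[pos:start])
        pos = end
    out.append(text[pos:])
    return "".join(out)
```

Each search resumes where the previous one ended, so the input is scanned from left to right without any backtracking, and Word's conditional comments such as `<!--[if gte mso 9]>...<![endif]-->` are handled like any other comment.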
@holgerbrandl, now this is the result from Excel:
The HTML needs to be converted to Markdown automatically, because it contains a lot of blank lines that break up the HTML blocks if it is inserted as raw HTML into Markdown. Select "Convert HTML content to Markdown" in settings/preferences to get this on paste:
The table header needs to be moved manually using the Move Line Up action in the IDE. Excel does not use <thead></thead>, so all rows are body rows.
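To illustrate the missing <thead> problem, here is a small sketch that converts a header-less HTML table to a Markdown table. It makes one loud assumption that the plugin, as described above, does not make: the first row is promoted to the header row. All names are hypothetical, and pipe characters inside cells are not escaped:

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of every <tr> as a list of strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.cell = None  # list of text fragments while inside <td>/<th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self.cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.cell is not None:
            self.rows[-1].append(" ".join("".join(self.cell).split()))
            self.cell = None

    def handle_data(self, data):
        if self.cell is not None:
            self.cell.append(data)


def table_to_markdown(html):
    parser = TableRows()
    parser.feed(html)
    parser.close()
    rows = [row for row in parser.rows if row]
    if not rows:
        return ""
    width = max(len(row) for row in rows)
    rows = [row + [""] * (width - len(row)) for row in rows]
    # ASSUMPTION: without <thead>, treat the first row as the header.
    header, body = rows[0], rows[1:]
    lines = ["| " + " | ".join(header) + " |",
             "|" + "|".join(["---"] * width) + "|"]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)
```

Whether promoting the first row is correct depends on the spreadsheet, which is presumably why a manual step is a reasonable default.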
@holgerbrandl, here is the result of paste with conversion to Markdown of your crash file:
Numbered lists:
The Word paste crash issue is solved for me as well. The inserted bits use a tab instead of a space, which breaks the list rendering. However, for me it's perfect already and easy to correct with column edit.
The Excel table paste now brings up the "paste image" dialog instead. I'm not sure whether Excel has changed in the meantime to also provide an image representation, or whether the plugin handles the clipboard differently now.
@holgerbrandl, I will address both the tab and the image paste pop-up. I had the image dialog pop up once but was not able to duplicate it, so I assumed it was a glitch. I will ignore the image on the clipboard if the HTML option is enabled and HTML is available on the clipboard.
The tabs I will convert to spaces during HTML to Markdown conversion.
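A minimal sketch of such a tab-to-space normalization, assuming the tab directly follows a bullet or one of the marker styles listed earlier in the thread (1., 1), A., A), a., a), IV., IV), iv., iv)). The regex and function name are illustrative guesses, not the plugin's code:

```python
import re

# A bullet (*, +, -) or a label (digits, roman numerals, single letter)
# followed by "." or ")", then the whitespace run that separates the text.
LIST_MARKER = re.compile(
    r"^(?P<indent>[ \t]*)"
    r"(?P<marker>[*+-]|(?:\d+|[IVXLCDM]+|[ivxlcdm]+|[A-Za-z])[.)])"
    r"[ \t]+"
)


def normalize_marker(line):
    """Replace the tabs/spaces after a list marker with a single space."""
    match = LIST_MARKER.match(line)
    if match is None:
        return line  # not a list line: leave it untouched
    rest = line[match.end():]
    return match.group("indent") + match.group("marker") + " " + rest


for raw in ["1.\tEquipment", "iv)\t\tVectors", "*\tOligos", "Plain\ttext"]:
    print(repr(normalize_marker(raw)))
```

The pattern is a heuristic: a sentence that happens to start with "I. " would also be treated as a roman-numeral item.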
@holgerbrandl, the list conversion you are seeing is the standard IDE text paste, not the plugin's HTML to Markdown conversion. If you enable the two options for HTML clipboard handling, then you will get:
You can convert it to a tight list using the toolbar button:
Oh, now it works. Not sure why I disabled the option in the first place. Potentially to work around the now-fixed crash bug, I guess. The list compression button is very handy.
You also have an intention to clean up empty list items, sometimes HTML to Markdown can create these:
To get:
@holgerbrandl, an EAP has been released with a fix that gives MIME text/html clipboard content higher priority than images if "Use clipboard text/html content when available" is enabled. This takes care of an Excel-copied table being pasted as an image instead of a Markdown table.
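The resulting priority can be sketched as a small decision function. The function, its return values, and the fallback order for images versus plain text are assumptions made for illustration; only the two settings and their defaults come from this thread:

```python
def choose_paste(available, use_html=False, convert_html=True):
    """Pick a clipboard representation and what to do with it.

    available    -- set of MIME types currently on the clipboard
    use_html     -- "Use clipboard text/html content when available"
    convert_html -- "Convert HTML content to Markdown"
    """
    if use_html and "text/html" in available:
        # text/html wins over images when the option is enabled.
        action = "convert-to-markdown" if convert_html else "insert-html"
        return "text/html", action
    images = sorted(f for f in available if f.startswith("image/"))
    if images:
        # ASSUMPTION: images are offered before plain text otherwise.
        return images[0], "paste-image"
    if "text/plain" in available:
        return "text/plain", "insert-text"
    return None, "nothing"


excel_copy = {"text/plain", "text/html", "image/png"}
print(choose_paste(excel_copy))                 # HTML option off
print(choose_paste(excel_copy, use_html=True))  # HTML option on
```

With the HTML option off, a clipboard that also carries an image falls through to the image branch, which matches the "paste image" dialog reported earlier for Excel.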
I think that this version should work with Excel tables for your Excel version. Tables with a lot of cell formatting may need some editing.
If your Excel table paste is not correct, please turn off the "Convert HTML content to Markdown" option temporarily so you can paste the actual HTML content and post it here so I can fix it. I only have Office 2011 for Mac, so my HTML may differ from yours.
Thanks Vladimir for your amazing support. Both features work as described now.
It was quite a ride, but I think the issue is resolved now. :-)
Complexity of creating full featured JetBrains plugins is greatly underestimated. 😄
Not sure why, but the plugin is crashing IJ (as in becoming unresponsive and requiring a forced quit) when pasting the contents of the attached Word file into the Markdown editor.
MacOS 10.12.3
MS Word 15.30
IntelliJ IDEA 2016.3.5
Build #IU-163.13906.18, built on March 6, 2017
Licensed to Max-Planck-Institut fuer Molekulare Zellbiologie und Genetik / holger brandl
You have a perpetual fallback license for this version
Subscription is active until January 25, 2018
JRE: 1.8.0_112-release-408-b6 x86_64
JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o
MN version: 2.3.4.8
I can paste the same content into a plain text editor within IJ without any problems.
ij_md_crash.docx