vsch / flexmark-java

CommonMark/Markdown Java parser with source level AST. CommonMark 0.28, emulation of: pegdown, kramdown, markdown.pl, MultiMarkdown. With HTML to MD, MD to PDF, MD to DOCX conversion modules.
BSD 2-Clause "Simplified" License

HTML to Markdown table converter #576

Closed alexxkovalchuk closed 8 months ago

alexxkovalchuk commented 1 year ago

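Converting an HTML table to Markdown does not work properly in 0.64.8: if the table has no <th> cells, the separator row is placed above the first row instead of below it.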

HTML table:

<table>
  <tr>
    <td></td>
    <td><strong>Hello</strong></td>
    <td><strong>World</strong></td>
  </tr>
  <tr>
    <td><strong>a</strong></td>
    <td>a1</td>
    <td>a2</td>
  </tr>
  <tr>
    <td><strong>b</strong></td>
    <td>b1</td>
    <td>b2</td>
  </tr>
  <tr>
    <td><strong>c</strong></td>
    <td>c1</td>
    <td>c2</td>
  </tr>
</table>

Current result:

|-------|-----------|-----------|
|       | **Hello** | **World** |
| **a** | a1        | a2        |
| **b** | b1        | b2        |
| **c** | c1        | c2        |

Desired result:

|       | **Hello** | **World** |
|-------|-----------|-----------|
| **a** | a1        | a2        |
| **b** | b1        | b2        |
| **c** | c1        | c2        |

The conversion only works when the header cells use the <th> HTML tag.

To Reproduce

import com.vladsch.flexmark.html2md.converter.FlexmarkHtmlConverter;

String table = "<table>\n" +
                "  <tr>\n" +
                "    <td></td>\n" +
                "    <td><strong>Hello</strong></td>\n" +
                "    <td><strong>World</strong></td>\n" +
                "  </tr>\n" +
                "  <tr>\n" +
                "    <td><strong>a</strong></td>\n" +
                "    <td>a1</td>\n" +
                "    <td>a2</td>\n" +
                "  </tr>\n" +
                "  <tr>\n" +
                "    <td><strong>b</strong></td>\n" +
                "    <td>b1</td>\n" +
                "    <td>b2</td>\n" +
                "  </tr>\n" +
                "  <tr>\n" +
                "    <td><strong>c</strong></td>\n" +
                "    <td>c1</td>\n" +
                "    <td>c2</td>\n" +
                "  </tr>\n" +
                "</table>";

String md = FlexmarkHtmlConverter.builder().build().convert(table);

ghost commented 1 year ago

I don't think it's a good idea for the converter to do that, because it precludes the possibility of making a table whose first row has two bold items. If you want to do preprocessing, I'd suggest writing a custom XSLT transformer that corrects the HTML by introducing the <th></th> semantic markup.

Similarly, if the HTML produced by the markdown tables you've shown doesn't include <th></th> then that problem should be fixed (i.e., generate semantically correct HTML from Markdown tables).
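
The XSLT preprocessing suggested above could be sketched roughly as follows. This is only a minimal sketch and not part of flexmark-java: the class name HeaderRowFixer, the method promoteFirstRow, and the stylesheet itself are hypothetical. It assumes the input is well-formed XML (as the table in this issue is), because the JDK's XSLT processor cannot parse arbitrary HTML, and it assumes the first row of any table without <th> cells is meant to be the header.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper, not a flexmark-java API.
final class HeaderRowFixer {
    // Identity transform plus one rule: in a table that contains no <th>,
    // rename the <td> cells of the first row to <th>.
    private static final String STYLESHEET = String.join("\n",
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>",
        "  <xsl:output method='html' indent='no'/>",
        "  <xsl:template match='@*|node()'>",
        "    <xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>",
        "  </xsl:template>",
        "  <xsl:template match='table[not(.//th)]/tr[1]/td | table[not(.//th)]/tbody[1]/tr[1]/td'>",
        "    <th><xsl:apply-templates select='@*|node()'/></th>",
        "  </xsl:template>",
        "</xsl:stylesheet>");

    // Returns the markup with the first row promoted to header cells.
    // Assumption: the input must be well-formed XML, otherwise this throws.
    static String promoteFirstRow(String xhtml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(xhtml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(
                "Input is not well-formed XML or the stylesheet failed", e);
        }
    }
}
```

The result of HeaderRowFixer.promoteFirstRow(table) could then be passed to FlexmarkHtmlConverter.builder().build().convert(...) in place of the original string. Tables that already contain a <th> are left untouched, so a deliberately bold first data row is still possible by marking up the real header explicitly.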