What is the best way to extract tables from Markdown?

ishandutta2007 commented 10 months ago

I mean I am looking for something like this:

const md = `

some blabla text
Here is a table:

| a | b |
|---|---|
| 1 | 2 |
| ^ | 3 |

more text outside

`;

// getMarkdownTables is the API I wish existed, not a real function
const tables = getMarkdownTables(md);
console.log(tables[0].header);
console.log(tables[0].rows.length);
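
For simple documents, the wished-for API above can be approximated without any parser. The following is only a minimal sketch under stated assumptions: `getMarkdownTables`, `splitRow`, and `startsTable` are hypothetical names, the scanner is a plain line-by-line check for a header row followed by a delimiter row, and it does not skip fenced code blocks or handle every GFM edge case. Merge markers such as `^` are returned as ordinary cell text.

```javascript
// Hypothetical helpers: a plain line scanner, not part of any library.

// Split one table line into trimmed cell strings.
function splitRow(line) {
  // Drop the optional leading and trailing pipe, then split on unescaped pipes.
  const inner = line.trim().replace(/^\|/, '').replace(/\|$/, '');
  return inner.split(/(?<!\\)\|/).map((cell) => cell.trim());
}

// A table starts with a header line followed by a delimiter line
// (cells like ---, :--, --:) that has the same number of cells.
function startsTable(headerLine, delimiterLine) {
  if (!headerLine.includes('|')) return false;
  const delimiterCells = splitRow(delimiterLine);
  return (
    delimiterCells.length === splitRow(headerLine).length &&
    delimiterCells.every((cell) => /^:?-+:?$/.test(cell))
  );
}

function getMarkdownTables(md) {
  const lines = md.split(/\r?\n/);
  const tables = [];
  let i = 0;
  while (i < lines.length - 1) {
    if (!startsTable(lines[i], lines[i + 1])) {
      i += 1;
      continue;
    }
    const header = splitRow(lines[i]);
    const rows = [];
    i += 2; // skip the header and delimiter lines
    // Body rows continue until a blank line or a line without a pipe.
    while (i < lines.length && lines[i].trim() !== '' && lines[i].includes('|')) {
      rows.push(splitRow(lines[i]));
      i += 1;
    }
    tables.push({ header, rows });
  }
  return tables;
}

const sample = 'some text\n\n| a | b |\n|---|---|\n| 1 | 2 |\n| ^ | 3 |\n\nmore text';
const found = getMarkdownTables(sample);
console.log(found[0].header);
console.log(found[0].rows.length);
```

A real Markdown parser remains the more robust choice; this sketch is only meant to show the shape of the result being asked for.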

wataru-chocola commented 10 months ago

You can use mdast-extended-table in this repo, which converts Markdown into an mdast table. Be aware, though, that an mdast table doesn't distinguish the table header from the other table rows.
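
As a minimal sketch of working with such a tree: in mdast, a `table` node contains `tableRow` children, each with `tableCell` children, and the first row is conventionally the header row. The helper names `nodeText` and `collectTables` are hypothetical, and the tree below is a hand-written stand-in for parser output (plain GFM shape, where `^` is ordinary text), not the output of this plugin.

```javascript
// Hypothetical helpers for walking an mdast tree; not part of any library.

// Concatenate the text of a node and all of its descendants.
function nodeText(node) {
  if (typeof node.value === 'string') return node.value;
  return (node.children ?? []).map(nodeText).join('');
}

// Collect every table node, treating its first row as the header.
function collectTables(node, found = []) {
  if (node.type === 'table') {
    const [headerRow, ...bodyRows] = node.children ?? [];
    if (!headerRow) return found; // defensive: ignore a table without rows
    const toCells = (row) => row.children.map((cell) => nodeText(cell).trim());
    found.push({ header: toCells(headerRow), rows: bodyRows.map(toCells) });
    return found;
  }
  for (const child of node.children ?? []) collectTables(child, found);
  return found;
}

// Hand-written stand-in for the tree a GFM-aware parser would return.
const cell = (value) => ({ type: 'tableCell', children: [{ type: 'text', value }] });
const row = (...values) => ({ type: 'tableRow', children: values.map(cell) });
const tree = {
  type: 'root',
  children: [
    { type: 'paragraph', children: [{ type: 'text', value: 'Here is a table:' }] },
    { type: 'table', align: [null, null], children: [row('a', 'b'), row('1', '2'), row('^', '3')] },
  ],
};

console.log(collectTables(tree));
```

How merged cells are represented after the plugin has processed them is not covered here; the sketch only reads cell text.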

Alternatively, you can convert the Markdown to HTML and then parse the HTML to get a DOM table:

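// Path inside this repo; when installed from npm, this would presumably be 'remark-extended-table'.
import { remarkExtendedTable, extendedTableHandlers } from './lib/index.js';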

import { unified } from 'unified';
import remarkParse from 'remark-parse';
import remarkRehype from 'remark-rehype';
import rehypeStringify from 'rehype-stringify';
import remarkGfm from 'remark-gfm';

const process = (md, options, gfmOptions) =>
  unified()
    .use(remarkParse)
    .use(remarkGfm, gfmOptions)
    .use(remarkExtendedTable, { ...options, ...gfmOptions })
    .use(remarkRehype, { handlers: extendedTableHandlers })
    .use(rehypeStringify)
    .process(md);

const md = `
| a | b | c |
|---|---|---|
| > | 1 | 2 |
| ^ | ^ | 3 |
`;

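// Top-level await requires an ES module context.
const tableHtml = (await process(md)).value;
// DOMParser is a browser API; outside the browser, an HTML parser such as jsdom is needed instead.
const parser = new DOMParser();
const parsedHTML = parser.parseFromString(tableHtml, 'text/html');
const table = parsedHTML.getElementsByTagName('table')[0];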

ishandutta2007 commented 10 months ago

@wataru-chocola Thanks for the quick reply. However, it returns an empty table for me. I can't find any <p> or <div> tags in parsedHTML either.

Ideally, HTML parsing should do the job, but the issue I found with this approach is that most of the time the Markdown table is parsed as a paragraph element rather than a table element. This is my hackish approach. My hope is that it might be solvable with remark-extended-table.

ishandutta2007 commented 10 months ago

@wataru-chocola By the way, I didn't run your code as is. I had to change unified.process() to unified.parse() to get it working (I am not sure why this is required only for me; I got the tip from here). I'm mentioning it in case it has anything to do with the empty parsedHTML.

wataru-chocola commented 10 months ago

> Ideally, HTML parsing should do the job, but the issue I found with this approach is that most of the time the Markdown table is parsed as a paragraph element rather than a table element. This is my hackish approach.

It sounds like remark-gfm or my plugin isn't working. Can you share your package.json?

> By the way, I didn't run your code as is. I had to change unified.process() to unified.parse() to get it working

To the best of my knowledge, unified.parse() only converts text into an AST representation (e.g. mdast). unified.process() parses the text and then converts the AST into another text representation (e.g. Markdown -> mdast -> hast -> HTML). Since parse() returns a tree rather than an HTML string, that may be why your parsedHTML is empty.

ishandutta2007 commented 9 months ago

> Can you give me your package.json?


  "dependencies": {
    ..........
    ..........
    "rehype-highlight": "^6.0.0",
    "rehype-katex": "^7.0.0",
    "rehype-sanitize": "^5.0.1",
    "rehype-stringify": "^9.0.3",
    "remark-breaks": "^3.0.3",
    "remark-gfm": "^3.0.1",
    "remark-math": "^5.1.1",
    "remark-parse": "^10.0.2",
    "remark-rehype": "^10.1.0",
    "remark-supersub": "^1.0.0",
    "unified": "^10.1.2"
  }

I hope an improper version combination turns out to be the root cause.

> To the best of my knowledge, unified.parse() only converts some text into AST representation (e.g. mdast).

Well, then maybe you can suggest a better fix. I have posted a Stack Overflow question, but nobody has suggested anything better yet.

wataru-chocola commented 9 months ago

The latest version of remark-extended-table declares the following peerDependencies:

  "peerDependencies": {
    "remark-gfm": "^4.0.0",
    "remark-parse": "^11.0.0",
    "unified": "^11.0.0"
  },

unified and the remark-* packages made breaking changes in their major releases, so this code doesn't work with older unified versions.
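
The mismatch can be spotted mechanically by comparing major versions. The following is a rough sketch, not a full semver implementation: `majorOf` and `findMajorMismatches` are hypothetical helpers that only understand simple ranges such as `^10.1.2`, and the version strings are the ones quoted in this thread.

```javascript
// Hypothetical helpers: compare only the major number of simple ranges like "^10.1.2".

function majorOf(range) {
  const match = /(\d+)\./.exec(range);
  return match ? Number(match[1]) : null;
}

// Return the packages whose declared major version differs from the peer requirement.
function findMajorMismatches(dependencies, peerDependencies) {
  const mismatches = [];
  for (const [name, wanted] of Object.entries(peerDependencies)) {
    const installed = dependencies[name];
    if (installed === undefined) continue; // not declared at all; out of scope for this sketch
    if (majorOf(installed) !== majorOf(wanted)) {
      mismatches.push({ name, installed, wanted });
    }
  }
  return mismatches;
}

// Versions from the package.json posted above.
const dependencies = {
  'remark-gfm': '^3.0.1',
  'remark-parse': '^10.0.2',
  unified: '^10.1.2',
};
// peerDependencies of the latest remark-extended-table, as listed above.
const peerDependencies = {
  'remark-gfm': '^4.0.0',
  'remark-parse': '^11.0.0',
  unified: '^11.0.0',
};

console.log(findMajorMismatches(dependencies, peerDependencies));
```

With these inputs all three packages are reported, since each is one major version behind the peer requirement.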

wataru-chocola / remark-extended-table

What is the best way to extract tables from markdown ? #139