r-lib / commonmark

High Performance CommonMark and Github Markdown Rendering in R
https://docs.ropensci.org/commonmark/

How to handle empty lines in md #8

Open maelle opened 6 years ago

maelle commented 6 years ago

I am trying to parse this Markdown file

It's full of empty lines, I guess because knitr rendered it from the Rmd. It renders well on GitHub, but when I parse it I cannot recover the structure that's in the .Rmd: either the table is split into many separate blocks or, if I remove the empty lines, it gets glued to the rest of the README.

rmd <- "https://raw.githubusercontent.com/ropensci/drake/master/README.Rmd"

md <- "https://raw.githubusercontent.com/ropensci/drake/master/README.md"

library("magrittr")
rmd %>%
  readLines() %>%
  commonmark::markdown_xml(extensions = TRUE) %>%
  xml2::read_xml()
#> {xml_document}
#> <document xmlns="http://commonmark.org/xml/1.0">
#>  [1] <thematic_break/>
#>  [2] <heading level="2">\n  <text>output:</text>\n  <softbreak/>\n  <tex ...
#>  [3] <html_block>&lt;!-- README.md is generated from README.Rmd. Please  ...
#>  [4] <code_block info="{r knitrsetup, echo = FALSE}">knitr::opts_chunk$s ...
#>  [5] <code_block info="{r mainexample, echo = FALSE}">suppressMessages(s ...
#>  [6] <html_block>&lt;center&gt;\n&lt;img src="https://ropensci.github.io ...
#>  [7] <html_block>&lt;table class="table"&gt;&lt;thead&gt;&lt;tr class="h ...
#>  [8] <heading level="1">\n  <text>The drake R package </text>\n  <html_i ...
#>  [9] <paragraph>\n  <code>drake</code>\n  <text> — or, Data Frames in R  ...
#> [10] <heading level="1">\n  <text>What gets done stays done.</text>\n</h ...
#> [11] <paragraph>\n  <text>Too many data science projects follow a </text ...
#> [12] <list type="ordered" start="1" delim="period" tight="true">\n  <ite ...
#> [13] <paragraph>\n  <text>It is hard to avoid restarting from scratch.</ ...
#> [14] <html_block>&lt;center&gt;\n&lt;a href="https://twitter.com/fossilo ...
#> [15] <paragraph>\n  <text>With </text>\n  <code>drake</code>\n  <text>,  ...
#> [16] <list type="ordered" start="1" delim="period" tight="true">\n  <ite ...
#> [17] <heading level="1">\n  <text>How it works</text>\n</heading>
#> [18] <paragraph>\n  <text>To set up a project, load your packages,</text ...
#> [19] <code_block info="{r mainpackages}">library(drake)\nlibrary(dplyr)\ ...
#> [20] <paragraph>\n  <text>load your custom functions,</text>\n</paragraph>
#> ...

md %>%
  readLines() %>%
  commonmark::markdown_xml(extensions = FALSE) %>%
  xml2::read_xml()
#> {xml_document}
#> <document xmlns="http://commonmark.org/xml/1.0">
#>  [1] <html_block>&lt;!-- README.md is generated from README.Rmd. Please  ...
#>  [2] <html_block>&lt;center&gt;\n</html_block>
#>  [3] <html_block>&lt;img src="https://ropensci.github.io/drake/images/in ...
#>  [4] <html_block>&lt;/center&gt;\n</html_block>
#>  [5] <html_block>&lt;table class="table"&gt;\n</html_block>
#>  [6] <html_block>&lt;thead&gt;\n</html_block>
#>  [7] <html_block>&lt;tr class="header"&gt;\n</html_block>
#>  [8] <html_block>&lt;th align="left"&gt;\n</html_block>
#>  [9] <paragraph>\n  <text>Release</text>\n</paragraph>
#> [10] <html_block>&lt;/th&gt;\n</html_block>
#> [11] <html_block>&lt;th align="left"&gt;\n</html_block>
#> [12] <paragraph>\n  <text>Usage</text>\n</paragraph>
#> [13] <html_block>&lt;/th&gt;\n</html_block>
#> [14] <html_block>&lt;th align="left"&gt;\n</html_block>
#> [15] <paragraph>\n  <text>Development</text>\n</paragraph>
#> [16] <html_block>&lt;/th&gt;\n</html_block>
#> [17] <html_block>&lt;/tr&gt;\n</html_block>
#> [18] <html_block>&lt;/thead&gt;\n</html_block>
#> [19] <html_block>&lt;tbody&gt;\n</html_block>
#> [20] <html_block>&lt;tr class="odd"&gt;\n</html_block>
#> ...

md %>%
  readLines() %>%
  .[. != ""] %>%
  commonmark::markdown_xml(extensions = FALSE) %>%
  xml2::read_xml()
#> {xml_document}
#> <document xmlns="http://commonmark.org/xml/1.0">
#> [1] <html_block>&lt;!-- README.md is generated from README.Rmd. Please e ...
#> [2] <html_block>&lt;center&gt;\n&lt;img src="https://ropensci.github.io/ ...

Created on 2018-09-04 by the reprex package (v0.2.0).
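
A smaller reproduction may make the two behaviours easier to compare. The input below is made up for illustration (it is not the drake README), and `count_nodes()` is a hypothetical helper that simply counts opening tags in the XML string. As far as I understand the CommonMark spec, an HTML block that starts with a tag such as `<table>` or `<td>` ends at the next blank line, which would explain both results: with blank lines every tag becomes its own block, and without them everything up to the end of the input is swallowed into one block.

```r
library(commonmark)

# Made-up input: an HTML table with a blank line after every tag,
# followed by a Markdown heading.
with_blanks <- c(
  "<table>", "", "<tr>", "", "<td>", "", "Release", "",
  "</td>", "", "</tr>", "", "</table>", "", "# Heading"
)
without_blanks <- with_blanks[with_blanks != ""]

# Hypothetical helper: count how many nodes of a given type appear in the
# XML produced by commonmark::markdown_xml().
count_nodes <- function(lines, tag) {
  xml <- commonmark::markdown_xml(lines)
  lengths(regmatches(xml, gregexpr(paste0("<", tag, "[ >]"), xml)))
}

# Compare the block structure of the two variants.
count_nodes(with_blanks, "html_block")
count_nodes(with_blanks, "heading")
count_nodes(without_blanks, "html_block")
count_nodes(without_blanks, "heading")
```

If that reading of the spec is right, the first variant should report several `html_block` nodes plus one heading, and the second a single `html_block` with no heading at all.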

maelle commented 6 years ago

For context, I'm trying to parse the READMEs that GitHub considers to be the preferred README (https://developer.github.com/v3/repos/contents/#get-the-readme). I must be missing something: surely if GitHub can render this table, there is a way for me to parse the Markdown file correctly. 🤔
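
One middle ground between keeping and removing all empty lines might be to drop them only inside the table, so that the table stays a single HTML block while the blank line after `</table>` still separates it from the rest of the README. This is only a sketch: `squeeze_table_blanks()` is a hypothetical helper, the input is made up, and it assumes `<table ...>` and `</table>` each start a line and that tables are not nested.

```r
library(commonmark)

# Hypothetical helper: remove blank lines that fall between a line starting
# with "<table" and the matching line starting with "</table>".
squeeze_table_blanks <- function(lines) {
  opens  <- cumsum(grepl("^<table\\b", lines, perl = TRUE))
  closes <- cumsum(grepl("^</table>", lines))
  # More tables opened than closed so far means we are inside a table.
  inside <- opens > closes
  lines[!(inside & lines == "")]
}

# Made-up input with the same shape as the problem: blank lines everywhere.
lines <- c(
  "Intro text", "", "<table>", "", "<tr>", "", "<td>", "", "Release", "",
  "</td>", "", "</tr>", "", "</table>", "", "# Heading"
)

squeezed <- squeeze_table_blanks(lines)
xml <- commonmark::markdown_xml(squeezed)
cat(xml)
```

The trade-off is that text inside the table (here `Release`) is then treated as raw HTML and no longer parsed as Markdown.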

maelle commented 6 years ago

Possibly related: https://github.com/commonmark/CommonMark/issues/490

maelle commented 6 years ago

For my very specific use case I'll use a regex to extract the HTML of the first table, but that seems suboptimal, of course!
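
A minimal sketch of that regex workaround, in base R only. The helper name `extract_first_table()` and the sample lines are made up for illustration; the real input would come from `readLines(md)`. It assumes the first table is not nested inside another one, since the lazy match stops at the first `</table>`.

```r
# Made-up stand-in for the README lines.
md_lines <- c(
  "<!-- README.md is generated from README.Rmd. Please edit that file -->",
  "",
  "<table class=\"table\">", "",
  "<thead>", "",
  "<tr class=\"header\">", "",
  "<th align=\"left\">", "", "Release", "", "</th>", "",
  "</tr>", "",
  "</thead>", "",
  "</table>", "",
  "# The drake R package"
)

# Hypothetical helper: return the HTML of the first <table>...</table> span,
# or NA if there is none.
extract_first_table <- function(lines) {
  text <- paste(lines, collapse = "\n")
  # (?s) lets "." match newlines; ".*?" is lazy, so the match ends at the
  # first closing tag.
  m <- regmatches(text, regexpr("(?s)<table\\b.*?</table>", text, perl = TRUE))
  if (length(m) == 0) NA_character_ else m
}

table_html <- extract_first_table(md_lines)
cat(table_html)
```

The extracted string could then be handed to `xml2::read_html()` if the cells need to be queried rather than just stored.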