phseiff / github-flavored-markdown-to-html

Convert markdown to HTML using the GitHub API and some additional tweaks with Python. Comes with full formula support and image compression.
MIT License
github markdown markdown-to-html markdown-to-pdf pypi-package python python3 static-site-generator

brought to you by phseiff github flavored markdown to html, aka gh-md-to-html


A user-friendly Python module and command-line frontend to convert markdown to html. It uses GitHub's online Markdown-to-html API by default (which requires an internet connection), but comes with an option for offline conversion (which closely imitates GitHub's behavior), and any other Python or command-line tool can be plugged into it as well. Whatever you use it with is automatically extended with a ton of functionality, like more in- and output options, github-flavored CSS, formula support, advanced image caching and optimization, host-ready file- and image-placement, pdf-conversion, emoji shortcode support, TOC support and more.

Whilst its main purpose is the creation of static pages from markdown files, for example in conjunction with a static website builder or GitHub Actions if you host on GitHub, it can be put to good use for any other purpose as well.

Here is a (not necessarily extensive) list of its advantages and features:

Advantages & Features:

* Lets you specify the markdown to convert as a string, as a repository path, as a local file name or as a hyperlink.
* Pulls any images referenced in the markdown files from the web or your local storage and places them in a directory relative to your specified website root, so the resulting file structure is host-ready for static sites. Multiple arguments allow the customization of the saving locations, but the images will always be referenced correctly in the resulting html files. This is especially useful since it reflects GitHub's behavior to serve cached copies of README-images instead of linking to them directly, reducing tracking and possibly downscaling overlarge images in the process.
* Creates all links as root-relative hyperlinks and lets you specify the root directory as well as the locations for css and images, but uses smart standard values for everything.
* Supports inline LaTeX-formulas (use `$`-formula-`$` to use them), which GitHub usually doesn't. gh-md-to-html uses [LaTeX](https://www.tug.org/texlive/) and [dvisvgm](https://dvisvgm.de/) if they are both installed (advantage: fast, requires no internet), and otherwise the [Codecogs EqnEditor](https://latex.codecogs.com/) (advantage: doesn't require you to install 3 GB of LaTeX libraries) to achieve this.
* Supports exporting to pdf with or without GitHub styling, using the [pdfkit](https://pypi.org/project/pdfkit/) python module (if it is installed).
* Tested and optimized to look good when using [DarkReader](https://github.com/darkreader/darkreader) (the .js-module as well as the browser extension). This is especially relevant considering that DarkReader doesn't usually shift the colors of svg images, and the formulas added by gh-md-to-html's formula support are embedded as inline svg. gh-md-to-html ensures that the formulas are the same color as the text, shifted in accordance with DarkReader's current/enabled colorscheme.
* Supports umlauts and other non-ascii-characters in plain text as well as in multiline code blocks, which the GitHub REST API usually doesn't.
* Allows you to choose which tool or module to use at its core for the basic markdown to html conversion.
* Styles its output with GitHub's README-css (can be turned off).
* Allows you to choose a width for the box surrounding the text; this can increase readability if you intend to host the markdown file stand-alone rather than embedded into a different html file (see [#25](https://github.com/phseiff/github-flavored-markdown-to-html/issues/25) and [Wikipedia](https://en.wikipedia.org/wiki/Line_length)).
* Comes with optional support for the use of `[[_TOC_]]`, `{:toc}` and `[toc]` at the beginning of an otherwise empty line to create a table of contents for the document, like GitLab-flavored markdown does, among others.
* Comes with an option to compress and downscale all images referenced in the markdown file (does not affect the original images) with a specified background color (default is white) for converting RGBA to RGB, and a specified compression rate (default is 90). Images with a specified width or height attribute in pixels get scaled down to that size to reduce loading time. This helps severely reduce the size of generated pages for markdown files with lots of images. There is also an option to save all images in multiple sizes and let the html viewer/browser pick the one fitting for the viewport size (using the img srcset attribute), which, as far as I know, makes gh-md-to-html the only md-to-html converter with builtin srcset support for image load reduction.
* If two equal images from equal or different sources are referenced in the given markdown file, and both would be saved in the same resolution et cetera, both are pointed to the same copy in the generated html to minimize loading overhead.
* Comes with an option to closely imitate GitHub's markdown-to-html-conversion behavior offline!
* Emoji shortcode support.
* Probably even more than that - this list here is no longer maintained, refer to the documentation further down this README for all options.
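
The help text further down this README explains why links are root-relative: on many static hosts, `https://foo/bar/index.html` can also be reached as `https://foo/bar`, and a plain relative link then resolves against the wrong directory. The following is a small illustration of that effect using only Python's standard library; the image path `images/pic.png` is a made-up example, and this is not code taken from gh-md-to-html itself.

```python
from typing import List
from urllib.parse import urljoin

# The same page, reachable under two URLs on many static hosts.
PAGE_URLS = ["https://foo/bar/index.html", "https://foo/bar"]


def resolve_all(link: str) -> List[str]:
    """Resolve `link` against each page URL, the way a browser would."""
    return [urljoin(page_url, link) for page_url in PAGE_URLS]


# A plain relative link depends on which URL the page was opened under.
relative = resolve_all("images/pic.png")

# A root-relative link resolves to the same target in both cases.
root_relative = resolve_all("/bar/images/pic.png")

print(relative)
print(root_relative)
```

The first list contains two different targets, the second contains the same target twice, which is why root-relative links are the safer default for hosted pages.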

In case you are looking for an alternative to Pandoc for converting markdown to PDF, here is a list of reasons why you could want to use gh-md-to-html instead of Pandoc for the job:

Reasons to use this instead of Pandoc

Whilst using pandoc to convert from markdown to pdf usually yields more beautiful results (pandoc uses LaTeX, after all), gh-md-to-html has its own set of advantages when it comes to quickly converting complex files for a homework assignment or other purposes where reliability weighs more than beauty:

* pandoc converts .md to LaTeX and then renders it to pdf, which means that images embedded in the .md are shown where they fit best in the .pdf and not, as one would expect from a .md-file, exactly where they were embedded.
* pandoc's pandoc-flavored markdown supports formulas; however, some specific rules apply regarding the amount of whitespace surrounding the `$`-signs and what characters the formula may start with. These rules do not apply in some common markdown editors like MarkText, though, which leads to lots of frustration when formulas that worked in the editor don't work anymore when converting with pandoc (MarkText's own export-to-pdf-function sometimes fails on formula-heavy files without an error message, though, which makes it even less reliable). The worst part is that, whenever pandoc fails converting .md to .pdf because of this, it shows the line number of the error based on the intermediate .tex-file instead of the input .md-file, which makes it difficult to find the problem's root. As you might have guessed, gh-md-to-html couldn't care less about the amount of whitespace you start your formulas with, leaving this decision up to you.
* pandoc supports multiple markdown flavors. The sole formula-supporting one of these is pandoc-flavored markdown, which comes with some quite specific requirements regarding the amount of whitespace before a sub-list in a nested list, and other requirements to create multi-line bullet point entries. These requirements are not fulfilled by many markdown editors (such as MarkText) and not required by many other markdown flavors, causing pandoc to not render multiline bullet point entries and nested lists correctly in many cases. gh-md-to-html, on the other hand, supports **both** nested lists like you would expect, **and** formulas, taking the burden of having to edit entire markdown files to make them work with pandoc's conversion off your shoulders.

To sum it up, pandoc's md-to-pdf-conversion acts quite unusually when it comes to images, nested lists, multiline bullet point entries, or formulas, and gh-md-to-html does not.

Installation

Use `pip3 install gh-md-to-html` to install directly from the Python package index, or `python3 -m pip install gh-md-to-html` if you are on Windows.

Both might require sudo on Linux, and you can optionally do

python3 -m pip install gh-md-to-html[pdf_export]

and install wkhtmltopdf (v0.12.6 or greater) to get the optional pdf-conversion feature and convert markdown files to pdf, and/or

python3 -m pip install gh-md-to-html[offline_conversion]

to get the optional offline-conversion feature up and running.

If you are on Windows, you might have to add wkhtmltopdf to your path in your current working directory in order to get pdf conversion to work, e.g. with `PATH=%PATH%;c:/program files/wkhtmltopdf/bin` or something similar, depending on your installation location.
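
If you are unsure whether the optional external tools can be found on your PATH, a quick check like the following can help. This is only a sketch using Python's standard library, not part of gh-md-to-html; the executable names `wkhtmltopdf`, `latex` and `dvisvgm` are assumptions about how these tools are named on your system.

```python
import shutil


def find_tool(name, search_path=None):
    """Return the full path of an executable, or None if it cannot be found."""
    return shutil.which(name, path=search_path)


def pdf_export_ready(search_path=None):
    """True if wkhtmltopdf (assumed executable name) is found on the PATH."""
    return find_tool("wkhtmltopdf", search_path) is not None


def local_formulas_ready(search_path=None):
    """True if both latex and dvisvgm (assumed executable names) are found."""
    return all(find_tool(tool, search_path) for tool in ("latex", "dvisvgm"))


if __name__ == "__main__":
    print("pdf export possible:", pdf_export_ready())
    print("local formula rendering possible:", local_formulas_ready())
```

If the first check prints `False` even though wkhtmltopdf is installed, the PATH adjustment described above is the likely fix.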

Usage

If you want to access the interface with your command line, you can just supply `gh-md-to-html` with the arguments documented in the help text (accessible with `gh-md-to-html -h` and shown below). On Windows, you must supply `python3 -m gh_md_to_html` with the corresponding arguments.

If you want to access the interface via python, you can use

import gh_md_to_html

and then use `gh_md_to_html.main()` with the same arguments (and default values) you would supply to the command line interface.

If you only want to imitate the conversion results yielded by GitHub's REST API offline, but don't want image caching, formula support and fancy CSS styling, use

html_as_a_string = gh_md_to_html.core_converter.markdown(your_markdown_as_a_string)

in Python.
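
The documentation further down also states that the Python interface accepts any function that converts a markdown string into an html string as its core converter. The following is a deliberately tiny sketch of such a function; it only understands headings and paragraphs and is meant to show the expected shape (string in, string out), not to replace a real converter. The function name `tiny_core_converter` is made up, and the keyword name `core_converter` in the final comment is an assumption derived from the `--core-converter` option.

```python
import html


def tiny_core_converter(markdown_text: str) -> str:
    """Convert a very small markdown subset (# headings, paragraphs) to html."""
    blocks = []
    # Blank lines separate blocks.
    for block in markdown_text.strip().split("\n\n"):
        text = " ".join(line.strip() for line in block.splitlines())
        if not text:
            continue
        stripped = text.lstrip("#")
        level = len(text) - len(stripped)
        if 1 <= level <= 6 and stripped.startswith(" "):
            # One to six leading '#' followed by a space: a heading.
            blocks.append(f"<h{level}>{html.escape(stripped.strip())}</h{level}>")
        else:
            # Everything else is treated as a plain paragraph.
            blocks.append(f"<p>{html.escape(text)}</p>")
    return "\n".join(blocks)


if __name__ == "__main__":
    print(tiny_core_converter("# Title\n\nHello <world>"))
    # Assumed usage (keyword name not verified here):
    # gh_md_to_html.main("inp.md", core_converter=tiny_core_converter)
```

Note that this sketch escapes all html in the input, which is a conservative choice; a real core converter would have to decide which inline html to keep.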

Documentation

Documentation (thorough introduction for starters - NOT FINISHED YET)

* **Usage**: `gh-md-to-html `
* **Default behavior**:
  By default, gh-md-to-html takes a markdown file name as an argument, and saves the generated HTML in a file of the same name, with `.html` instead of `.md`.
  Some quirks:
  * The generated CSS is stored in `github-markdown-css/github-css.css` (add `-c` to make it inline instead).
  * All referenced images are cached, stored & referenced in `./images` (add `-i` to disable this).
  * All image & css links assume that you want to host the html file with your current directory as the root directory (add `-w` if you want to directly view it in a browser instead).
  * All `id`s and file-internal links are prefaced by `user-content-`, so you can embed the generated html in a bigger website without risking ID clashes.
* **Some common use cases**:
  Through past issues, I realised that there are some very common use cases that most people seem to have for this module. Here are the most common ones, and which options and arguments to use for them:
  * **preview a GitHub README**: use `-i -w --math false --box-width 25cm`, though [grip](https://github.com/joeyespo/grip) might be more efficient for this purpose.
  * **preview a GitLab README**: see above, and add `--toc` to support GitLab's TOC syntax.
  * **as an alternative to pandoc-flavored markdown**: use `--math true --emoji-support 0 --dont-make-images-links true`.
  * **having everything in one file**: use `-i -c`.
* **Converting markdown files from the web** with `--origin-type`:
  You might want to not only convert a local markdown file, but also a file from a GitHub repository, a web-hosted one, or the contents of a string. Simply downloading these or storing them in a file is often not enough, since their location on the web also influences how the links to images they reference must be resolved. Luckily, gh-md-to-html has got your back!
  There are a number of different arguments you can use to describe what kind of file the input you gave references:
  * `--origin-type file`: The default; takes a (relative or absolute) file path.
  * `--origin-type repo`: Takes a path to a markdown file in a GitHub repository, in the format `///.md`.
  * `--origin-type web`: Takes the url of a web-hosted markdown file.
  * `--origin-type string`: Takes a string containing markdown.

  Which of these options you use influences how image links within the markdown file are resolved; a later section of this README outlines this in detail.
* **Fine-tuning what goes where**:
  gh-md-to-html is written with the goal of generating a host-ready static website for you, with your current working directory as its root. Aside from using `-w` to disable this and allow you to view the generated file directly in a browser, there are a number of options that allow you to fine-tune what goes where, and most popularly, change the root of the website. There is no need to do so unless you want to for some reason, so don't bother reading this if you don't need to!
  * `--website-root` (or `-w`): Leaving this option empty, as discussed above, allows you to preview the generated html file directly in a browser (on most systems by double-clicking it) in case you don't want to host the generated html file, but you can also supply any directory that you want to use as the website's root to this. It defaults to your current working directory.
  * `--destination` (or `-d`): The path, relative to `--website-root`, in which the generated html file is stored. By default, the website root is used for this.
  * `--image-paths` (or `-i`): You can leave this empty to disable image caching, as described above (though this won't work in case you modified `--origin-type`), or supply a path relative to website-root to modify where images are stored. It defaults to `images`.
    Image caching makes sure that two pixel-identical images are stored in the same file location, to minimize loading time for files with multiple identical images. For this reason, the `image-paths`-directory isn't automatically emptied between multiple runs of gh-md-to-html, so that this optimization also works across files when converting multiple files in bulk.
  * `--css-paths` (or `-c`): You can leave this empty to disable storing the CSS in an external CSS file (useful e.g. if you want to convert only one file), as described above, or supply a path relative to website-root to modify where the CSS file (called `github-css.css`) will be stored. The default is `github-markdown-css`.
  * `--output-name` (or `-n`): The file name under which to store the generated html file in the destination-directory. You can use `` anywhere in this string, and it will automatically be replaced with the name of the markdown file, so, for example, `gh-md-to-html inp.md -n "-conv.html"` will store the result in `inp-conv.html` (this doesn't work with `--origin-type string`, of course).
    You can also use `-n print` in order to simply write the output to STDOUT (print it on the console) instead of saving it anywhere. The default value is `.html`, so it adapts to your input file name.
  * `--output-pdf` (or `-p`): The file in which to store the generated pdf. You can use the ``-syntax here as well. If the `-p`-option isn't used, no pdf will be generated (and you need to have followed the pdfkit & wkhtmltopdf installation instructions above to have this option work), but you can use `-p` without any arguments to have it use `.pdf` as a sensible file name default.
* **exporting as pdf**:
  As mentioned above, you can export the generated HTML file as a pdf using the `--output-pdf`-option. Doing so requires you to have `wkhtmltopdf` installed (the Qt-patched version), to add it to the PATH (if you are on Windows), and to have `pdfkit` installed (e.g. via `pip3 install gh-md-to-html[pdf_export]`), but all of these requirements are already outlined above in the [installation](#installation) section.
  There are some things worth noting here, though. First of all, DO NOT use this option if you have valuable information in a file called `{yourpdfexportdestination}.html`, where `{yourpdfexportdestination}` is what you supplied to `-p`, since this file will be temporarily overwritten in the process; furthermore, do not use `-p` at all if you are supplying untrusted input to the `-x`-option.
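
To avoid losing data to the temporary file mentioned above, you could run a small pre-flight check before calling gh-md-to-html with `-p`. The sketch below is not part of gh-md-to-html; it reads the warning literally (the pdf destination with `.html` appended) and assumes the temporary file is written into the directory you pass in, so treat both as assumptions and adapt them to your setup.

```python
from pathlib import Path


def clobbered_html_path(pdf_destination, directory="."):
    """Path of the html file that, per the warning above, gets overwritten."""
    # Assumption: the temporary file is '<pdf destination>.html' in `directory`.
    return Path(directory) / f"{pdf_destination}.html"


def safe_to_export(pdf_destination, directory="."):
    """True if no existing file would be temporarily overwritten."""
    return not clobbered_html_path(pdf_destination, directory).exists()


if __name__ == "__main__":
    target = "report.pdf"  # hypothetical value supplied to -p
    if safe_to_export(target):
        print("No conflicting file found for", target)
    else:
        print("Move or rename", clobbered_html_path(target), "before exporting")
```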
  There are also some options specifically tailored for use with `-p`; these are currently:
  * `--style-pdf` (or `-s`): Set this to `false` to disable styling the generated PDF file with GitHub's CSS. You might want to do this because the border that GitHub's CSS draws around the page can look counterintuitive in PDFs, though doing so can also negatively influence the appearance of other parts, so use this with a grain of salt.
* **changing which core markdown converter to use**:
  gh-md-to-html doesn't actually do all that much heavy lifting itself when it comes to parsing markdown and converting it to HTML; instead, it wraps around a so-called "core converter" that does the basic conversion according to the markdown spec, and builds its own options, features, customizations and styling on top of that. By default, the GitHub markdown REST API is used for that, since it comes closest to what GitHub does with its READMEs, but you can also give gh-md-to-html any other basic markdown converter to work with. gh-md-to-html also comes with two built-in alternative core converters that imitate GitHub's REST API as closely as possible whilst adding their own personal touch to it.
  Option to decide the core converter:
  * `--core-converter` (or `-o`): You can use this option to choose from a number of pre-defined core converters (see below) in case you want to differ from the default one. You can also supply a bash command (on UNIX/Linux systems) to this, or a cmd.exe command on Windows, in which `{md}` stands as a placeholder for where the shell-escaped input markdown will be inserted by gh-md-to-html. For example,
    `gh-md-to-html inp.md -o "pandoc -f markdown -t html <<< {md}"`
    will use pandoc as its core converter.
    You can also do so using multiple commands, like
    `gh-md-to-html -o "printf {md} >> temp.md; pandoc -f markdown -t html temp.md; rm temp.md"`,
    as long as the result is printed to stdout. If you use the Python-interface to gh-md-to-html, you can also supply any function that converts a markdown string into a html string to this argument.

  Pre-defined core converters that you can easily supply to `--core-converter` as strings:
  * `OFFLINE`: Imitates GitHub's markdown REST API, but offline using mistune. This requires the optional dependencies for "offline_conversion" to be satisfied, by using `pip3 install gh-md-to-html[offline_conversion]` or `pip3 install "mistune>=2.0.0rc1"`.
  * `OFFLINE+`: Behaves identically to OFFLINE, but it doesn't remove potentially harmful content like javascript and css like the GitHub REST API usually does. DO NOT USE THIS FEATURE unless you need a way to convert secure manually-checked markdown files without having all your inline js/styling stripped away!
* **support for inline-formulas**:
  `gh-md-to-html` supports, by default, inline formulas (no matter which core converter, see above, you use). This means that you can write a LaTeX formula between two dollar signs on the same line, and it will be replaced with an SVG image displaying said formula. For example,
  `$e = m \cdot c^2$`
  will add Einstein's famous formula as a svg image, well-aligned with the rest of the text surrounding it, into your document. `gh-md-to-html` always tries to use your local LaTeX installation to do this conversion (advantage: fast and doesn't require internet). However, if [LaTeX](https://www.tug.org/texlive/) or [dvisvgm](https://dvisvgm.de/) are not installed or it can't find them, it uses [an online converter](https://latex.codecogs.com/) (advantage: doesn't require you to install 3 GB of LaTeX libraries) to achieve this. You can use the following options to modify this behavior:
  * `--math` (or `-m`): Set this to `false` to disable formula rendering.
  * `--suppress-online-fallbacks`: Set this to `true` to disable the online fallback for formula rendering, raising an error if its requirements aren't locally installed or can't be found for some reason.
* **image caching and image compression**:
  As explained in-depth above, gh-md-to-html saves images so they can all be loaded from the same folder. This comes with the advantages of
  * potentially reducing tracking (in case the images were hosted on a 3rd-party website)
  * reducing the number of DNS lookups required to show your generated HTML file (in case the images were hosted on different 3rd-party websites)
  * reducing the number of images to load (if one or multiple md files you intend to host or view as html files contain the same or pixel-identical images)

  In addition to these advantages, gh-md-to-html also allows you to set a level of image compression to use for these images. If you decide to do so, every image will be converted to JPEG (using a background color and quality settings of your liking), and images will be downscaled if the generated html states that they won't be needed at their full size anyways (you can make use of this e.g. by using ``-tags directly in your document and supplying them with an explicit `width` or `height` value). As far as I know, gh-md-to-html is also the only markdown converter capable of making use of the html `srcset`-attribute, which allows the generated document to reference several differently scaled versions of the same image, of which the browser will then load the smallest large-enough one on smaller screen sizes, leading to great load reductions e.g. on mobile. Enabling this feature can lead to further loading time reductions without sacrificing any visible image quality, which makes gh-md-to-html a good choice if you want to generate fast-loading websites from your image-heavy markdown files.
  The option to use for all of this is `--compress-images`, and it accepts a piece of JSON data with the following attributes:
  * `bg-color`: the color to use as a background color when converting RGBA-images to jpeg (an RGB-format). Defaults to "`white`" and accepts almost any HTML5 color-value ("`#FFFFFF`", "`#ffffff`", "`white`" and "`rgb(255, 255, 255)`" would all be valid values).
  * `progressive`: Save images as progressive jpegs. Default is False.
  * `srcset`: Save differently scaled versions of the image and provide them to the image in its srcset attribute. Defaults to False. Takes an array of different widths or `True`, which serves as a shortcut for "`[500, 800, 1200, 1500, 1800, 2000]`".
  * `quality`: a value from 0 to 100 describing at which quality the images should be saved. Defaults to 90. If a specific size is specified for a specific image in the html, the image is always converted to the right size *before* reducing the quality.

  If this argument is left empty, no compression is used at all. If this argument is set to True, all default values are used. If it is set to JSON data and some values are omitted, the default ones are used for these. You can also pass a dict instead of a string containing JSON data if you are using this option in the Python frontend.
  Image compression won't work, for obvious reasons, if you use `-i` to disable image caching.
* **my personal choices**:
  GitHub-flavored markdown and markdown in general make some unpopular choices, and gh-md-to-html, imitating it, also makes a lot of these. If your goal isn't to be as close as possible to (github-flavored) markdown, and you want to utilize the full power that gh-md-to-html offers, I recommend the following (very opinionated) list of settings and options. Note that some of these aren't safe when converting user-generated content, though.
  * `--math true`: This is already enabled by default, so not really a recommendation, but you'll most likely want to have LaTeX math support in your file.
  * `--core-converter OFFLINE+`: This converts the markdown files offline instead of using GitHub's REST API, and allows the use of unsafe things like inline code and every html you could wish for in your markdown file.
  * `--compress-images`: There are many ways to finetune this option, but it allows for some great optimizations on the cached images, including the use of the HTML `srcset`-attribute, which no other markdown converter currently supports afaik.
  * `--box-width 25cm`: You'll most likely want to limit the width of the box in which the generated website's content is displayed [for reasons of readability](https://en.wikipedia.org/wiki/Line_length), unless you plan to embed the generated html into a bigger html file.
  * `--toc true`: This allows you to use `[[_TOC_]]` as a shortcut for a table of contents in the generated file.
  * `--dont-make-images-links true`: By default, GitHub wraps every image into a link to the image source, unless the image is already wrapped into a different link. This option disables this behavior for more control over your image's links.
  * `--emoji-support 2`: gh-md-to-html supports using emoji shortcodes, like `:joy:`, which are then replaced with emojis in the generated html file. `--emoji-support 2` takes this one level further by allowing you to use your own custom emojis, so `:path/to/funny_image.png:` will add `funny_image.png` as an emoji-sized emoji into the text.
  * `--soft-wrap-in-code-boxes true`: By default, GitHub displays its multiline code boxes with a horizontal scrollbar if they are at a risk of overflowing. Use this option to have (imho more reasonable) soft-wrap in code boxes instead.
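
Since `--compress-images` expects JSON, quoting it by hand on the command line gets fiddly. The sketch below assembles the opinionated option set from the list above into one command line using only Python's standard library. The input file name `inp.md` and the chosen `srcset` widths are example values, and the script only builds and prints the command rather than running it. Remember that `OFFLINE+` is not safe for untrusted markdown.

```python
import json
import shlex

# Attributes documented for --compress-images; the srcset widths are examples.
compress_settings = {
    "bg-color": "white",
    "progressive": False,
    "srcset": [500, 800, 1200],
    "quality": 90,
}

args = [
    "gh-md-to-html", "inp.md",  # inp.md is a placeholder input file
    "--math", "true",
    "--core-converter", "OFFLINE+",  # unsafe for user-generated content
    "--compress-images", json.dumps(compress_settings),
    "--box-width", "25cm",
    "--toc", "true",
    "--dont-make-images-links", "true",
    "--emoji-support", "2",
    "--soft-wrap-in-code-boxes", "true",
]

# shlex.join quotes the JSON so that a POSIX shell passes it as one argument.
command_line = shlex.join(args)
print(command_line)
```

The printed line can be pasted into a POSIX shell; on Windows' cmd.exe the quoting rules differ, so the JSON argument would need to be quoted differently there.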
Help text (look up what every option does)

All arguments and how they work are documented in the help text of the program, which looks like the following. Please note that the options are listed ordered by relevance, and all of them have sensible defaults, so don't feel overwhelmed by how many there are; you can just read through them until you find what you were looking for, and safely ignore the rest.
Most of the options are meant to customize default behavior, so none of them are mandatory for most use cases.

```
usage: __main__.py [-h] [-t {file,repo,web,string}]
                   [-w WEBSITE_ROOT [WEBSITE_ROOT ...]]
                   [-d DESTINATION [DESTINATION ...]]
                   [-i [IMAGE_PATHS [IMAGE_PATHS ...]]]
                   [-c CSS_PATHS [CSS_PATHS ...]]
                   [-n OUTPUT_NAME [OUTPUT_NAME ...]]
                   [-p OUTPUT_PDF [OUTPUT_PDF ...]] [-s STYLE_PDF]
                   [-f FOOTER [FOOTER ...]] [-m MATH]
                   [-x EXTRA_CSS [EXTRA_CSS ...]]
                   [-o CORE_CONVERTER [CORE_CONVERTER ...]]
                   [-e COMPRESS_IMAGES [COMPRESS_IMAGES ...]]
                   [-b BOX_WIDTH [BOX_WIDTH ...]] [-a TOC]
                   MD-origin [MD-origin ...]

Convert markdown to HTML using the GitHub API and some additional tweaks with
python.

positional arguments:
  MD-origin             Where to find the markdown file that should be
                        converted to html

optional arguments:
  -h, --help            show this help message and exit
  -t {file,repo,web,string}, --origin-type {file,repo,web,string}
                        In what way the MD-origin-argument describes the
                        origin of the markdown file to use. Defaults to file.
                        The options mean: * file: takes a relative or
                        absolute path to a file * repo: takes a path to a
                        markdown-file in a github repository, such as
                        ///.md * web: takes an url to a markdown file *
                        string: takes a string containing the files content
  -w WEBSITE_ROOT [WEBSITE_ROOT ...], --website-root WEBSITE_ROOT [WEBSITE_ROOT ...]
                        Only relevant if you are creating the html for a
                        static website which you manage using git or
                        something similar. --website-root is the directory
                        from which you serve your website (which is needed to
                        correctly generate the links within the generated
                        html, such as the link pointing to the css, since
                        they are all root-
                        relative), and can be a relative as well as an
                        absolute path. Defaults to the directory you called
                        this script from. If you intent to view the html file
                        on your laptop instead of hosting it on a static
                        site, website-root should be a dot and destination
                        not set. The reason the generated html files use
                        root-relative links to embed images is that on many
                        static websites, https://foo/bar/index.html can be
                        accessed via https://foo/bar, in which case relative
                        (non-root-
                        relative) links in index.html will be interpreted as
                        relative to foo instead of bar, which can cause
                        images not to load.
  -d DESTINATION [DESTINATION ...], --destination DESTINATION [DESTINATION ...]
                        Where to store the generated html. This path is
                        relative to --website-root. Defaults to "".
  -i [IMAGE_PATHS [IMAGE_PATHS ...]], --image-paths [IMAGE_PATHS [IMAGE_PATHS ...]]
                        Where to store the images needed or generated for the
                        html. This path is relative to website-root. Defaults
                        to the "images"-folder within the destination folder.
                        Leave this option empty to completely disable image
                        caching/downloading and leave all image links
                        unmodified.
  -c CSS_PATHS [CSS_PATHS ...], --css-paths CSS_PATHS [CSS_PATHS ...]
                        Where to store the css needed for the html (as a path
                        relative to the website root). Defaults to the
                        "/github-markdown-css"-folder.
  -n OUTPUT_NAME [OUTPUT_NAME ...], --output-name OUTPUT_NAME [OUTPUT_NAME ...]
                        What the generated html file should be called like.
                        Use within the value to refer to the name of the
                        markdown file that is being converted (if you don't
                        use "-t string"). You can use '-n print' to print the
                        file (if using the command line interface) or return
                        it (if using the python module), both without saving
                        it. Default is '.html'.
  -p OUTPUT_PDF [OUTPUT_PDF ...], --output-pdf OUTPUT_PDF [OUTPUT_PDF ...]
                        If set, the file will also be saved as a pdf file in
                        the same directory as the html file, using pdfkit, a
                        python library which will also need to be installed
                        for this to work. You may use the variable in this
                        value like you did in --output-name. Do not use this
                        with the -c option if the input of the -c option is
                        not trusted; execution of malicious code might be the
                        consequence otherwise!!
  -s STYLE_PDF, --style-pdf STYLE_PDF
                        If set to false, the generated pdf (only relevant if
                        you use --output-pdf) will not be styled using
                        github's css.
  -f FOOTER [FOOTER ...], --footer FOOTER [FOOTER ...]
                        An optional piece of html which will be included as a
                        footer where the 'hosted with <3 by github'-footer in
                        a gist usually is. Defaults to None, meaning that the
                        section usually containing said footer will be
                        omitted altogether.
  -m MATH, --math MATH  If set to True, which is the default, LaTeX-formulas
                        using $formula$-notation will be rendered.
  -x EXTRA_CSS [EXTRA_CSS ...], --extra-css EXTRA_CSS [EXTRA_CSS ...]
                        A path to a file containing additional css to embed
                        into the final html, as an absolute path or relative
                        to the working directory. This file should contain
                        css between two