lapwat / papeer

Scrape the web in the eink era. Convert websites into ebooks and markdown.
https://papeer.tech
GNU General Public License v3.0

Support for <tt> for code/typewriter text #5

Closed by ljrk0 2 years ago

ljrk0 commented 2 years ago

As per https://github.com/JohannesKaufmann/html-to-markdown/issues/49, some websites don't use semantic markup and instead use <tt> directly for code or typewriter text.

Adding this rule to the Markdown converter improves the output considerably:

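    // Render <tt> elements as inline code spans.
    // Assumes md is github.com/JohannesKaufmann/html-to-markdown and
    // goquery is github.com/PuerkitoBio/goquery.
    converter.AddRules(
        md.Rule{
            Filter: []string{"tt"},
            Replacement: func(content string, selec *goquery.Selection, opt *md.Options) *string {
                // Wrap the already-converted inner content in backticks.
                content = "`" + content + "`"

                return &content
            },
        },
    )

ljrk0 commented 2 years ago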

Note that this does not handle collapsing of inline whitespace, unescaping, etc. The logic for that already exists at https://github.com/JohannesKaufmann/html-to-markdown/blob/master/commonmark.go#L297, but it uses private methods that would need to be duplicated.
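
For illustration, here is a minimal sketch of what part of that duplicated logic might look like, using only the standard library. The names `wrapInlineCode` and `longestBacktickRun` are hypothetical and are not part of html-to-markdown. The sketch only collapses whitespace and picks a backtick delimiter that cannot clash with backticks in the content. It does not attempt unescaping, and it is not a copy of the library's private methods.

```go
package main

import "strings"

// longestBacktickRun returns the length of the longest run of
// consecutive backticks in s.
func longestBacktickRun(s string) int {
	longest, current := 0, 0
	for _, r := range s {
		if r == '`' {
			current++
			if current > longest {
				longest = current
			}
		} else {
			current = 0
		}
	}
	return longest
}

// wrapInlineCode is a hypothetical helper that turns text into a
// Markdown code span. It is an assumption for illustration, not the
// library's implementation.
func wrapInlineCode(content string) string {
	// Collapse every run of whitespace (including newlines) into a
	// single space and trim both ends.
	content = strings.Join(strings.Fields(content), " ")
	if content == "" {
		return ""
	}

	// The delimiter must be longer than any backtick run in the content.
	fence := strings.Repeat("`", longestBacktickRun(content)+1)

	// Pad with spaces when the content itself starts or ends with a
	// backtick, so it does not merge with the delimiter.
	if strings.HasPrefix(content, "`") || strings.HasSuffix(content, "`") {
		content = " " + content + " "
	}

	return fence + content + fence
}
```

With something like this in place, the body of the `Replacement` function above could call `wrapInlineCode(content)` instead of concatenating backticks by hand. Unescaping would still need separate handling.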