mixmark-io / turndown

🛏 An HTML to Markdown converter written in JavaScript
https://mixmark-io.github.io/turndown
MIT License
8.86k stars 880 forks source link

Performance issues #265

Open fabiospampinato opened 5 years ago

fabiospampinato commented 5 years ago

I've been using turndown in enex-dump for converting Evernote's notes to markdown. I've tried to optimize it's performance, but the main bottleneck comes from turndown, here you can check the actual function I'm using, I've tried removing all my custom regexes and extra rules and they barely make any difference.

Part, if not most, of the bloat I think comes from jsdom, which turndown imports. Just because of jsdom I'm seeing ~10MB of dependencies getting installed, which is definitely not good for my use case where I'm embeddeding enex-dump in an Electron application. Also jsdom provides an implementation of the DOM, which I'm not sure a project like turndown really needs, shouldn't a fast xml parser like fast-xml-parser get the job done with significantly less bloat?

Thanks.

domchristie commented 5 years ago

One of the main benefits of using the DOM is that you get HTML parsing for free in browsers:

https://github.com/domchristie/turndown/blob/e2c64f121a170ad04363c7da3bbf978ebcde5f7f/src/html-parser.js#L41-L47

We also use methods defined on DOM nodes for querying. This makes it very easy for developers to extend the rules because the DOM Node/Element interface is familiar. Of course we could re-implement all these methods if we were to parse HTML as JSON, but that seems unnecessary given the above, and we'd lose the familiarity.

I'm not very familiar with Electron, so this could be wrong, but if it effectively just renders HTML to a webview, then it should be possible to use the built in parser, rather than importing JSDOM. For this you should import the browser version from lib/turndown.browser.es.js.

Hope that helps.

fabiospampinato commented 5 years ago

I don't understand the title change, enex-dump is a standalone application and has nothing to do with Electron, turndown is not bloated and slow only when used in Electron applications.

I get your point of view, after all the maintainers are the ones putting the actual work into this, and the project should be something they feel comfortable with. I just wanted to express a bit of frustration since the cause of slowness/bloat for my use case comes from this library.

I'm not very familiar with Electron, so this could be wrong, but if it effectively just renders HTML to a webview, then it should be possible to use the built in parser, rather than importing JSDOM. For this you should import the browser version from lib/turndown.browser.es.js.

Interesting, maybe I could instruct webpack to load the es version for browsers instead.

The problem with this is that, if I understand this correctly, one should be using turndown in a renderer process (the one that actually renders the HTML) to take advantage of this, but it's generally better to do the heavy work on a separate (non-renderer, I guess) process so that the UI feels responsive at all times.

Thanks!

domchristie commented 5 years ago

I don't understand the title change, enex-dump is a standalone application and has nothing to do with Electron, turndown is not bloated and slow only when used in Electron applications.

Sorry if I misunderstood. You mentioned your use case was "embeddeding enex-dump in an Electron application" so I thought that could be where the problem lies. I carried out some brief performance tests after launching v4, and it seemed to be significantly faster than other similar libraries. These tests were performed using the CommonJS build on a Node.js server. As I mentioned, I think the underlying issue is that you're using the non-browser build, and therefore embedding JSDOM within a browser, which is not its intended use :/ We may need to look into updating a few package.json fields in order to prevent this happening so much.

I get your point of view, after all the maintainers are the ones putting the actual work into this, and the project should be something they feel comfortable with. I just wanted to express a bit of frustration since the cause of slowness/bloat for my use case comes from this library.

I'm open to pull requests and feedback. You are obviously more than welcome to explore the source code, replace JSDOM with a parser of your choosing, or suggest improvements. However, the issues aren't really a place to vent frustration in this way.

Interesting, maybe I could instruct webpack to load the es version for browsers instead.

👍

it's generally better to do the heavy work on a separate (non-renderer, I guess) process

Technically true, but I'd recommend trying it and seeing how it feels. If you're converting a significant amount of HTML then you may want to perform this on the server.

fabiospampinato commented 5 years ago

Sorry if I misunderstood. You mentioned your use case was "embeddeding enex-dump in an Electron application" so I thought that could be where the problem lies.

My bad, I should have said more explicitly that I basically have 2 use cases for turndown: 1) using it in enex-dump 2) using it indirectly in an Electron app that uses enex-dump.

I carried out some brief performance tests after launching v4, and it seemed to be significantly faster than other similar libraries.

I don't doubt that, but in my tests turndown needs about 80ms to process a particular html string, which I can convert to json in about 2ms with fast-xml-parser. I'm kind of comparing apples to oranges here, but should converting the document to markdown take ~40x as much as parsing it? Maybe not, I think there's quite some room for improvement here.

As I mentioned, I think the underlying issue is that you're using the non-browser build, and therefore embedding JSDOM within a browser, which is not its intended use :/

I think we should consider my 2 use cases separately for this:

However, the issues aren't really a place to vent frustration in this way.

I'm sorry, but I've opened this issue because turndown requires ~10mb of dependencies and because I think it's not very performant, aren't these good enough reasons for opening an issue? The fact that this frustrates me is irrelevant.

Technically true, but I'd recommend trying it and seeing how it feels. If you're converting a significant amount of HTML then you may want to perform this on the server.

The UI blocks for multiple minutes when converting a large enough number of html strings (100s to 1000s of html strings), on the renderer process. I'll definitely try to run this on another process, hoping to have access to the browser's native parser there.

I'm open to pull requests and feedback. You are obviously more than welcome to explore the source code, replace JSDOM with a parser of your choosing, or suggest improvements.

Unfortunately I don't have the time to make major changes (switching from a full-blown DOM implementation to an xml parser) to this unfamiliar codebase, but I think there might be a few things I can help with:

  1. I have to check if within Electron turndown is still using jsdom, if that's the case it should be fixed.
  2. jsdom should be loaded conditionally, in such a way that it's import can be removed entirely via dead code elimination.
domchristie commented 5 years ago

I've opened this issue because turndown requires ~10mb of dependencies and because I think it's not very performant, aren't these good enough reasons for opening an issue? The fact that this frustrates me is irrelevant.

Opening an issue to discuss performance issues is fine. Stating that "Turndown is slow and bloated" probably isn't the best way to start one.

I try as much as possible to reduce the number of dependencies; dealing with frequent updates, incompatibilities and conflicts is never fun! It's why Turndown aims to use the simplest set up possible: virtually no transpilation—just ES6 modules compiled with Rollup using npm scripts. JSDOM was chosen early on because it is reliable and well maintained. It offers a solid implementation of the DOM allowing for Turndown rules to use familiar node/element properties, and for them to be shared across different environments: Node and browser. If there is a library that offers similar levels of reliability, with a nice API, that is well maintained, then I'd happily switch to it.

I don't think I'm doing anything "unintended" here though, just importing turndown and using it.

This is a key point. In a Node.js environment, the a node version (with JSDOM) will probably be used. For building JS for electron apps, this is obviously not what we want. There is some discussion around this issue (https://github.com/defunctzombie/package-browser-field-spec) but I've not looked into its status.

One thing to note re: enex-dump, you are using an old version of Turndown. Turndown v5 updated the escaping to be more aggressive, but almost certainly more efficient.

domchristie commented 5 years ago

jsdom should be loaded conditionally, in such a way that it's import can be removed entirely via dead code elimination.

I believe this is already the case:

https://github.com/domchristie/turndown/blob/e2c64f121a170ad04363c7da3bbf978ebcde5f7f/src/html-parser.js#L27-L56

process.browser determines whether JSDOM is required or not. This is set in config: https://github.com/domchristie/turndown/tree/master/config

fabiospampinato commented 5 years ago

Opening an issue to discuss performance issues is fine. Stating that "Turndown is slow and bloated" probably isn't the best way to start one.

That was just a descriptive short title for the content of my message, I didn't mean to offend anyone with it, sorry if it did.

If there is a library that offers similar levels of reliability, with a nice API, that is well maintained, then I'd happily switch to it.

If you're looking for another DOM implementation I'm not sure there are significantly better ones. My argument is that using a XML parser instead would be better for performance, perhaps at the cost of (slightly?) larger bundles for browsers.

One thing to note re: enex-dump, you are using an old version of Turndown. Turndown v5 updated the escaping to be more aggressive, but almost certainly more efficient.

Cool, I didn't realize another version was out. I just benchmarked it and it seems the performance improvement is about 0.05% for me.

jsdom should be loaded conditionally, in such a way that it's import can be removed entirely via dead code elimination. I believe this is already the case:

Kind of, you're completely omitting the import from the bundle, but if the right bundle isn't loaded (which seems to be the case for me) that doesn't matter. Maybe the browser property in package.json can solve that, I'm not very familiar with it 🤔

domchristie commented 5 years ago

Kind of, you're completely omitting the import from the bundle, but if the right bundle isn't loaded (which seems to be the case for me) that doesn't matter

Yes, so I suppose the solution is to ensure the correct version is loaded. The possible versions should probably be documented to help this, so FWIW:

For Node.js environments (JSDOM will be required):

For browser environments (JSDOM will not be required):

fabiospampinato commented 5 years ago

Those are a lot of bundles.

I did some tests and:

  resolve: {
    alias: {
      turndown$: require ( 'path' ).join ( __dirname, 'node_modules', 'turndown', 'lib', 'turndown.browser.cjs.js' )
    }
  }

I see turndown already has a browser property in its package.json:

https://github.com/domchristie/turndown/blob/e2c64f121a170ad04363c7da3bbf978ebcde5f7f/package.json#L9-L11

which if my understanding is correct in this particular case should instruct the bundler not to include "jsdom" along with the bundle when bundling for the browser. But I'm not sure how webpack handles this, maybe it uses it if the target is web but not if it is electron-renderer.

Both turndown and jsdom weren't considered external dependencies during my tests.

I'll try to profile turndown and check if there's any low hanging fruit with regard to performance.

domchristie commented 5 years ago

Thanks for taking the time to investigate this further!

Those are a lot of bundles.

The state of JS 😩

Would it be possible to do something like the following in your source:

require('turndown/lib/turndown.browser.es.js')

Perhaps something like this approach might be worth looking into: https://blog.patrickhulce.com/2016/09/15/exposing-multiple-exports-in-a-single-npm-package/?

If I patch package.json manually and force it to load the right file then the performance difference isn't really noticeable, so I guess jsdom's implementation of the DOM isn't significantly slower than chrome's, which would be quite surprising to me, or the cause of the slowness is elsewhere in the codebase.

Interesting. Previous performance issues were due to slow-running regular expressions (https://github.com/domchristie/turndown/issues/188#issuecomment-332216019). These issues should have been resolved by now, but it might be a good starting point?

fabiospampinato commented 5 years ago

Would it be possible to do something like the following in your source: require('turndown/lib/turndown.browser.es.js')

enex-dump is supposed to also work on node environments, so I can't do that. The Electron app that imports enex-dump has no access to that import (I mean unless I manually patch the dependency, which I'd rather not do) so I can't do that there either.

Perhaps something like this approach might be worth looking into: https://blog.patrickhulce.com/2016/09/15/exposing-multiple-exports-in-a-single-npm-package/?

This could be a solution, but I'd prefer to have just 1 export from enex-dump.

Interesting. Previous performance issues were due to slow-running regular expressions (#188 (comment)). These issues should have been resolved by now, but it might be a good starting point?

Regexes may be the bottleneck, I think I'll just profile turndown with the devtools and see what I get. I'll report back as soon as I have some numbers, probably early next week.

Vaelatern commented 5 years ago

Just notable that there are 96 dependencies including jsdom, including a package for working with ssh keys. Would a PR to slim down jsdom as a new package, or as an internal dependency, be accepted?

fabiospampinato commented 5 years ago

@domchristie I've quickly benchmarked turndown and I couldn't find any singular function responsible for the slowness, but it seems that garbage collection can be significant (maybe using a DOM implementation is largely responsible for this) and on my machine just requiring turndown (require ( 'turndown' )) takes half a second, I think this might be because lots of files get read from the disk and have to be parsed.

IMHO the less dependencies the better, my guess is that ditching jsdom all together will give the most performance improvements.

What do you think?

domchristie commented 5 years ago

Would a PR to slim down jsdom as a new package, or as an internal dependency, be accepted?

ditching jsdom all together will give the most performance improvements.

These are interesting ideas. My only concern is that it might be tricky to find a package that is as reliable and well maintained as jsdom, and I'd rather not have to maintain it as an internal dependency.

I'm beginning to think that the best approach might be to implement a Bring Your Own Parser for Node.js environments (as suggested in https://github.com/domchristie/turndown/issues/97). In the meantime obviously feel free to fork and experiment with other parsers.

domchristie commented 5 years ago

I'm beginning to think that the best approach might be to implement a Bring Your Own Parser for Node.js environments

I should add that this is sort of already the case, as Turndown accepts DOM elements as input (as well as strings). So if you're not content with the parsing performance of JSDOM, you could parse the HTML yourself and pass in a valid DOM element into turndown. However this does not solve the slowness of installing and requiring JSDOM. I'm wondering if the HTML parser could be lazily loaded and memoized if required. This would mean that calling turndown for the first time be slow, but it might be favourable?

fabiospampinato commented 5 years ago

@domchristie I'm not familiar with turndown's internals, but I guess if turndowns receives a dom node rather than a string then maybe there's no need to require jsdom at all.

But I think this might just move the problem somewhere else, as now the user would have to parse the string into a dom himself, and if he uses jsdom we aren't achieving anything in terms of performance, and if he doesn't use jsdom what will he use? I'm not familiar with jsdom's internals either, but there's a reason if it's quite popular, and I think its maintainers would also prefer to have a 100kb jsdom rather than a multi-megabytes one, but maybe providing a dom implementation is just something you can't do in 100kb.

IMHO the best option here might be to switch to a fast xml parser instead.

domchristie commented 5 years ago

Yes, technically JSDOM is only required in node.js environments when the input to turndown is a string. In practice, the current implementation requires it by default so that it can be readily used when turndown is called; after all, most developers will just have an HTML string they want to convert, rather than a DOM tree.

But I think this might just move the problem somewhere else, as now the user would have to parse the string into a dom himself, and if he uses jsdom we aren't achieving anything in terms of performance, and if he doesn't use jsdom what will he use?

This highlights the predicament well. The library doesn't have to accept string inputs, but doing so makes it very convenient for developers. I've chosen JSDOM for its accuracy, stability, and its standardised output that is compatible in other environments. The trade-off is that it might be a bit slow when either starting up, or when converting large amounts of text. It sort of comes down to fast, good, or cheap—choose two, where "fast" is performance, and "cheap" is refers to my time :D.

So to summarise, my approach has been to accept HTML strings as input to make it easy-to-use, but with the option of using your own parser.


I have been looking at using parse5, which I believe JSDOM uses behind the scenes. parse5-htmlparser2-tree-adapter outputs something close to usable with a few modifications:

I'm sure there'll be more breakages, but that's as far as I have got!

fabiospampinato commented 5 years ago

the current implementation requires it by default so that it can be readily used when turndown is called; after all, most developers will just have an HTML string they want to convert, rather than a DOM tree.

I think it may be better to require the expensive dependencies only when they are actually needed. For instance in Notable turndown is used only when/if the user imports an .enex file, there's no reason to pay for the time to require jsdom until it's actually needed.

But this is the minor problem, at least for my use case, I can lazily import it myself, but my bundled app is still considerably heavier than it needs to be because of jsdom.

It sort of comes down to fast, good, or cheap—choose two, where "fast" is performance, and "cheap" is refers to my time :D. So to summarise, my approach has been to accept HTML strings as input to make it easy-to-use, but with the option of using your own parser.

To be clear I'm not criticizing your design decisions, I would have probably made similar ones, I'm just saying that there's some big room for improvements and I'd be interested in seeing them.

I have been looking at using parse5, which I believe JSDOM uses behind the scenes.

I've taken a look at it and I see it has basically no dependencies, the whole thing may require less than 500kb, that's a huge improvement! If by using it turndown could also be kept mostly, if not completely, backwards compatible then that's even better.

@domchristie thank you for taking a deeper look at this!

domchristie commented 5 years ago

https://github.com/montagejs/minidom looks ideal for what we're after. With only a couple of changes, I have it running with 235/241 assertions passing. Most the failures look like they're to do with collapsing whitespace. This might be an issue with minidom's parse5 dependency (which is v1, whereas the latest parse5 is v5), but it could be that the collapse function is using attributes that are not supported in minidom.

domchristie commented 5 years ago

Down to 3 failing tests due to whitespace issues: https://github.com/domchristie/turndown/tree/minidom

domchristie commented 5 years ago

All tests are passing on https://github.com/domchristie/turndown/tree/minidom. I think I'll cherry-pick a few a commits from there into mater so that the library is mostly compatible with DOM Level 1 (+ textContentfrom level 3 and innerHTML and outerHTML), allowing minidom-generated DOM trees to be passed in.

I still like the features and familiarity that JSDOM provides, but I'm open to suggestions as to how we might reduce the performance cost, whilst maintaining the ease-of-use.

domchristie commented 5 years ago

I'm not familiar with turndown's internals, but I guess if turndowns receives a dom node rather than a string then maybe there's no need to require jsdom at all.

Thinking about this a bit more, I wonder if we create another package which only accepts DOM elements as input (e.g. turndown-core naming suggestions welcome!). Then Turndown could depend on that, and modify the turndown method to parse the input if necessary.

This has the benefits of maintaining the same API, whilst providing a JSDOM-free version for those happy to perform the parsing themselves. The downside is that it becomes a bit confusing for people to know where to raise issues, browse source code, and open pull requests.

fabiospampinato commented 5 years ago

@domchristie I'll play with the minidom branch and report back.

IMHO the best case scenario would be that the library maintains backwards compatibility but gets significantly less heavy. Are there any important features missing in minidom currently?

turndown-core

I would be happy to use that, but if possible I'd try to avoid releasing another entry point for turndown, the current situation is kind of complicated already.

Another option for Node.js builds could be this:

  1. if turndown receives an object we don't load any third party DOM implementations
  2. if it receives a string we look for a DOM implementation a. we check if we are in a browser context (maybe we are inside Electron's renderer process), something like global.window && global.window.document && global.window.document.toString () === '[object HTMLDocument]', and if so use the native implementation b. if not we try to load jsdom c. if that fails we try to load minidom d. if that fails too we throw an error, mentioning that no DOM implementation has been found and maybe the user forgot to install either jsdom or minidom. In this case maybe both jsdom and minidom should be defined as optional dependencies

This has some advantages:

  1. It would be trivial to upgrade to the new version as for most people all they'll have to do would be to npm i jsdom
  2. In case we are inside Electron's renderer process (which I admit might be a narrow use case perhaps) we don't even pay for minidom
  3. No additional entry points are added and the library could be fully backwards compatible once jsdom gets installed

What do you think?

domchristie commented 5 years ago

Nice ideas.

IMHO the best case scenario would be that the library maintains backwards compatibility but gets significantly less heavy.

This would be ideal.

Are there any important features missing in minidom currently?

Unfortunately minidom only supports methods and properties from DOM level 1, which is rather restricting. For example, quering is limited to getElementsByTagname (no querySelector / querySelectorAll), and attributes have to be accessed via getAttribute rather than accessing the property directly. These are probably only minor inconveniences, but enough to be confusing when developers are so used to using them when working with the DOM.

I would be happy to use that, but if possible I'd try to avoid releasing another entry point for turndown, the current situation is kind of complicated already.

I agree.

In this case maybe both jsdom and minidom should be defined as optional dependencies … No additional entry points are added and the library could be fully backwards compatible once jsdom gets installed.

Interesting. I'd not heard of optional dependencies. I suppose the issue is that JSDOM's API may change (which it has done in the last few years). This could be guarded against but it might start to get messy.

Just to go reframe the issue. I think there are three areas where the performance could probably be improved:

  1. npm i (downloading the excessive JSDOM dependencies)
  2. require-ing JSDOM (particularly when it's not needed to parse anything)
  3. Parsing HTML with JSDOM

A turndown-core package would effectively solve those, but I tend to agree that it's all a bit confusing. I'll have another think about how this might be better solved.

Thanks for your thought and input!

Vaelatern commented 5 years ago

For whatever it is worth, I use turndown exclusively from a webpage, so do not use nodejs. I only use npm to obtain the turndown source, for including in my webpage.

fabiospampinato commented 5 years ago

Interesting. I'd not heard of optional dependencies.

Actually those don't work as I thought they did, basically npm always tries to install them but if that fails the installation as a whole doesn't fail. What you want here are actually peer depedencies, which are dependencies npm doesn't try to install by itself but it gives a warning if they are unmet.

I suppose the issue is that JSDOM's API may change (which it has done in the last few years).

I think just locking down the peer dependency to a specific version, "jsdom": "1.2.3" or something perhaps a bit less restrictive, should be enough 🤔.

Also FYI I've just stumbled upon domino, a DOM level 4 implementation. The repo seems a bit on the heavy side (> 5mb) but most of that comes from the test directory.

Also maybe a new parser option could be added to turndown/turndown-core so that any custom DOM implementation could be used (as long as the core assumes at best a DOM level 1 implementation or something like that I suppose this should be doable):

new turndown ({
  parser: str => require ( 'my-dom-implementation' ).parse ( str )
}); 

Having still built-in support for Electron's renderer process and jsdom/minidom would still be nice though.

fabiospampinato commented 5 years ago

@domchristie I've just switched enex-dump to the minidom branch and the time to require turndown went from ~550ms to ~70ms, and the time for parsing a test .enex file went from ~100ms to ~50ms. Also because of the removal of jsdom and another heavy dependency of enex-dump the bundled version of Notable now weighs 11mb less.

That seems pretty good to me.

domchristie commented 5 years ago

@Vaelatern out of interest, which build are you importing, and how are you importing it? (I'm thinking in terms of https://github.com/domchristie/turndown/issues/265#issuecomment-447590252). If you're using a browser build, I suppose the main issue for you is the time is takes to run npm i?

Vaelatern commented 5 years ago

That, the difficulty of auditing my deps, and the extra size I never use but is baked in to my images.

I'm using turndown.browser.umd.js just because I picked one at random until it worked (I was also looking at the README while doing so), and I'm importing with a <script> tag in my webpage.

domchristie commented 5 years ago

Just to clarify, are you including <script src="PATH_TO/turndown.browser.umd.js"> directly in your page rather than including it (via require) as part of a compiled script? I'm wondering if there are any nice ways to import a turndown build other than the default.

Vaelatern commented 5 years ago

Correct, I'm using the script tag instead of using a compiled script. The only other JS on the page, in my case, is a HTML text editor.

bmewburn commented 5 years ago

I'm interested in the minidom branch. I see mention of it passing all tests so I assume it's stable enough. Is it compatible with the gfm plugin? Would you consider publishing this as a 'turndown-lite' version?

bmewburn commented 5 years ago

Another idea is just remove jsdom altogether and make the turndown service accept an optional html parser to replace a native parser. Of course when using in nodejs this would be required but I think anyone coming from a node background would be comfortable passing the dependency in.

domchristie commented 5 years ago

I'm interested in the minidom branch. I see mention of it passing all tests so I assume it's stable enough. … Would you consider publishing this as a 'turndown-lite' version?

The minidom branch is experimental, so I'd say it's reasonably stable :) Unfortunately minidom uses an outdated version of parse5, and hasn't been updated in over four years. I'm also wary of releasing a "lite" version as this would require additional maintenance, and might be confusing to those choosing which one to use.

Is it compatible with the gfm plugin?

No, not yet. It'd have to be updated to use the set of DOM methods that minidom supports.

Another idea is just remove jsdom altogether and make the turndown service accept an optional html parser to replace a native parser.

Yes, this is something I'll definitely consider. Thanks!

bmewburn commented 5 years ago

I made a minidom compatible gfm branch if anyone's interested. https://github.com/domchristie/turndown-plugin-gfm/issues/14

bmewburn commented 5 years ago

I'm finding about 700ms require time improvement with turndown minidom

fabiospampinato commented 5 years ago

@domchristie I really believe the next step should be to remove any built-in parsers from the library and add a "parser" option instead, as I mentioned before, so that I can use the built-in parser when the library is running inside a browser environment and provide my own parser (minidom, jsdom, whatever) when running in a node.js environment.

I'm going to hack something like this starting from your minidom branch, but it'd be great if this was an official package, like perhaps it could be published as turndown-core and then imported by turndown proper or something?

domchristie commented 5 years ago

perhaps it could be published as turndown-core and then imported by turndown proper or something?

Yes. My current thinking is along these lines:

  1. This repo would contain both turndown-core and turndown packages to avoid confusion as to where to log issues, and to save me maintaining two repos. (Most people will be using turndown, but bugs will likely be due to code in turndown-core.)
  2. turndown-core will only support parsed HTML as input but will include an option to hook into the conversation method (e.g. before). The result of this before hook will be converted to Markdown, which will allow for custom parsers
  3. The turndown package will import the turndown-core and add a before option to parse with JSDOM

The turndown implementation might look something like:

import { Service } from 'turndown-core'
import parse from './html-parser'

export default function TurndownService (options) {
  options.before = parse
  Service.call(this, options)
}

Let me know what you think.

nonara commented 4 years ago

Is this still being considered? The proposed updates sound great!

nonara commented 3 years ago

FYI - For those who need fast speed in markdown, I ended up writing a heavily optimized compiler built in TS for max speed centered around node (though it has browser compatibility as well).

The output format is visually very clean, also. It does some unique things like automatically numbering multi-level lists, not allowing excess running linefeeds, etc.

We're able to tear through gigabytes very quickly.

It's up and stable: node-html-markdown

We have a few enhancement issues open to round out some things, so if anyone is interested, I'd love to have a few extra hands!

PS. My intent isn't to denigrate this library or spam. turndown is a great library! This is simply to present an alternative in the hopes that it will be useful for those who need extreme perf in a node environment.

martincizek commented 3 years ago

FYI - For those who need fast speed in markdown, I ended up writing a heavily optimized compiler built in TS for max speed centered around node (though it has browser compatibility as well).

How does it compare to Turndown with node-html-markdown used as HTML parser? We were converting gigabytes of HTMLs too and the real bottleneck was actually in NodeJS startup, when used in "one CLI command per file" mode. The conversions were needed by a non-JS app, so we wrapped Turndown in a microservice in Node and it was lightning fast. The second bottleneck would have been probably JSDOM (now replaced with domino), but even with JSDOM it was faster than the database I/O.

I wonder what the conceptual difference is. Turndown is a library with a fixed HTML parser dependency. node-html-markdown is a library with a fixed HTML parser dependency too. And this opinionated dependency actually became the issue that @domchristie wants to address in the end. In practical terms, a user with their own DOM parser implementation would suffer from an unused dependency. node-html-markdown has apparently the same issue and it has just picked another HTML parser for Node. Or am I missing something?

Indeed, we can expect some performance penalty for the Turndown's flexible rule system, but I doubt it can make significant difference. Looking forward to some benchmarks.

Fingers crossed for your project anyway, it's good to have some competitors. :)

nonara commented 3 years ago

@martincizek I actually wrapped it up before the move to domino. Looks like you guys have made great strides in improving! Nicely done!

I wonder what the conceptual difference is.

My goal was two-fold...

First, I wanted to see how much performance I could get out of node by optimizing the code. That means things like:

That said, there are likely ways to further improve it down the line

The second goal was to produce a human readable format, adding things like:

  1. Proper (unbroken) indentation in multi-line list items
  2. Automatic list numbering
  3. A clean output in terms of consistent linefeeds

This makes for a very human-readable format without needing a markdown viewer.

This opinionated dependency actually became the issue that @domchristie wants to address in the end. In practical terms, a user with their own DOM parser implementation would suffer from an unused dependency.

The big issue for me lines up with the title of this issue - performance! But I get where you're coming from.

I would certainly be open to making the dependency swappable. In my perspective, though, given the benchmarks, I wouldn't see much benefit as the one we use seems to outperform the others and it also has a very small footprint.

Flattened out, it's simply 2 dependencies (one being he, which I believe we'd need anyway).

The total decompressed size is ~60k, so I'm not overly worried about the extra dependency. Though I do plan to add a rollup version for the browser which omits it.

The only real benefit I can see to allowing a custom supplied parser would be maybe to do something like have rust handle the parsing. Though, considering that this parser averages 2ms, I'm not sure how much difference that would make.

Would be curious to hear if you have other thoughts on this, though!

Looking forward to some benchmarks.

Just wrapped up the benchmarking tool. It's about 35% ± 3 faster.

Example:

-------------------------------------------------------------------------------

node-html-makrdown (reused instance): 43.7098 ms/file ± 25.5440 (2.15 MB/s)
node-html-markdown                  : 44.6477 ms/file ± 26.7243 (2.1 MB/s)
turndown (reused instance)          : 67.5310 ms/file ± 36.7826 (1.39 MB/s)
turndown                            : 71.5919 ms/file ± 36.7715 (1.31 MB/s)

-------------------------------------------------------------------------------

Estimated processing times (fastest to slowest):

  [node-html-makrdown (reused instance)]
    100 kB:  45ms
    1 MB:    465ms
    50 MB:   23.27sec
    1 GB:    7min, 56sec
    50 GB:   6hr, 37min, 3sec

  [turndown (reused instance)]
    100 kB:  70ms
    1 MB:    719ms
    50 MB:   35.94sec
    1 GB:    12min, 16sec
    50 GB:   10hr, 13min, 27sec

-------------------------------------------------------------------------------

Comparison to fastest (node-html-makrdown (reused instance)):

  node-html-markdown: -2.10%
  turndown (reused instance): -35.27%
  turndown: -38.95%

-------------------------------------------------------------------------------
martincizek commented 3 years ago

Nice tests. It really seems that the performance is getting close when the parser performance is getting closer. But even if the difference was 30% with the same parser engine, it'd be "roughly the same" for me and other criteria will decide then. When we were doing our big migration project, the backtranslation stability was the most important - i.e. no data and formatting loss, the output renders back to semantically equivalent HTML. The second was to have the output very similar to how a human user would write it.

So the other claims are now quite interesting for me:

Proper (unbroken) indentation in multi-line list items

Any example? The indentation in Turndown conforms to the CommonMark spec and its recommendation of having enough indenting keep the lists items. Although we've overriden the rule, as our users use minimalistic indenting. But minimalistic indenting has its drawbacks - the output may be reindented when a new item is inserted in an ordered list (the number can change from 9 to 10 and then a sublist will be indentet one more space, which can cause false diff if the output is version-tracked).

Automatic list numbering

This just works, even with sublist and list offset is respected too. What is the issue?

A clean output in terms of consistent linefeeds

Do you mean CR vs LF or creating softbreaks to keep the line length under e.g. 80 columns? The former is a quite a minor thing, the latter is quite difficult to achieve without eventually breaking the output.

nonara commented 3 years ago

Looks like I was tired last night! I goofed on my math. 😄 In terms of speed, the figure was actually 76% improvement in speed, which if you're doing big volume and paying for server time like we are, that ends up being fairly significant.

the output renders back to semantically equivalent HTML

Definitely important. Ours is reversible as well.

Any example? (indentation)

It's been a few months since I wrote this, during that time I tried several different libraries and found numerous gotchas in the various libraries out there. Among them were numerous indentation issues on multi-level lists and multi-line LI/OL items not doing proper leading indentation for new lines, which broke the lists.

This just works, even with sublist and list offset is respected too. What is the issue?

IIRC most libraries I saw did not actually output a real count number with ordered lists. They just used 1 repeatedly or something. Some broke the numbering on multi-level lists. I don't recall on how TD handled those, though.

Do you mean CR vs LF

In many libraries, repeating LFs was out of control. I can say that specifically for turndown, but that was using the cheerio branch which was broken in many other areas as it was out of date and unfinished, so perhaps that isn't an issue. I do recall having documents with 8+ blank lines.

What I mean by a clean output is that because we use a shared output buffer, we track linefeeds to produce a 'smart' result. So, for example, say we have two block elements in a row, we don't want a bunch of extra whitespace between them. This library has the ability to determine line separation between specific elements, and is generally mindful about readability in the output. (not sacrificing reversibility)

It really seems that the performance is getting close when the parser performance is getting closer. But even if the difference was 30% with the same parser engine, it'd be "roughly the same".

I haven't tested domino, but I doubt that parser difference is much of a factor now. Most of the time taken tends to lie in our compiler code.

I decided to drill down a little deeper last night into the individual sections of code, because the numbers weren't what I expected, given the considerable effort into JS optimization. There was one tiny function which calculated trailing whitespace using a regex. I replaced it with a proper for loop, and here are the new numbers:

Speed difference is now ~200%. It tends to perform right around 3x faster

-------------------------------------------------------------------------------

node-html-makrdown (reused instance): 21.9178 ms/file ± 12.1507 (4.29 MB/s)
turndown (reused instance)          : 63.5386 ms/file ± 30.7192 (1.48 MB/s)

-------------------------------------------------------------------------------

Estimated processing times (fastest to slowest):

  [node-html-makrdown (reused instance)]
    100 kB:  23ms
    1 MB:    233ms
    50 MB:   11.67sec
    1 GB:    3min, 59sec
    50 GB:   3hr, 19min, 6sec

  [turndown (reused instance)]
    100 kB:  66ms
    1 MB:    676ms
    50 MB:   33.82sec
    1 GB:    11min, 33sec
    50 GB:   9hr, 37min, 11sec

-------------------------------------------------------------------------------

This is far more reflective of what I would expect, although I believe by re-writing the parser, get it below 10ms. But I'll save rewrites for another time. Too much "real" work :)

martincizek commented 3 years ago

Alright, thanks for your insights! And please don't get me wrong, I indeed want to examine every claim that seems to be related to Turndown.