mity / md4c

C Markdown parser. Fast. SAX-like interface. Compliant to CommonMark specification.
MIT License

Incremental / Block-wise input #148

Closed vanrein closed 3 years ago

vanrein commented 3 years ago

I like the minimalism of your design! I was looking for a simple Markdown parser to use as a front end for terminal rendering with ANSI colour, and this seems to be it. With that sort of tool piped into a $PAGER, commands that produce Markdown output can be read pleasantly on the command line.

If it had an incremental or block-wise input mode, basically md_open(), md_write(), md_close(), it would be possible to decode Markdown while text is being produced. Have you considered such an option?

I realise that sometimes you need to look ahead to know which callbacks to produce; for instance, a title followed by ==== or a buffer cut halfway through an entity such as &amp;. If you want to avoid allocation, one option might be to make md_write() return the number of bytes processed, so that the remainder gets offered again in the next run, hopefully extended with more input. The caller would then shift the remainder to the start of the buffer, read more, and try again, hopefully with the &amp; now complete.
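To make the "return the number of bytes processed" idea concrete, here is a rough caller-side sketch. None of it is MD4C API: `toy_md_write()` is a hypothetical stand-in for the proposed md_write() that only consumes whole lines, and `FeedBuffer` and `feed()` are names invented for this illustration. A real parser would have to hold back more than a partial line (for example, a paragraph line that might still be followed by ====), so this only shows the shift-and-retry mechanics.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the proposed md_write(); NOT part of MD4C.
 * It "processes" whole lines only and reports how many bytes it consumed. */
static size_t toy_md_write(const char *buf, size_t len)
{
    size_t consumed = 0;
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == '\n')
            consumed = i + 1;   /* everything up to this newline is done */
    }
    return consumed;
}

/* Caller-side state: one fixed buffer, no allocation (illustrative type). */
typedef struct {
    char   data[256];
    size_t used;
} FeedBuffer;

/* Append a chunk, offer the whole buffer to the consumer, then shift the
 * unconsumed remainder to the front so it is offered again next time.
 * Returns the number of bytes consumed, or (size_t)-1 if the chunk does
 * not fit into the remaining space. */
static size_t feed(FeedBuffer *fb, const char *chunk, size_t n)
{
    assert(fb != NULL && chunk != NULL);
    if (n > sizeof fb->data - fb->used)
        return (size_t)-1;

    memcpy(fb->data + fb->used, chunk, n);
    fb->used += n;

    size_t consumed = toy_md_write(fb->data, fb->used);
    memmove(fb->data, fb->data + consumed, fb->used - consumed);
    fb->used -= consumed;
    return consumed;
}
```

With this, feeding a chunk that ends in the middle of &amp; leaves the partial line in the buffer, and the next chunk completes it before it is consumed.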

Do you agree that this might be useful?

mity commented 3 years ago

> If it had an incremental or block-wise input mode, basically md_open(), md_write(), md_close(), it would be possible to decode Markdown while text is being produced. Have you considered such an option?

Yes, I considered it when starting the project. Unfortunately, after studying the CommonMark specification more closely, I came to the conclusion that it would bring no real benefit.

To parse Markdown you have to make at least two passes over the complete document. For example, in general, it's not possible to do inline analysis of any paragraph until you've collected all link reference definitions in the document. Similarly, you cannot decide whether a list is loose or tight until you have seen all of it. (And a list can be of arbitrary length and can form a complete document too.) Also, some other implementations support further features which need this (e.g. footnotes), and those may eventually be supported in MD4C as well.

I.e., with such an API, the implementation would have to remember all the input it's fed with anyway, instead of some streaming-like processing.
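In practice that means a pipeline filter ends up reading all of its input first and parsing once at the end. A minimal sketch of that buffering step, assuming input arrives on a `FILE *`; `read_all()` is a hypothetical helper written for this illustration, not something MD4C provides:

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper (not part of MD4C): read a stream to EOF into one
 * heap buffer. Returns NULL on allocation or read failure; on success the
 * caller owns the buffer and must free() it. */
static char *read_all(FILE *in, size_t *out_len)
{
    size_t cap = 4096, len = 0;
    char *buf = malloc(cap);
    if (buf == NULL)
        return NULL;

    for (;;) {
        size_t n = fread(buf + len, 1, cap - len, in);
        len += n;
        if (n == 0)
            break;                      /* EOF or error, checked below */
        if (len == cap) {               /* buffer full: double it */
            char *grown = realloc(buf, cap * 2);
            if (grown == NULL) {
                free(buf);
                return NULL;
            }
            buf = grown;
            cap *= 2;
        }
    }

    if (ferror(in)) {
        free(buf);
        return NULL;
    }
    *out_len = len;
    return buf;
}
```

The resulting buffer and length can then be handed to a single md_parse() call together with the MD_PARSER callbacks, so that a link reference definition on the last line is already known when the first paragraph is analysed.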

vanrein commented 3 years ago

> To parse Markdown you have to make at least two passes over the complete document. For example, in general, it's not possible to do inline analysis of any paragraph until you've collected all link reference definitions in the document. Similarly, you cannot decide whether a list is loose or tight until you have seen all of it. (And a list can be of arbitrary length and can form a complete document too.) Also, some other implementations support further features which need this (e.g. footnotes), and those may eventually be supported in MD4C as well.

Thanks for explaining that. It is convincing, even if a subclass of documents might have more locality, especially in the kind of applications that I have in mind. That locality is what I was counting on, but after first thinking "let's quickly put a Markdown parser together" I concluded that it wasn't trivial.

It sounds like unrestricted Markdown isn't suitable for pipeline filtering, as I am doing now, but still quite useful for batch-mode processing of output. And that is precisely what you are doing. Yes, that makes complete sense to me now.

> I.e., with such an API, the implementation would have to remember all the input it's fed with anyway, instead of some streaming-like processing.

...and the API was designed to allow precisely that style of use. Yes, this is proper design, my compliments and thanks again.