razziel89 / mdslw

Prepare your markdown for easy diff'ing!
GNU General Public License v3.0
13 stars 1 forks source link

Definition lists #14

Closed bugabinga closed 5 months ago

bugabinga commented 5 months ago

Currently, definition lists get mangled by mdslw. I realize, that there are a lot of markdown extensions out there, but I could not find an explicit statement about scope in the README.md (CommonMark?).

Is this a bug or a feature :) ?

razziel89 commented 5 months ago

Hi and thanks for your interest in mdslw!

The markdown parser used in the back is pulldown-cmark, which parses CommonMark with some extras. Yes, I would say that mdslw aims to support CommonMark at the least. However, I am very happy to support other extensions out there as long as they a) don't mess with the default CommonMark support, and b) are not too horrible to implement. Unfortunately, it does not look like pulldown-cmark supports definition lists.

Looking at definition lists, I'm wondering how best to implement detecting them. Basically, for every new block of text, mdslw would have to check whether all lines apart from the first start with a colon while the first one has to be just text. As soon as one other line does not start with a colon, the block cannot be a definition list, right? Do you know of a simple reference implementation that could be helpful?

Furthermore, what should mdslw do in case some lines in the definition lists are longer than the desired maximum line width? It should probably keep them as they are because, once wrapped, they would no longer be definition lists, right?

In any case, I would consider this an optional feature that could be enabled. What do you think? Any feedback is appreciated :smiley:.

bugabinga commented 5 months ago

Looking at definition lists, I'm wondering how best to implement detecting them. Basically, for every new block of text, mdslw would have to check whether all lines apart from the first start with a colon while the first one has to be just text. As soon as one other line does not start with a colon, the block cannot be a definition list, right? Do you know of a simple reference implementation that could be helpful?

This sounds like it would get messy really fast, especially when other extensions come along. The parser I am using in this case seems to be goldmark via the glow command line tool. https://github.com/yuin/goldmark/blob/master/extension/definition_list.go Definition Lists seem to be one of the extensions enabled by default there.

Furthermore, what should mdslw do in case some lines in the definition lists are longer than the desired maximum line width? It should probably keep them as they are because, once wrapped, they would no longer be definition lists, right?

This is an excellent question! As far as I can tell, the idea is that, similar as with bullet lists, indentation via spaces marks a paragraph. https://michelf.ca/projects/php-markdown/extra/#def-list


While thinking about this some more, I noticed that my issue arose from the fact, that mdslw not only splits sentences, but also collapses whitespace. In my mind, it is predominantly the line splitting which makes the document "easier to diff".

What if there was a command line option to prevent joining of newlines? E.g. mdslw --join never or mdslw --no-join.

razziel89 commented 5 months ago

This sounds like it would get messy really fast, especially when other extensions come along.

Agreed. The number of additional extensions that mdslw can support in the long term is small, mostly due to time constraints.

The parser I am using in this case seems to be goldmark via the glow command line tool. https://github.com/yuin/goldmark/blob/master/extension/definition_list.go Definition Lists seem to be one of the extensions enabled by default there. [...]

This is an excellent question! As far as I can tell, the idea is that, similar as with bullet lists, indentation via spaces marks a paragraph. https://michelf.ca/projects/php-markdown/extra/#def-list

Thanks for those links, they were quite insightful. It seems like definition lists are a bit more free-form than I would have thought initially, meaning that my initial suggestion of parsing them will not work. I do have another idea that would make it very easy to support other extensions that I know, too, but I'm not sure when I'll find the time to test that idea.

While thinking about this some more, I noticed that my issue arose from the fact, that mdslw not only splits sentences, but also collapses whitespace. In my mind, it is predominantly the line splitting which makes the document "easier to diff".

What if there was a command line option to prevent joining of newlines? E.g. mdslw --join never or mdslw --no-join.

That feature is built-in at quite a deep level. I can have a look whether deactivating it is possible, but I'm not sure how hard it would be to implement. I'm also reasonably convinced that the resulting documents would look sub-optimal, in my personal view, which would make it an optional feature. For example this:

Lorem ipsum dolor sit amet. Consectetur adipiscing elit. Sed do eiusmod tempor
incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam.

would become this:

Lorem ipsum dolor sit amet.
Consectetur adipiscing elit.
Sed do eiusmod tempor
incididunt ut labore et dolore magna aliqua.
Ut enim ad minim veniam.

Is that what you mean? I'm not sure how it would help with definition lists. For example this:

Lorem ipsum
: Dolor sit amet. Consectetur.
: Adipiscing elit. Sed do eiusmod tempor
incididunt.

Ut labore et dolore

: magna aliqua. Ut enim ad minim veniam.

would become this:

Lorem ipsum
: Dolor sit amet.
Consectetur.
: Adipiscing elit.
Sed do eiusmod tempor
incididunt.

Ut labore et dolore

: magna aliqua.
Ut enim ad minim veniam.

I'm not sure whether that's still a valid definition list. What do you think? And thanks for the discussion! I'm always curious to learn about useful markdown extensions.

razziel89 commented 5 months ago

Hi @bugabinga. I've just released version 0.8.0 that contains the new retain-whitespace feature that implements what I understood from our discussion above. With it enabled, mdslw will no longer collapse any whitespace, including line breaks. What do you think?

bugabinga commented 5 months ago

Personally, I find this new feature a big improvement, because I can use it as an "escape hatch" of sorts. Thanks a lot!


In principle, I would much prefer a formatter in the style of gofmt or rustfmt (0 or minimal configuration options), but I do not think that is feasible with regard to all those extensions to markdown out there.

I'm also reasonably convinced that the resulting documents would look sub-optimal, in my personal view, which would make it an optional feature

I agree. I view this feature as a compromise. Even with this feature, your previous critique holds true. Definitions, that require line breaks (e.g. because of line length or end markers), technically still get broken, even with the new feature enabled. The "correct" formatting, I believe, would be: With retain-whitespace:

Lorem ipsum
: Dolor sit amet.
  Consectetur.
: Adipiscing elit.
  Sed do eiusmod tempor
  incididunt.

Ut labore et dolore

: magna aliqua.
  Ut enim ad minim veniam.

Without retain-whitespace:

Lorem ipsum
: Dolor sit amet.
  Consectetur.
: Adipiscing elit.
  Sed do eiusmod tempor incididunt.
Ut labore et dolore
: magna aliqua.
  Ut enim ad minim veniam.

Multi-line definitions would get space-indented, just like multi-line bullet list items do. However, in my case, retain-whitespace works well enough, that I can work around my use cases.


On a different note, the name retain-whitespace seems too general. From what I observe, only newline ever gets retained?

razziel89 commented 5 months ago

Personally, I find this new feature a big improvement, because I can use it as an "escape hatch" of sorts.

Thanks a lot!

You are welcome! It's nice to know that this idea of mine (mdslw) is useful for others, too.

In principle, I would much prefer a formatter in the style of gofmt or rustfmt (0 or minimal configuration options), but I do not think that is feasible with regard to all those extensions to markdown out there.

I fully agree! The default values are currently chosen in such a way that the amount of configuration needed is minimal, or at least that's the intention. My hope is that mdslw can be used by as many people as possible without having to configure anything.

I agree.

I view this feature as a compromise.

Even with this feature, your previous critique holds true.

Definitions, that require line breaks (e.g. because of line length or end markers), technically still get broken, even with the new feature enabled. […] Multi-line definitions would get space-indented, just like multi-line bullet list items do.

The way I understood the docs for definition lists was that indentation was optional (but definitely desirable).

On a different note, the name retain-whitespace seems too general.

From what I observe, only newline ever gets retained?

Enabling the feature should cause all existing white space to be preserved, at least that was the intention. I haven’t tested it extensively, TBH, but enabling that feature circumvents the function that merges all consecutive white space within a paragraph.

bugabinga commented 5 months ago

Enabling the feature should cause all existing white space to be preserved, at least that was the intention.

Ah, that is curious. Upon further testing I noticed, that in general you are right. I just happened to stumble upon an exceptional case, where it still does collapse whitespace.

With enabled feature, the following paragraph

Lorem ipsum
:         Dolor sit amet.
Consectetur.
:           Adipiscing elit.
Sed do eiusmod tempor
incididunt.

Ut labore et dolore

:        magna aliqua.
Ut enim ad minim veniam.

becomes

Lorem ipsum
: Dolor sit amet.
Consectetur.
: Adipiscing elit.
Sed do eiusmod tempor
incididunt.

Ut labore et dolore

: magna aliqua.
Ut enim ad minim veniam.

Still, just for the record; my personal preference would be to retain newlines, but still collapse other whitespace.