Open mattt opened 5 years ago
Some years ago I created a very crude proof-of-concept to do precisely this. It can be tested here: https://waldyrious.github.io/semantic-linebreaker/, hosted directly from the code in https://github.com/waldyrious/semantic-linebreaker/.
It is very primitive, as the README (and the code) attest, and there are issues to be handled, but any help would be greatly appreciated and quickly merged.
(Btw, I'd be happy to move the repo to this organization, if that's desired.)
Thanks so much for sharing that, @waldyrious!
I think regular expressions offer a quick and clever approximation of the kind of line-breaking behavior we're looking for. However, I don't think it's feasible to consistently express semantic boundaries with them.
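To make the trade-off concrete, here is a minimal sketch of the regex approach. It is not the code from semantic-linebreaker; the pattern and the function name `naive_semantic_breaks` are assumptions made up for illustration. It breaks after sentence-ending or clause punctuation whenever whitespace follows.

```python
import re

# Hypothetical pattern: whitespace that directly follows sentence-ending
# punctuation (. ! ?) or clause punctuation (, ; :).
BOUNDARY = re.compile(r"(?<=[.!?,;:])\s+")


def naive_semantic_breaks(text: str) -> str:
    """Replace the whitespace after each punctuation mark with a newline."""
    return BOUNDARY.sub("\n", text.strip())


# Works on simple prose:
print(naive_semantic_breaks("We tried it, and it worked. Then we stopped."))

# But an abbreviation is treated as a sentence end, which shows the limits
# of a purely punctuation-based rule:
print(naive_semantic_breaks("Use a limit, e.g. 80 columns."))
```

The second call breaks after "e.g.", which is the kind of false boundary a regex cannot reliably rule out without knowing something about the sentence structure.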
My current thinking is that a complete solution would probably have to apply an algorithm like Knuth-Plass or Wadler ("prettier printer"), feeding in tokens from a linguistic syntax tree.
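As a rough illustration of that idea, below is a much-simplified dynamic-programming sketch in the spirit of Knuth-Plass. It is not the full algorithm (no glue, stretch, or badness), and the penalties are hand-assigned from trailing punctuation as a stand-in for tokens from a real syntax tree. The names `break_penalty` and `optimal_breaks` and the penalty values are assumptions chosen for the example.

```python
import math


def break_penalty(word: str) -> float:
    """Hypothetical cost of breaking the line after `word`."""
    if word.endswith((".", "!", "?")):
        return -50.0   # sentence end: breaking here is rewarded
    if word.endswith((",", ";", ":")):
        return 10.0    # clause boundary: acceptable when needed
    return 100.0       # plain space: last resort


def optimal_breaks(text: str, width: int = 80) -> str:
    """Choose the set of breaks with the lowest total penalty that fits `width`."""
    words = text.split()
    n = len(words)
    if n == 0:
        return ""
    best = [math.inf] * (n + 1)  # best[j]: lowest cost to set words[:j]
    prev = [0] * (n + 1)         # prev[j]: start index of the line ending at j
    best[0] = 0.0
    for j in range(1, n + 1):
        length = -1
        for i in range(j - 1, -1, -1):
            length += len(words[i]) + 1
            # A single overlong word is still allowed on its own line.
            if length > width and j - i > 1:
                break
            # The end of the text is not a break, so it carries no penalty.
            penalty = 0.0 if j == n else break_penalty(words[j - 1])
            cost = best[i] + penalty
            if cost < best[j]:
                best[j] = cost
                prev[j] = i
    lines = []
    j = n
    while j > 0:
        i = prev[j]
        lines.append(" ".join(words[i:j]))
        j = i
    return "\n".join(reversed(lines))


print(optimal_breaks("We tried it, and it worked. Then we stopped.", width=20))
```

Unlike a greedy fold, this considers all feasible break sets at once, so a cheap clause break early in a sentence can be preferred over an expensive mid-phrase break later on. Replacing `break_penalty` with scores derived from a parser is where the real work would be.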
Are there any updates on this?
Here are some options:
Readable: A promising tool, but I don't like the included formatter, as it distorts other elements in my Markdown.
Obsidian Sembr (archived): Another tool that offers semantic line break support for the Obsidian note-taking app.
That readable project indeed looks nice and promising! Thanks for sharing 🙏
Agreed! For reference, the actual implementation is here.
Does someone know enough TypeScript to deactivate the rest of the formatter? https://github.com/bobheadxi/readable/issues/30
This is an itch that hasn't been scratched for a long time!
Semantic line breaking is by nature an NLP problem, so I fine-tuned BERT models as token classifiers to predict line breaks (and, surprise 😮, indent levels) on my text, and it works reasonably well. I have thus created a tool that uses these models to insert breaks automatically in your text. CUDA (Linux / Windows) and MPS (Mac) acceleration are supported. Currently it works well for LaTeX and plain text; other markup languages are not tested.
The fine-tuned models can be found here on Hugging Face.
Suggestions for improvements and contributions to features, models, or datasets are all welcome! Feel free to explore and contribute to the project: https://github.com/admk/sembr.
That's playing in another league! Very, very interesting project indeed... With the amount of lightweight markup content produced these days, support for Markdown, AsciiDoc and reStructuredText would surely be fantastic!
Awesome work 🤩
Without tooling, a specification like this one can only be descriptive, like a style guide. As a future enhancement, it would be nice to build a tool that automatically inserts line breaks at semantic boundaries for lines that extend beyond a prescribed width (e.g. 80 columns). Like a cross between prettier and fold.
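A minimal sketch of what such a fold-like tool might do, assuming punctuation is a good enough proxy for semantic boundaries (which it often is not). Lines within the width are left alone; longer lines are cut at the strongest boundary that still fits. The names `BOUNDARIES`, `find_cut`, `fold_line` and `semantic_fold` are hypothetical and not taken from any existing tool.

```python
import re
from typing import List, Optional

# Candidate boundaries, strongest first: sentence end, clause, any whitespace.
BOUNDARIES = [
    re.compile(r"(?<=[.!?])\s+"),
    re.compile(r"(?<=[,;:])\s+"),
    re.compile(r"(?<=\S)\s+"),
]


def find_cut(line: str, width: int) -> Optional[re.Match]:
    """Return the last cut of the strongest boundary class that fits in `width`."""
    for pattern in BOUNDARIES:
        cuts = [m for m in pattern.finditer(line) if m.start() <= width]
        if cuts:
            return cuts[-1]
    # Nothing fits: fall back to the first whitespace, leaving an overlong line.
    return BOUNDARIES[-1].search(line)


def fold_line(line: str, width: int = 80) -> List[str]:
    """Split one line into pieces no wider than `width` where possible."""
    line = line.rstrip()
    out = []
    while len(line) > width:
        cut = find_cut(line, width)
        if cut is None:  # a single unbreakable word
            break
        out.append(line[:cut.start()])
        line = line[cut.end():]
    out.append(line)
    return out


def semantic_fold(text: str, width: int = 80) -> str:
    """Apply fold_line to every line; short lines pass through unchanged."""
    return "\n".join(
        part for line in text.split("\n") for part in fold_line(line, width)
    )


print(semantic_fold("One two three, four five six. Seven eight nine ten.", width=20))
```

This greedy version only touches lines that exceed the width, which matches the "cross between prettier and fold" description, but it inherits every weakness of punctuation-based boundary detection.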