sembr / specification

Semantic Line Breaks Specification
https://sembr.org

Create a Tool to Automatically Insert Semantic Breaks for Overflowing Lines #1

Open mattt opened 5 years ago

mattt commented 5 years ago

Without tooling, a specification like this one can only be descriptive, like a style guide. As a future enhancement, it would be nice to build a tool that automatically inserts line breaks at semantic boundaries for lines that extend beyond a prescribed width (e.g. 80 columns). Like a cross between prettier and fold.

waldyrious commented 5 years ago

Some years ago I created a very crude proof-of-concept to do precisely this. It can be tested here: https://waldyrious.github.io/semantic-linebreaker/, hosted directly from the code in https://github.com/waldyrious/semantic-linebreaker/.

It is very primitive, as the README (and the code) attest, and there are open issues to handle, but any help would be greatly appreciated and quickly merged.

waldyrious commented 5 years ago

(Btw, I'd be happy to move the repo to this organization, if that's desired.)

mattt commented 5 years ago

Thanks so much for sharing that, @waldyrious!

I think regular expressions offer a quick and clever approximation of the kind of line-breaking behavior we're looking for. However, I don't think it's feasible to consistently express semantic boundaries with them.
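
As a rough illustration of the regular-expression approach discussed above, here is a minimal Python sketch. It is not the code from semantic-linebreaker; the two patterns, the function names, and the two-pass strategy (sentence ends first, clause punctuation second) are all assumptions made for this example.

```python
import re

# Hypothetical boundary patterns (assumptions, not taken from any project):
# whitespace that follows sentence-ending or clause-level punctuation.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
CLAUSE_END = re.compile(r"(?<=[,;:])\s+")


def break_line(line: str, width: int = 80) -> list[str]:
    """Split one overflowing line at punctuation-based boundaries."""
    if len(line) <= width:
        return [line]  # lines that fit are left untouched
    result = []
    for sentence in SENTENCE_END.split(line):
        if len(sentence) <= width:
            result.append(sentence)
        else:
            # Still too long: fall back to clause-level punctuation.
            result.extend(CLAUSE_END.split(sentence))
    return result


def break_overflowing(text: str, width: int = 80) -> str:
    """Apply break_line to every line of a text."""
    lines = []
    for line in text.splitlines():
        lines.extend(break_line(line, width))
    return "\n".join(lines)
```

The sketch also shows why punctuation alone is unreliable: given `"Wrap at a width (e.g. 80 columns) always."` and a width of 30, it breaks after `(e.g.`, because the pattern cannot tell an abbreviation from the end of a sentence.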

My current thinking is that a complete solution would probably have to apply an algorithm like Knuth-Plass or Wadler ("prettier printer"), feeding in tokens from a linguistic syntax tree.
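
To make the idea of global break selection concrete, here is a small Python sketch. It is neither Knuth-Plass nor Wadler's algorithm, only a simplified dynamic program in the same spirit: every gap between words gets a break penalty, and the set of breaks with the lowest total cost wins. The penalty values, the `overflow_cost` weight, and all function names are assumptions for illustration, and `punctuation_penalties` is a stand-in for penalties that would really come from a linguistic syntax tree.

```python
def punctuation_penalties(words: list[str]) -> list[int]:
    """Hypothetical penalties for breaking after each word except the last."""
    penalties = []
    for word in words[:-1]:
        if word.endswith((".", "!", "?")):
            penalties.append(1)    # sentence end: cheapest place to break
        elif word.endswith((",", ";", ":")):
            penalties.append(5)    # clause boundary
        else:
            penalties.append(50)   # mid-phrase: strongly discouraged
    return penalties


def choose_breaks(words, penalties, width=80, overflow_cost=1000):
    """Return indices of the words after which to break, minimizing total cost."""
    n = len(words)
    # best[i] holds (cost, breaks) for the cheapest layout of words[i:].
    best = [None] * (n + 1)
    best[n] = (0, [])
    for i in range(n - 1, -1, -1):
        candidates = []
        length = -1  # cancels the space counted before the first word
        for j in range(i, n):  # candidate line holds words[i..j]
            length += len(words[j]) + 1
            cost = overflow_cost * max(0, length - width)
            here = []
            if j < n - 1:  # a break after the final word costs nothing
                cost += penalties[j]
                here = [j]
            tail_cost, tail_breaks = best[j + 1]
            candidates.append((cost + tail_cost, here + tail_breaks))
        best[i] = min(candidates, key=lambda c: c[0])
    return best[0][1]


def render(words, breaks):
    """Join words into lines according to the chosen break indices."""
    lines, start = [], 0
    for b in breaks:
        lines.append(" ".join(words[start:b + 1]))
        start = b + 1
    lines.append(" ".join(words[start:]))
    return lines
```

Because a line that already fits costs nothing, only overflowing lines receive breaks. With a narrower width the same input picks up additional breaks at the next-cheapest boundaries rather than at arbitrary columns.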

SilasK commented 12 months ago

Are there any updates on this?

SilasK commented 10 months ago

Here are some options:

silopolis commented 10 months ago

That readable project indeed looks nice and promising! Thanks for sharing 🙏

waldyrious commented 10 months ago

> That readable project indeed looks nice and promising!

Agreed! For reference, the actual implementation is here.

SilasK commented 10 months ago

Does anyone know enough TypeScript to deactivate the rest of the formatter? https://github.com/bobheadxi/readable/issues/30

admk commented 10 months ago

This is an itch that hasn't been scratched for a long time!

Semantic line breaking is by nature an NLP problem, so I fine-tuned BERT models as token classifiers to predict line breaks (and, surprise 😮, indent levels) on my text, and it works reasonably well. I have created a tool that uses these models to insert breaks automatically in your text. CUDA (Linux / Windows) or MPS (Mac) acceleration is supported. Currently it works well for LaTeX and plain text; other markup languages are not tested.

The fine-tuned models can be found here on Hugging Face.

Suggestions for improvements and contributions to features, models, or datasets are all welcome! Feel free to explore and contribute to the project: https://github.com/admk/sembr.

silopolis commented 10 months ago

That's playing in another league! Very, very interesting project indeed... With the amount of lightweight markup content produced these days, support for Markdown, AsciiDoc and reStructuredText would surely be fantastic!

Awesome work 🤩