pandoc / lua-filters

A collection of lua filters for pandoc
MIT License
603 stars 165 forks source link

Adding non-breakable spaces lua filter #119

Open Delanii opened 3 years ago

Delanii commented 3 years ago

This adresses #114 .

Its first pull request I have ever made, so hopefully I did not messed anything up. I know it is a whole bunch of files, which should be avoided, but I think that in this case it should be put-and-packaged as one ...

Per #114 it still has not implemented the change in tables and presence-check, since I am having trouble understanding that syntax (idiom) and using it. Hopefully it doesnt mind too much.

Regards, Tomas

Delanii commented 3 years ago

OK, it seems that somehow I did not upload the makefile?

I have added makefile to my fork, which looks like added the makefile also here ... Well, I have to learn about git and GitHub :D

alerque commented 3 years ago

On GitHub a Pull Request is a request to merge one branch into another, in this case your fork's master branch into this repository's. You can update PR's by adding, removing, or rebasing commits in a branch. Typically (and to get more GitHub tooling) you'd open a PR from a branch on your fork other than master. In that case maintainers of the target project can also make changes to the PR before merging. As it is only you can, but anything you do in your master branch will continue to change this PR until it is merged or closed.

Delanii commented 3 years ago

Oh, I wasnt sure if I should create new branch in my fork or not. Next time I do better. Thank you for clarification.

Do I understand you correctly, that now nobody (except me) can make edit to this PR? I thought, that the "Allow edits by maintainers" checker is to control this. I have it checked, which I hoped is for making it possible for project maintainers to correct any errors in this PR (well, if they prefer that over telling me what should I fix).

alerque commented 3 years ago

I thought, that the "Allow edits by maintainers" checker is to control this. I have it checked

Really? Last I checked (less than a week ago) that checkbox was disabled for all PR's from master branches. If you see the box and have it checked then GitHub has updated something recently, and that's a nice fix that I'm glad to hear about.

Delanii commented 3 years ago

I have used this manual:

Pull request from a fork

Step no. 7 is the one which we are talking about, I hope. Except I did the exact opposite: Left it checked to allow maintainers to do with this PR anything they deem neccessary ...

Btw. It seems that bibliography test are misbehaving; maybe because of newest update?

stroobandt commented 3 years ago

Interesting. Today, I was looking at some legacy production code with the exact same purpose; thinking by myself that I better replace it by a Lua filter. The current code is a monster of a nested sed that lives in my makefile.

The sed command creates secondary sed substitution commands from a list of so called short begin words. These secondary sed commands will replace the ordinary space after a short begin word at the beginning of a sentence with a non-breaking space.

The list of short begin words is language dependent. Here are the English and Dutch lists.

Finally, we do not want such substitutions to occur in <PRE>-formatted code blocks. That is where the /```/,/```/! is for.

Delanii commented 3 years ago

@stroobandt Thank you for posting that!

Sadly, I am not enough versed in makefiles or sed to make much of it, but from the lists I see that you went in simillar direction. I am actually not sure how to implement different behavior according to language (well, I would have to read somehow the metadata variable lang).

If you are concerned about code-block (I was too) I have tested that (and it is even in test file of the PR) - codeblocks and inline code is recognized well and no space substitution happens there.

tarleb commented 3 years ago

Thanks for the updates! I'll take a closer look soon, I just need to fix the tests first. This may take a while, as I'm going to spend less time in front of the screen for the next couple of days.

The filter name "nonbreakingspace" is a bit generic and somewhat cryptic. Can we find something more specific? How about using the same name as the TeX package, "vlna"? Would that be too confusing? Maybe "west-slavic-nbsp"? I'm open for ideas.

Delanii commented 3 years ago

Well, "pandoc-vlna" in itself could be good, but I would think that it would introduce different "crypticity;" that is, that LaTeX packages will be known only to LaTeX users; and most probably to the subset that doesnt already used theyre own homemade solutions (I would guess that those with homemade solutions will be actually plenty).

I thought that naming the filter by what it does as much as possible will prevent its name from being misleading. Very often I am taking inspiration from lua filter pagebreak - which does exactly what it is named after: inserts pagebreak.

But in the spirit of your previous comment, also "insert-nbsp" (or no-abbreviated form "insert-nonbreakable-space") could work too. I would like to aviod specifying any nationalities or geografic locations to the filter, because it might be offputting to someone (or even any way worse). The code state is mostly caused by me being lua newbie and natively Czech, which leads me to make the default setting suitable for Czech language requirements. I hope that the filter as it is is customizable enough for anybody to make changes according to his native language requirements.

If anybody would know how to make this filter conformant to more than one language, I would be happy to make more improvements. In this regard, my only idea is to have multiple (well, depending on how many languages a lot) tables for each language and then (somehow) read metadata lang variable. But that is highly beyond my linguistic and pandoc-hacking skills ...

@tarleb Let me know which identification is more suitable:

At the end of the day, as famous classic says:

"A rose by any other name would smell as sweet"

... as long as its well described in README.md (hopefully it is).

Regards, Tomas

tarleb commented 3 years ago

I'd like to ask a favor. I compiled a new contributing guide over the last two weeks and would like to get some feedback on it. I'd appreciate if you could take a look at the document and tell me about all passages that are unclear or that you feel should be improved for other reasons.

I am not commenting on the code yet, as I'm also trying to evaluate if the information I added to the guide is sufficient to prevent some common issues which often come up during code review. Hope that's ok with you?

As for the name: each options seems fine. Since Czech is the only supported language, we could also be more explicit about the filter's purpose and include czeck in the name. I'll leave the choice to you.

Delanii commented 3 years ago

Considering the new guide, everything mentioned seems pretty completely clear and OK to me. I can accomodate for the ammount of characters per line. Editorconfig seems like a great suggestion, although I am not familiar with that utility. I take another look at the diagram-generator filter and try to accomodate for setting this fiter according to lang metadata value. I can try to make another set of rules for english (and make that actually default). That could even make it valid to name the filter pandoc-vlna (well, in my opinion). I let you know when all changes are finished (could be pretty fast).

Delanii commented 3 years ago

@tarleb I have finished changes according to the newest contributing guide. All should be good to go! Only thing I realized that in case of Spans or Divs it doesnt check, if those have set theyre lang (attribute?) differently. It is something I can work on when there will be more languages supported (for two languages it might not make much sense). The only thing is that I am having trouble changing the directory name itself from nonbreakablespace to pandocVlna (now I dare to name it that way) ... :)

Delanii commented 3 years ago

Added another update with all improvements suggested in review. Also updated filter file header according to new contribution guide.

netique commented 3 years ago

Many thanks for this filter, @Delanii! I wonder if it is possible to add the % character to the list. In Czech, we write non-breakable space between the number and % sign (when percentages are used as an adjective, there is no space). I've tried to change the filter, but failed miserably. Maybe % needs to be escaped as \%, but that fails too (as well as \\% or \\\% does). Help would be greatly appreciated!

badumont commented 3 years ago

Just a side note: in lua regexp, % is the escape char, so you have to type %% so that it escapes itself.

I wonder if this filter could be made more general by creating some metadata fields taking a list a characters to be associated with a non-breakable space. Something like "non-breakable-space-before", "non-breakable-space-after", "thin-space-before", "thin-space-after". Of course it would be complementary to the built-in configuration files per language that you mentioned. What are your thoughts about that?

Le Thursday 06 May 2021 à 03:31:12AM, Jan Netík a écrit :

Many thanks for this filter, @Delanii! I wonder if it is possible to add the % character to the list. In Czech, we write non-breakable space between the number and % sign (when percentages are used as an adjective, there is no space). I've tried to change the filter, but failed miserably. Maybe % needs to be escaped as \%, but that fails too (as well as \% or \\% does). Help would be greatly appreciated!

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or unsubscribe.*

Delanii commented 3 years ago

Many thanks for this filter, @Delanii! I wonder if it is possible to add the % character to the list. In Czech, we write non-breakable space between the number and % sign (when percentages are used as an adjective, there is no space). I've tried to change the filter, but failed miserably. Maybe % needs to be escaped as \%, but that fails too (as well as \\% or \\\% does). Help would be greatly appreciated!

Hello @netique ,

well, this filter doesnt cover it all. I admit I have settled up to dealing with things that were the easiest -- just list few prepositions and deal with split numbers. The percent character could be thought about as a unit, which has actually the opposite requirement -- has to be coupled with preceeding text element (number, as in "10 %"), not with the following one (word, as in "10 apples").

As I wrote in the beginning, I was and still am a lua newbie, so by cutting off units, chapters (ex: "chapter 9") and such, I have made creating this filter a lot easier.

I am not very sure how I would deal with units and such in this filter. In LaTeX this is being dealt with by means of famous siunitx package, so I could take a peek at its code and try to inspire myself, but I think that that would require actually adding extra markup that would be recognized by filter, something like @unit{5}{%} or @unit{10}{m/h}, but I am not sure if everybody would wont to write so much markup just to have nonbreakable spaces between number and unit.

Delanii commented 3 years ago

Just a side note: in lua regexp, % is the escape char, so you have to type %% so that it escapes itself. I wonder if this filter could be made more general by creating some metadata fields taking a list a characters to be associated with a non-breakable space. Something like "non-breakable-space-before", "non-breakable-space-after", "thin-space-before", "thin-space-after". Of course it would be complementary to the built-in configuration files per language that you mentioned. What are your thoughts about that? Le Thursday 06 May 2021 à 03:31:12AM, Jan Netík a écrit :

Hi @badumont !

Well, that is a great idea, I was actually thinking about something like that in the beginning, but then I thought/hoped that this filter could grow by feedback to incorporate more languages with their respective rules. I wanted the filter to be as simple to use as possible so there is no configuration. But the code is open as well, so you can modify it to your respective needs (by just modifying the table with prepositions) to tweak the behaviour as you need.

I also have no idea how to implement configuration of a lua filter. And how would that be stated (configuration file, default file, stating at command line??)

Do you have an idea how to do that?

badumont commented 3 years ago

I am not an expert either, but here is my guess.

If you wish to support multiple languages, I think that the filter itself should contain only general-purpose functions, not language-related data, exactly what you do in l. 117-140. So we could imagine an Inlines function that only calls functions like: replace-space-by-nbsp and replace-space-by-thin (if you expect from the user that he or she types a normal space where the nbsp should be, which of course is debatable). Every function would then read a table containing the characters that need a nbsp.

For the configuration, there could be two non-exclusive ways: the one that I mentioned, and another based on configuration files per language. The easiest way from the programmer's point of view is to write configuration files as Lua tables, so that they can be directly imported in the lua filter with dofile. It could look like this (simplified for French):

local nbsp_before = {
  ':',
  '»'
}

local nbsp_after = {
  '«'
}

local thin_before = {
  ';'
}

spaces_config = {
  nbsp_before = nbsp_before,
  nbsp_after = nbsp_after,
  thin_before = thin_before
}

Then simply add dofile "config/FR_config.lua" near the beginning of the file. Then you will be able to get the list of the characters that require a thin space before it with spaces_config.thin_before. Of course, these "characters" can be whole strings like the. That would only require to map the supported IETF language tags to the corresponding configuration files before calling dofile. Such configuration file could be easily written by users by modifying an existing one, I think.

Another thing: why do you define the nbsp character per format? Pandoc does the conversion itself (try pandoc -t latex -f html <<< '&nbsp').

I can help if you want. As a French user who currently selects manually every special space, I am very interested in this filter! However, in that case, it may be easier to do it on a separate repo for I already have a fork of lua-filters with a pending pull request...

hugoroy commented 3 years ago

Hi both I am very interested in this as well. I currently do this with a bash script as filter and use sed. Not very clean :-)

What about specifying functions depending on the language set?

For instance if meta.lang = fr then we may provide a list of functions to replace the non-breakable spaces with the right non-breakable space that should appear in French.

Edit: Okay I just saw how this is implemented.