pandoc / lua-filters

A collection of lua filters for pandoc
MIT License

Suggestion: Adding nonbreakablespace lua filter #114

Open Delanii opened 4 years ago

Delanii commented 4 years ago

Hello Mr. Tarleb,

with your help, I have finished writing and testing a filter that inserts a non-breakable space before or after specific strings. If I prepared an informative README.md and added a makefile and data for the tests, would you be interested in adding this filter to this repository?

I tried to follow the Lua code style recommendations and added comments that should make clear what I am doing (or trying to do).

The final code of the filter follows:

--[[
Indexed table of one-letter prefixes, after which should be inserted '\160'.
Verbose, but can be changed per user requirements.
--]]

local prefixes = {
  'a',
  'i',
  'k',
  'o',
  's',
  'u',
  'v',
  'z',
  'A',
  'I',
  'K',
  'O',
  'S',
  'U',
  'V',
  'Z'
}

--[[
Some languages (czech among them) require nonbreakable space *before* long dash
--]]

local dashes = {
  '--',
  '–'
}

--[[
Function responsible for searching for one-letter prefixes, after which a
non-breakable space is inserted. The function short-circuits, that means:

* If it finds a match with `prefix` in the `prefixes` table, it returns `true`.
* Otherwise, after the iteration is finished, it returns `false` (prefix wasn't
found).
--]]

function findOneLetterPrefix(myString)
  for index, prefix in ipairs(prefixes) do
    if myString == prefix then
      return true
    end
  end
  return false
end

--[[
Function responsible for searching for dashes, before which a non-breakable
space is inserted. The function short-circuits, that means:

* If it finds a match with `dash` in the `dashes` table, it returns `true`.
* Otherwise, after the iteration is finished, it returns `false` (dash wasn't
found).
--]]

function findDashes(myDash)
  for index, dash in ipairs(dashes) do
    if myDash == dash then
      return true
    end
  end
  return false
end

--[[
Core filter function:

* It iterates over all inline elements in block
* If it finds Space element, uses previously defined functions to find
`prefixes` or `dashes`
* Replaces Space element with `Str '\u{a0}'`, which is non-breakable space 
representation
* Returns modified list of inlines
--]]

function Inlines (inlines)
  for i = 1, #inlines do
    if inlines[i].t == 'Space' then

      -- Neighbours are nil at the start and end of the list, so keep them in
      -- locals and test for nil before looking at their type
      local before = inlines[i - 1]
      local after = inlines[i + 1]

      -- Check for one-letter prefixes in Str before Space

      if before and before.t == 'Str' then
        if findOneLetterPrefix(before.c) then
--        inlines[i] = pandoc.Str '\xc2\xa0' -- Both work
          inlines[i] = pandoc.Str '\u{a0}'
        end
      end

      -- Check for dashes in Str after Space

      if after and after.t == 'Str' then
        if findDashes(after.c) then
          inlines[i] = pandoc.Str '\u{a0}'
        end
      end

      -- Check for not fully parsed Str elements - Those might be products of
      -- other filters, that were executed before this one

      if after and after.t == 'Str' then
        if string.match(after.c, '%.*%s*[„]?%d+[“]?%s*%.*') then
          inlines[i] = pandoc.Str '\u{a0}'
        end
      end

    end

    --[[
    Check for Str containing sequence " prefix ", which might occur in case of
    preceding filter creates it in one Str element. Also check, if quotation
    mark is present introduced by "quotation.lua" filter
    --]]

    if inlines[i].t == 'Str' then
      for index, prefix in ipairs(prefixes) do
        -- `.*` captures the surrounding text; the earlier `%.*` matched only
        -- literal dots, so the text around the match was lost
        local front, detection, back = string.match(
          inlines[i].c, '(.*)(%s+[„]?' .. prefix .. '[“]?)%s+(.*)')
        if front then
          inlines[i].c = front .. detection .. '\u{a0}' .. back
        end
      end
    end

  end
  return inlines
end
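
The commented-out line in `Inlines` notes that both spellings of the non-breakable space work. A small check, assuming Lua 5.3 or later (where `\u{...}` escapes and the `utf8` library exist), shows why: both literals are the same two UTF-8 bytes. The decimal escape `'\160'` mentioned in the first comment is a different string, a single byte.

```lua
-- Both escapes denote U+00A0 NO-BREAK SPACE encoded as UTF-8.
local from_codepoint = '\u{a0}'     -- code point escape (Lua 5.3+)
local from_bytes     = '\xc2\xa0'   -- the same character as explicit bytes
local single_byte    = '\160'       -- one byte 0xA0, not valid UTF-8 on its own

print(from_codepoint == from_bytes)   -- true
print(#from_codepoint)                -- 2 (bytes, not characters)
print(utf8.len(from_codepoint))       -- 1 (one code point)
print(single_byte == from_codepoint)  -- false
```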

Looking forward to your reply.

Regards, Tomas

tarleb commented 4 years ago

Thank you Tomas, I appreciate the offer! Could you tell me a bit more about the use cases of this filter? If I understand correctly, this is for text written in Czech. I'd like to understand why it is needed, and whether it supports a common typographical convention. If it solves a common problem for Czech writers, then I believe it should fit in.

If this turns out to be a less general filter, an alternative would be to host it in your own repository and tag the repo with the pandoc-filter topic to make it discoverable. In that case it should also be mentioned in the pandoc wiki under https://github.com/jgm/pandoc/wiki/Pandoc-Filters.

In either case, I'll be happy to help and provide more feedback.

Delanii commented 4 years ago

You are indeed correct. The filter addresses a common typographic requirement: one-letter words (in Czech these are prefixes, or rather prepositions in linguistic terms, and conjunctions; I may be missing the correct terms) must never appear at the end of a line. It also tries to add a non-breakable space before every en dash and before every number (to keep a number together with what it refers to, so that "chapter 9" is not split across two lines). It should be noted (and I do so in the README) that this puts some strain on line breaking, so hyphenation should be allowed wherever possible.
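
The three rules described above can be sketched in plain Lua, without pandoc. This is only an illustration under stated assumptions: `join_words` and its tables are hypothetical names, the input is a list of already separated words, and "number" is taken to mean a word that starts with a digit.

```lua
-- Hypothetical helper, not part of the filter: joins words with ordinary
-- spaces, but uses a non-breakable space after one-letter words and before
-- dashes and numbers.
local NBSP = '\u{a0}'

local one_letter_words = {a = true, i = true, k = true, o = true,
                          s = true, u = true, v = true, z = true}
local dashes = {['--'] = true, ['–'] = true}

local function join_words(words)
  local out = {}
  for i, word in ipairs(words) do
    out[#out + 1] = word
    local next_word = words[i + 1]
    if next_word then
      local glue = ' '
      if one_letter_words[word:lower()]   -- one-letter word before the gap
         or dashes[next_word]             -- dash after the gap
         or next_word:match('^%d')        -- number after the gap
      then
        glue = NBSP
      end
      out[#out + 1] = glue
    end
  end
  return table.concat(out)
end

-- Both gaps become non-breakable: after "v" and before "9".
print(join_words({'v', 'chapter', '9'}))
```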

The functions with regexes look for the patterns mentioned above inside strings that, for some reason, have not been split into Str and Space elements. I have seen this happen when an earlier filter performs macro expansion or string replacement.

I am also trying to detect strings that contain different quotation marks. I found a simple filter proposed by jgm that replaces the quotation marks inserted by pandoc with chosen UTF-8 symbols, which unfortunately produces strings like Str „text“.

In a case such as

Str "„a" Space Str "quoted" Space Str "string“"

my filter would not detect the "a" because of the leading quotation mark. With those regexes it should.
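
One caveat with the regex approach: Lua patterns work on bytes, so a character class such as `[„]` matches a single byte of the three-byte UTF-8 encoding of `„` rather than the whole character. A minimal sketch of an alternative that compares the start of the string as plain text follows; `strip_opening_quote` and the list of quotation marks are hypothetical and would need adjusting per language.

```lua
-- Hypothetical helper: removes one leading opening quotation mark, if present.
local opening_quotes = {'„', '“', '"'}

local function strip_opening_quote(text)
  for _, quote in ipairs(opening_quotes) do
    -- Plain byte-wise prefix comparison; no pattern matching involved.
    if text:sub(1, #quote) == quote then
      return text:sub(#quote + 1)
    end
  end
  return text
end

local prefixes = {a = true, i = true}  -- shortened for the example

print(#'„')                                     -- 3 (bytes)
print(prefixes[strip_opening_quote('„a')])      -- true
print(prefixes[strip_opening_quote('quoted')])  -- nil
```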

Admittedly, I am not using the official quotations.lua, which I perhaps should.

The filter is far from perfect, doesn't cover every typographic aspect, and may need user adjustment depending on the language, but I dare say it is a good start.

I have tested it with the docx and odt formats, which are my main targets when converting from TeX. In LuaTeX and ConTeXt I use Lua callbacks (post-linebreak-filter), so I have not tested .tex output, but I expect that Str "\u{a0}" is written as ~ in the .tex source.

Some references in this topic (on tex.se):

Using non-breaking space Another typography

These issues also led to the creation of the vlna TeX preprocessor (specifically for Czech), the lua-vlna package on CTAN, a ConTeXt alternative, and others ...

So the use case is general writing at a level of typography that requires conformity with this rule. In Czech the rule is widely known but sometimes neglected (because of authoring in docx, which tries to handle it automatically, but not really ...).

Sure, hosting it in my own repository would be fine too, but I dare say that having a filter accepted here is a kind of quality assurance, which I would like to achieve (and I will follow any requirements or recommendations).

Final note: it seems that the code formatting broke a little; I am using Notepad, which automatically inserts tabs instead of spaces. If necessary, I will try to fix that.

tarleb commented 4 years ago

Thanks for the resources, this helped. I agree that the filter is an excellent fit for this repo, and I'll be glad to merge it. Would you like to open a PR?

There are some remaining questions and possible modifications. I apologize beforehand for being a rather critical reviewer. The strictness is mostly motivated by the fact that I must be able to maintain any filter in case the original author becomes unavailable and we have to include fixes or updates for newer pandoc versions. We also try to use a consistent style for the filters.

Thanks!

Delanii commented 4 years ago

I will definitely open a pull request then. I have to say it will be my first time doing that, so please bear with me ... :) I will prepare a suitable README, test and makefile. I do understand your requirements and value them, because as a beginner I find it easier to follow guidelines or rules.

About the first bullet: I excluded them simply because there are no such one-letter words in Czech. The filter could instead be written to prohibit any one-letter word at the end of a line. But I know people who want to go beyond this rule and also prevent two-letter prefixes from being "orphaned" at the end of a line. For such people I would like to offer an easy way to tweak the filter's behavior.
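
The length-based variant mentioned here could look roughly like the sketch below. It is not what the filter does: `is_short_word` and `max_len` are hypothetical names, and the code assumes Lua 5.3 or later for the `utf8` library. Note that such a test would also treat a lone dash or digit as a short word, which a fixed word list avoids.

```lua
-- Hypothetical alternative to a fixed word list: treat every word of at most
-- `max_len` characters as "short".
local function is_short_word(word, max_len)
  local len = utf8.len(word)   -- counts code points; nil for invalid UTF-8
  return len ~= nil and len >= 1 and len <= max_len
end

print(is_short_word('v', 1))    -- true
print(is_short_word('ve', 1))   -- false
print(is_short_word('ve', 2))   -- true
print(is_short_word('č', 1))    -- true: one character, two bytes
```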

Second: Oh, OK, I must have seen that somewhere. I will fix that.

Third: So after modifying the prefixes table as you suggest, in the for loop of function findOneLetterPrefix (to be renamed), instead of:

for index, prefix in ipairs(prefixes) do

write

for word in prefixes[word] do

Did I get that right? As a Lua newbie, I have never seen that.

Fourth: OK, I must have missed that. I will change it, although I very much prefer camelCase over snake_case; it rather put me off playing with Rust, whose compiler is very strict even about function naming.

I will get the modifications done in a few days; I currently have the usual autumn cold, so I will get to it once I am back to full strength.

bpj commented 4 years ago

if prefixes[word] then
  -- do what you need to do when word is a prefix
end
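
The idiom in this reply treats a table as a set: the words are the keys and the value is simply `true`, so a membership test is one lookup instead of a loop. A minimal sketch, with `prefix_list` and `is_prefix` as hypothetical names chosen for the example:

```lua
-- A list of words, as in the original filter ...
local prefix_list = {'a', 'i', 'k', 'o', 's', 'u', 'v', 'z'}

-- ... converted into a set: the words become keys, the values are `true`.
local prefixes = {}
for _, word in ipairs(prefix_list) do
  prefixes[word] = true
  prefixes[word:upper()] = true
end

-- Membership is a single table lookup; no loop is needed.
local function is_prefix(word)
  return prefixes[word] == true
end

print(is_prefix('v'))    -- true
print(is_prefix('V'))    -- true
print(is_prefix('ve'))   -- false
```

A missing key yields `nil`, which is falsy, so `if prefixes[word] then` works directly as the test.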


Delanii commented 4 years ago

I have found out that the filter does not actually work for the html and latex formats: in those cases it doesn't insert anything (I was hoping the Unicode sequence would be converted to &nbsp; or ~).

I will try to fix that.

EDIT: I am still struggling with the suggestion about membership checking. Even with @bpj's clarification I am unable to make it work. I have settled on the following nonbreakablespace.lua filter:

nonbreakablespace.lua

--[[
Indexed table of one-letter prefixes, after which should be inserted '\160'.
Verbose, but can be changed per user requirements.
--]]

local prefixes = {
  'a',
  'i',
  'k',
  'o',
  's',
  'u',
  'v',
  'z',
  'A',
  'I',
  'K',
  'O',
  'S',
  'U',
  'V',
  'Z'
}

--[[
Some languages (czech among them) require nonbreakable space *before* long dash
--]]

local dashes = {
  '--',
  '–'
}

--[[
Table of replacement elements
--]]

local nonbreakablespaces = {
  html = '&nbsp;',
  latex = '~',
  context = '~'
}

--[[
Function responsible for searching for one-letter prefixes, after which is
inserted non-breakable space. Function is short-circuited, that means:

* If it finds match with `prefix` in `prefixes` table, then it returns `true`.
* Otherwise, after the iteration is finished, returns `false` (prefix wasnt
found).
--]]

function find_one_letter_prefix(my_string)
  for index, prefix in ipairs(prefixes) do
    if my_string == prefix then
      return true
      end
  end
  return false
end

--[[
Function responsible for searching for dashes, before whose is inserted
non-breakable space. Function is short-circuited, that means:

* If it finds match with `dash` in `dashes` table, then it returns `true`.
* Otherwise, after the iteration is finished, returns `false` (dash wasnt
found).
--]]

function find_dashes(my_dash)
  for index, dash in ipairs(dashes) do
    if my_dash == dash then
      return true
      end
  end
  return false
end

--[[
Function to determine Space element replacement for non-breakable space according to output format
--]]

function insert_nonbreakable_space(format)
  if format == 'html' then
    return pandoc.RawInline('html', nonbreakablespaces.html)
  elseif format:match 'latex' then
    return pandoc.RawInline('tex',nonbreakablespaces.latex)
  elseif format:match 'context' then
    return pandoc.RawInline('tex', nonbreakablespaces.context)
  else
    --fallback to inserting non-breakable space unicode symbol
    return pandoc.Str '\u{a0}'
  end
end

--[[
Core filter function:

* It iterates over all inline elements in block
* If it finds Space element, uses previously defined functions to find
`prefixes` or `dashes`
* Replaces Space element with `Str '\u{a0}'`, which is non-breakable space
representation
* Returns modified list of inlines
--]]

function Inlines (inlines)

  --variable holding replacement value for the non-breakable space
  local insert = insert_nonbreakable_space(FORMAT)

  for i = 1, #inlines do
    if inlines[i].t == 'Space' then

      -- Neighbours are nil at the start and end of the list, so keep them in
      -- locals and test for nil before looking at their type
      local before = inlines[i - 1]
      local after = inlines[i + 1]

      -- Check for one-letter prefixes in Str before Space

      if before and before.t == 'Str' then
        if find_one_letter_prefix(before.text) then
--        inlines[i] = pandoc.Str '\xc2\xa0' -- Both work
          inlines[i] = insert
        end
      end

      -- Check for dashes in Str after Space

      if after and after.t == 'Str' then
        if find_dashes(after.text) then
          inlines[i] = insert
        end
      end

      -- Check for not fully parsed Str elements - Those might be products of
      -- other filters, that were executed before this one

      if after and after.t == 'Str' then
        if string.match(after.text, '%.*%s*[„]?%d+[“]?%s*%.*') then
          inlines[i] = insert
        end
      end

    end

    --[[
    Check for Str containing sequence " prefix ", which might occur in case of
    preceding filter creates it in one Str element. Also check, if quotation
    mark is present introduced by "quotation.lua" filter
    --]]

    if inlines[i].t == 'Str' then
      for index, prefix in ipairs(prefixes) do
        -- `.*` captures the surrounding text; the earlier `%.*` matched only
        -- literal dots, so the text around the match was lost
        local front, detection, back = string.match(
          inlines[i].text, '(.*)(%s+[„]?' .. prefix .. '[“]?)%s+(.*)')
        if front then
          -- `insert` is an element and cannot be concatenated into a string,
          -- so the Unicode character is used inside the text
          inlines[i].text = front .. detection .. '\u{a0}' .. back
        end
      end
    end

  end
  return inlines
end

If I try the following changes:

local prefixes = {
  ['a'] = true,
  ['i'] = true,
  ['k'] = true,
  ['o'] = true,
  ['s'] = true,
  ['u'] = true,
  ['v'] = true,
  ['z'] = true,
  ['A'] = true,
  ['I'] = true,
  ['K'] = true,
  ['O'] = true,
  ['S'] = true,
  ['U'] = true,
  ['V'] = true,
  ['Z'] = true
}

function find_one_letter_prefix(my_string)
  for index, prefix in ipairs(prefixes) do
    if prefixes[prefix] then
      return true
      end
  end
  return false
end

making the whole code:

--[[
Indexed table of one-letter prefixes, after which should be inserted '\160'.
Verbose, but can be changed per user requirements.
--]]

local prefixes = {
  ['a'] = true,
  ['i'] = true,
  ['k'] = true,
  ['o'] = true,
  ['s'] = true,
  ['u'] = true,
  ['v'] = true,
  ['z'] = true,
  ['A'] = true,
  ['I'] = true,
  ['K'] = true,
  ['O'] = true,
  ['S'] = true,
  ['U'] = true,
  ['V'] = true,
  ['Z'] = true
}

--[[
Some languages (czech among them) require nonbreakable space *before* long dash
--]]

local dashes = {
  '--',
  '–'
}

--[[
Table of replacement elements
--]]

local nonbreakablespaces = {
  html = '&nbsp;',
  latex = '~',
  context = '~'
}

--[[
Function responsible for searching for one-letter prefixes, after which is
inserted non-breakable space. Function is short-circuited, that means:

* If it finds match with `prefix` in `prefixes` table, then it returns `true`.
* Otherwise, after the iteration is finished, returns `false` (prefix wasnt
found).
--]]

function find_one_letter_prefix(my_string)
  for index, prefix in ipairs(prefixes) do
    if prefixes[my_string] then
      return true
      end
  end
  return false
end

--[[
Function responsible for searching for dashes, before whose is inserted
non-breakable space. Function is short-circuited, that means:

* If it finds match with `dash` in `dashes` table, then it returns `true`.
* Otherwise, after the iteration is finished, returns `false` (dash wasnt
found).
--]]

function find_dashes(my_dash)
  for index, dash in ipairs(dashes) do
    if my_dash == dash then
      return true
      end
  end
  return false
end

--[[
Function to determine Space element replacement for non-breakable space according to output format
--]]

function insert_nonbreakable_space(format)
  if format == 'html' then
    return pandoc.RawInline('html', nonbreakablespaces.html)
  elseif format:match 'latex' then
    return pandoc.RawInline('tex',nonbreakablespaces.latex)
  elseif format:match 'context' then
    return pandoc.RawInline('tex',nonbreakablespaces.latex)
  else
    --fallback to inserting non-breakable space unicode symbol
    return pandoc.Str '\u{a0}'
  end
end

--[[
Core filter function:

* It iterates over all inline elements in block
* If it finds Space element, uses previously defined functions to find
`prefixes` or `dashes`
* Replaces Space element with `Str '\u{a0}'`, which is non-breakable space
representation
* Returns modified list of inlines
--]]

function Inlines (inlines)

  --variable holding replacement value for the non-breakable space
  local insert = insert_nonbreakable_space(FORMAT)

  for i = 1, #inlines do
    if inlines[i].t == 'Space' then

      -- Check for one-letter prefixes in Str before Space

      if inlines[i - 1].t == 'Str' then
          local one_letter_prefix = find_one_letter_prefix(inlines[i - 1].text)
            if one_letter_prefix == true then
--          inlines[i] = pandoc.Str '\xc2\xa0' -- Both work
          inlines[i] = insert
        end
        end

      -- Check for dashes in Str after Space

        if inlines[i + 1].t == 'Str' then
          local dash = find_dashes(inlines[i + 1].text)
            if dash == true then
              inlines[i] = insert
            end
        end

        -- Check for not fully parsed Str elements - Those might be products of
        -- other filters, that were executed before this one

        if inlines[i + 1].t == 'Str' then
          if string.match(inlines[i + 1].text, '%.*%s*[„]?%d+[“]?%s*%.*') then
              inlines[i] = insert
            end
        end

    end

      --[[
      Check for Str containing sequence " prefix ", which might occur in case of
      preceding filter creates it in one Str element. Also check, if quotation
      mark is present introduced by "quotation.lua" filter
      --]]

      if inlines[i].t == 'Str' then
        for index, prefix in ipairs(prefixes) do
          if string.match(inlines[i].text, '%.*%s+[„]?' .. prefix .. '[“]?%s+%.*') then
              front, detection, replacement, back = string.match(inlines[i].c, '(%.*)(%s+[„]?' .. prefix .. '[“]?)(%s+)(%.*)')
              inlines[i].text = front .. detection .. insert .. back
            end
        end
      end

  end
  return inlines
end

It doesn't work: no Space replacement is done for the prefixes. I have tested every variation I could think of, almost blindly, because I simply don't understand how this concept (idiom) works.

Could you help me meet this requirement with a simple example? I have tried to find something on SO or in "Programming in Lua", but I wasn't successful.

On the other hand, I already have all the required files prepared: the filter file, the test file, the expected test result and the makefile. I based the makefile on the pagebreak makefile. So apart from this unfulfilled requirement, I can open the PR anytime.
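
A note on why the attempt above has no effect: `ipairs` only visits the integer keys 1, 2, 3, ... of a table. Once `prefixes` uses the words themselves as keys, it has no integer keys, so the loop body in `find_one_letter_prefix` never runs and the function always returns `false`. The same holds for the `ipairs(prefixes)` loop near the end of `Inlines`, which would need `pairs` instead. A minimal sketch, with the table shortened for brevity:

```lua
local prefixes = {a = true, i = true, v = true}

-- ipairs visits nothing here, because the table has no key 1.
local visited = 0
for _ in ipairs(prefixes) do visited = visited + 1 end
print(visited)  -- 0

-- Without the loop, the lookup works as intended.
local function find_one_letter_prefix(my_string)
  return prefixes[my_string] == true
end

print(find_one_letter_prefix('v'))    -- true
print(find_one_letter_prefix('ve'))   -- false

-- To walk over all words of a set, iterate with pairs and use the key.
local count = 0
for prefix in pairs(prefixes) do count = count + 1 end
print(count)  -- 3
```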