Open Delanii opened 4 years ago
Thank you Tomas, I appreciate the offer! Could you tell me a bit more about the use cases of this filter? If I understand correctly, it is meant for text written in Czech. I'd like to understand why it is needed, and whether it supports a common typographical convention. If it solves a common problem for Czech writers, then I believe it should fit in.
If this turns out to be a less general filter, an alternative would be to host it in your own repository and tag the repo with the `pandoc-filter` topic to make it discoverable. In that case it should also be mentioned in the pandoc wiki under https://github.com/jgm/pandoc/wiki/Pandoc-Filters.
In either case, I'll be happy to help and provide more feedback.
Indeed, you are correct.
This filter tries to solve a common typographic requirement: one-letter words (in Czech these are prefixes, or rather prepositions in linguistic terms, and conjunctions; I might be missing the correct terms) must never appear at the end of a line. It also tries to add a non-breakable space before every en dash and before every number, to prevent a number from being separated from its meaning (for example, "chapter 9" being broken across two lines). It should be noted (as I do in the `README`) that this puts some strain on line breaking, so hyphenation should be allowed where possible.
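The three rules described above can be sketched on plain strings, without pandoc. This is only an illustration under stated assumptions: `tie_words`, `is_dash` and `starts_with_digit` are hypothetical helper names, the letter list is the one used later in this thread, and the real filter works on pandoc inline elements rather than on raw text.

```lua
-- U+00A0 NO-BREAK SPACE, written with the Lua 5.3+ Unicode escape.
local NBSP = '\u{a0}'

-- Set of one-letter words, built from the letters discussed in this thread.
local one_letter_words = {}
for letter in ('aikosuvz'):gmatch('.') do
  one_letter_words[letter] = true
  one_letter_words[letter:upper()] = true
end

-- Hypothetical helper: true for '--' and for the en dash.
local function is_dash(word)
  return word == '--' or word == '–'
end

-- Hypothetical helper: true if the word begins with a digit.
local function starts_with_digit(word)
  return word:match('^%d') ~= nil
end

-- Rejoins whitespace-separated words, using NBSP instead of a plain space
-- after a one-letter word, before a dash, and before a number.
local function tie_words(text)
  local words = {}
  for word in text:gmatch('%S+') do
    words[#words + 1] = word
  end
  local out = {}
  for i, word in ipairs(words) do
    out[#out + 1] = word
    local next_word = words[i + 1]
    if next_word then
      local tie = one_letter_words[word]
        or is_dash(next_word)
        or starts_with_digit(next_word)
      out[#out + 1] = tie and NBSP or ' '
    end
  end
  return table.concat(out)
end

-- Show the non-breakable spaces as '~' so they are visible in a terminal.
print((tie_words('Byl v kapitole 9 – a pak'):gsub(NBSP, '~')))
--> Byl v~kapitole~9~– a~pak
```

Every tie removes one possible break point, which is why the strain on line breaking mentioned above grows with the number of rules applied.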
The functions with regexes try to find the aforementioned patterns in strings that, for some reason, were not parsed into `Str` and `Space` elements. I have seen this happen when an earlier filter does macro expansion or string replacement.
I am also trying to detect strings that contain different quotation marks. I found a simple filter proposed by jgm that changes the quotation marks inserted by pandoc to chosen UTF symbols, which sadly produces strings like `Str „text“`. In a case such as `Str "„a" Space Str "quoted" Space Str "string“"`, my filter would not detect the "a" with the opening quotation mark. With those regexes it should.
Well, I am not using the official `quotations.lua`, which I maybe should.
The filter is far from perfect, doesn't cover every typographical aspect, and might require user intervention depending on the requirements of the user's language, but I dare say that it is a good start.
I have tested it with the `docx` and `odt` formats, which are my main targets when converting from TeX. In LuaTeX and ConTeXt I use Lua callbacks (`post-linebreak-filter`), so I have not tested the `.tex` format, but I expect `Str "\u{a0}"` to insert `~` in the `.tex` source.
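As a side note on the `"\u{a0}"` spelling: assuming Lua 5.3 or later (the `\u{...}` escape and the `utf8` library do not exist in older versions), it is simply the UTF-8 encoding of U+00A0, so it is the same string as the byte escape `'\xc2\xa0'`. A small check that runs without pandoc:

```lua
local nbsp_escape = '\u{a0}'    -- Unicode escape (Lua 5.3+)
local nbsp_bytes = '\xc2\xa0'   -- the same character as raw UTF-8 bytes

print(nbsp_escape == nbsp_bytes)          --> true
print(#nbsp_escape)                       --> 2 (length in bytes)
print(utf8.len(nbsp_escape))              --> 1 (length in code points)
print(utf8.char(0xA0) == nbsp_escape)     --> true
```

What an output format does with that character is a separate question that this check does not answer.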
Some references on this topic (on `tex.se`):
Using non-breaking space Another typography
These issues also led to the creation of the `vlna` TeX preprocessor (specifically for Czech), the `lua-vlna` package on CTAN, a ConTeXt alternative, and others.
So the use case is general writing where the expected level of typography requires conformity with this rule. In Czech, the rule is widely known but sometimes neglected (due to `docx` authoring, where the software tries to handle it automatically, but not really...).
Sure, hosting it in my own repository would be great too, but I dare say that having a filter accepted here is a kind of quality assurance, which I would like to achieve (and I will follow any requirements or recommendations).
Final note: it seems the code formatting broke a little; I am using Notepad, which automatically inserts tabs instead of spaces. If necessary, I will try to fix that.
Thanks for the resources, this helped. I agree that the filter is an excellent fit for this repo, and I'll be glad to merge it. Would you like to open a PR?
There are some remaining questions and possible modifications. I apologize beforehand for being a rather critical reviewer. The strictness is mostly motivated by the fact that I must be able to maintain any filter in case the original author becomes unavailable and we have to include fixes or updates for newer pandoc versions. We also try to use a consistent style for the filters.
- The filter does not seem to add a non-breakable space after letters such as `í` or `š`. Is that correct? Some answers in the linked tex.se Q/A appear to place nbsp even after those letters, while most don't. I assume you excluded those letters from `prefixes` on purpose?
- Using `.c` to access an element's contents is not officially supported and might break in future versions. Better to use `.text` when accessing `Str` contents.
- Consider defining the prefixes as a set, e.g. `local prefixes = {['a'] = true, ['z'] = true}`; this allows us to check set membership by running `prefixes[word]`.
- We use `snake_case` instead of `camelCase` for most names. We are not super strict about it, but it would be nice to become more consistent across the codebase.

Thanks!
I will definitely open a Pull Request then. I have to say, it will be my first time doing that, so please bear with me ... :) I will prepare a suitable README, test and makefile. I understand your requirements and also value them, because as a beginner I find it easier to follow guidelines or rules.
About the first bullet: I excluded them simply because there are no such one-letter words in the Czech language. The filter could be written to prohibit any one-letter word at the end of a line. But I know people who want to go beyond this rule and even prevent two-letter prefixes from being "orphaned" at the end of a line. For people like that, I would like to offer an easy option to tweak the filter's behavior.
Second: Oh, OK, I must have seen that somewhere. I will fix that.
Third: So after modifying the `prefixes` table as you suggest, in the `for` loop in the function `findOneLetterPrefix` (to be renamed), instead of `for index, prefix in ipairs(prefixes) do` I should write `for word in prefixes[word] do`. Did I get that correctly? As a Lua newbie, I have never seen that.
Fourth: OK, I must have missed that. I will change it, but I very much prefer `camelCase` over `snake_case`; it kind of drove me away from playing with Rust, whose compiler is very strict even about function naming.
I will get the modifications done in a few days; I currently have the usual autumn cold, so I will get to it when I am back at full strength.
```lua
if prefixes[word] then
  -- do what you need to do when word is a prefix
end
```
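For readers who have not met this idiom: a Lua "set" is an ordinary table whose keys are the members and whose values are `true`, so a membership test is a single lookup with no loop. A minimal, self-contained sketch (the helper name `is_prefix` and the shortened table are illustrative assumptions, not code from the filter):

```lua
-- The keys are the members; every value is just `true`.
local prefixes = { a = true, i = true, k = true, z = true }

-- Hypothetical helper: looking up a missing key yields nil, which is falsy.
-- Comparing with `true` turns the result into a proper boolean.
local function is_prefix(word)
  return prefixes[word] == true
end

print(is_prefix('k'))         --> true
print(is_prefix('kapitola'))  --> false
```

The lookup takes the same time no matter how many prefixes the table holds, which also keeps the table easy to extend with two-letter words.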
I have actually found out that the filter does not work for the `html` and `latex` formats: in those cases it doesn't insert anything (I was hoping the Unicode sequence would be converted to `~`). I will try to fix that.
EDIT: I am still struggling with the suggestion about membership checking. Even with @bpj's clarification I am unable to make it work. I have settled on the following `nonbreakablespace.lua` filter:
nonbreakablespace.lua
```lua
--[[
Indexed table of one-letter prefixes, after which should be inserted '\160'.
Verbose, but can be changed per user requirements.
--]]
local prefixes = {
  'a',
  'i',
  'k',
  'o',
  's',
  'u',
  'v',
  'z',
  'A',
  'I',
  'K',
  'O',
  'S',
  'U',
  'V',
  'Z'
}

--[[
Some languages (Czech among them) require a non-breakable space *before* a long dash
--]]
local dashes = {
  '--',
  '–'
}

--[[
Table of replacement elements
--]]
local nonbreakablespaces = {
  html = '&nbsp;',
  latex = '~',
  context = '~'
}

--[[
Function responsible for searching for one-letter prefixes, after which a
non-breakable space is inserted. The function is short-circuited, that means:
* If it finds a match with `prefix` in the `prefixes` table, it returns `true`.
* Otherwise, after the iteration is finished, it returns `false` (prefix wasn't
  found).
--]]
function find_one_letter_prefix(my_string)
  for index, prefix in ipairs(prefixes) do
    if my_string == prefix then
      return true
    end
  end
  return false
end

--[[
Function responsible for searching for dashes, before which a non-breakable
space is inserted. The function is short-circuited, that means:
* If it finds a match with `dash` in the `dashes` table, it returns `true`.
* Otherwise, after the iteration is finished, it returns `false` (dash wasn't
  found).
--]]
function find_dashes(my_dash)
  for index, dash in ipairs(dashes) do
    if my_dash == dash then
      return true
    end
  end
  return false
end

--[[
Function to determine the Space element replacement for a non-breakable space
according to the output format
--]]
function insert_nonbreakable_space(format)
  if format == 'html' then
    return pandoc.RawInline('html', nonbreakablespaces.html)
  elseif format:match 'latex' then
    return pandoc.RawInline('tex', nonbreakablespaces.latex)
  elseif format:match 'context' then
    return pandoc.RawInline('tex', nonbreakablespaces.latex)
  else
    -- fallback to inserting the non-breakable space unicode symbol
    return pandoc.Str '\u{a0}'
  end
end

--[[
Core filter function:
* It iterates over all inline elements in a block
* If it finds a Space element, it uses the previously defined functions to find
  `prefixes` or `dashes`
* Replaces the Space element with `Str '\u{a0}'`, which is the non-breakable
  space representation
* Returns the modified list of inlines
--]]
function Inlines (inlines)
  -- variable holding the replacement value for the non-breakable space
  local insert = insert_nonbreakable_space(FORMAT)

  for i = 1, #inlines do
    if inlines[i].t == 'Space' then
      -- Check for one-letter prefixes in Str before Space
      if inlines[i - 1].t == 'Str' then
        local one_letter_prefix = find_one_letter_prefix(inlines[i - 1].text)
        if one_letter_prefix == true then
          -- inlines[i] = pandoc.Str '\xc2\xa0' -- Both work
          inlines[i] = insert
        end
      end

      -- Check for dashes in Str after Space
      if inlines[i + 1].t == 'Str' then
        local dash = find_dashes(inlines[i + 1].text)
        if dash == true then
          inlines[i] = insert
        end
      end

      -- Check for not fully parsed Str elements - Those might be products of
      -- other filters, that were executed before this one
      if inlines[i + 1].t == 'Str' then
        if string.match(inlines[i + 1].text, '%.*%s*[„]?%d+[“]?%s*%.*') then
          inlines[i] = insert
        end
      end
    end

    --[[
    Check for Str containing the sequence " prefix ", which might occur when a
    preceding filter creates it in one Str element. Also check whether a
    quotation mark introduced by the "quotation.lua" filter is present
    --]]
    if inlines[i].t == 'Str' then
      for index, prefix in ipairs(prefixes) do
        if string.match(inlines[i].text, '%.*%s+[„]?' .. prefix .. '[“]?%s+%.*') then
          front, detection, replacement, back = string.match(inlines[i].c, '(%.*)(%s+[„]?' .. prefix .. '[“]?)(%s+)(%.*)')
          inlines[i].text = front .. detection .. insert .. back
        end
      end
    end
  end
  return inlines
end
```
If I try the following changes:

```lua
local prefixes = {
  ['a'] = true,
  ['i'] = true,
  ['k'] = true,
  ['o'] = true,
  ['s'] = true,
  ['u'] = true,
  ['v'] = true,
  ['z'] = true,
  ['A'] = true,
  ['I'] = true,
  ['K'] = true,
  ['O'] = true,
  ['S'] = true,
  ['U'] = true,
  ['V'] = true,
  ['Z'] = true
}

function find_one_letter_prefix(my_string)
  for index, prefix in ipairs(prefixes) do
    if prefixes[prefix] then
      return true
    end
  end
  return false
end
```

making the whole code:
```lua
--[[
Indexed table of one-letter prefixes, after which should be inserted '\160'.
Verbose, but can be changed per user requirements.
--]]
local prefixes = {
  ['a'] = true,
  ['i'] = true,
  ['k'] = true,
  ['o'] = true,
  ['s'] = true,
  ['u'] = true,
  ['v'] = true,
  ['z'] = true,
  ['A'] = true,
  ['I'] = true,
  ['K'] = true,
  ['O'] = true,
  ['S'] = true,
  ['U'] = true,
  ['V'] = true,
  ['Z'] = true
}

--[[
Some languages (Czech among them) require a non-breakable space *before* a long dash
--]]
local dashes = {
  '--',
  '–'
}

--[[
Table of replacement elements
--]]
local nonbreakablespaces = {
  html = '&nbsp;',
  latex = '~',
  context = '~'
}

--[[
Function responsible for searching for one-letter prefixes, after which a
non-breakable space is inserted. The function is short-circuited, that means:
* If it finds a match with `prefix` in the `prefixes` table, it returns `true`.
* Otherwise, after the iteration is finished, it returns `false` (prefix wasn't
  found).
--]]
function find_one_letter_prefix(my_string)
  for index, prefix in ipairs(prefixes) do
    if prefixes[my_string] then
      return true
    end
  end
  return false
end

--[[
Function responsible for searching for dashes, before which a non-breakable
space is inserted. The function is short-circuited, that means:
* If it finds a match with `dash` in the `dashes` table, it returns `true`.
* Otherwise, after the iteration is finished, it returns `false` (dash wasn't
  found).
--]]
function find_dashes(my_dash)
  for index, dash in ipairs(dashes) do
    if my_dash == dash then
      return true
    end
  end
  return false
end

--[[
Function to determine the Space element replacement for a non-breakable space
according to the output format
--]]
function insert_nonbreakable_space(format)
  if format == 'html' then
    return pandoc.RawInline('html', nonbreakablespaces.html)
  elseif format:match 'latex' then
    return pandoc.RawInline('tex', nonbreakablespaces.latex)
  elseif format:match 'context' then
    return pandoc.RawInline('tex', nonbreakablespaces.latex)
  else
    -- fallback to inserting the non-breakable space unicode symbol
    return pandoc.Str '\u{a0}'
  end
end

--[[
Core filter function:
* It iterates over all inline elements in a block
* If it finds a Space element, it uses the previously defined functions to find
  `prefixes` or `dashes`
* Replaces the Space element with `Str '\u{a0}'`, which is the non-breakable
  space representation
* Returns the modified list of inlines
--]]
function Inlines (inlines)
  -- variable holding the replacement value for the non-breakable space
  local insert = insert_nonbreakable_space(FORMAT)

  for i = 1, #inlines do
    if inlines[i].t == 'Space' then
      -- Check for one-letter prefixes in Str before Space
      if inlines[i - 1].t == 'Str' then
        local one_letter_prefix = find_one_letter_prefix(inlines[i - 1].text)
        if one_letter_prefix == true then
          -- inlines[i] = pandoc.Str '\xc2\xa0' -- Both work
          inlines[i] = insert
        end
      end

      -- Check for dashes in Str after Space
      if inlines[i + 1].t == 'Str' then
        local dash = find_dashes(inlines[i + 1].text)
        if dash == true then
          inlines[i] = insert
        end
      end

      -- Check for not fully parsed Str elements - Those might be products of
      -- other filters, that were executed before this one
      if inlines[i + 1].t == 'Str' then
        if string.match(inlines[i + 1].text, '%.*%s*[„]?%d+[“]?%s*%.*') then
          inlines[i] = insert
        end
      end
    end

    --[[
    Check for Str containing the sequence " prefix ", which might occur when a
    preceding filter creates it in one Str element. Also check whether a
    quotation mark introduced by the "quotation.lua" filter is present
    --]]
    if inlines[i].t == 'Str' then
      for index, prefix in ipairs(prefixes) do
        if string.match(inlines[i].text, '%.*%s+[„]?' .. prefix .. '[“]?%s+%.*') then
          front, detection, replacement, back = string.match(inlines[i].c, '(%.*)(%s+[„]?' .. prefix .. '[“]?)(%s+)(%.*)')
          inlines[i].text = front .. detection .. insert .. back
        end
      end
    end
  end
  return inlines
end
```
It doesn't work: no `Space` replacement is done for the `prefixes`. I have tested every variation I could think of, almost blindly, because I simply don't understand how this concept (idiom) works.
Could you help me meet this requirement with a simple example? I tried to find something on SO and in "Programming in Lua", but I wasn't successful.
On the other hand, I already have all the required files prepared: the filter file, the test file, the expected test result and the `makefile`. I created the `makefile` based on the pagebreak `makefile`. So except for this unfulfilled requirement, I can open the PR anytime.
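A likely reason the attempt above finds nothing: `ipairs` only visits the integer keys 1, 2, 3, ... and stops at the first missing one. Once `prefixes` is keyed by the letters themselves it has no integer keys, so the loop body never runs and the function always falls through to `return false`. The sketch below shows this on a shortened table and one possible rewrite; it keeps the function name from the post but is an illustration, not the code that was eventually submitted.

```lua
-- Shortened set of prefixes, keyed by the letters themselves.
local prefixes = { a = true, i = true, k = true }

-- ipairs sees no integer keys here, so it iterates zero times.
local visits = 0
for _ in ipairs(prefixes) do
  visits = visits + 1
end
print(visits)  --> 0

-- A membership test needs no loop: index the table directly.
local function find_one_letter_prefix(my_string)
  return prefixes[my_string] == true
end
print(find_one_letter_prefix('k'))   --> true
print(find_one_letter_prefix('ke'))  --> false

-- When every member must be visited (for example to build one pattern per
-- prefix), pairs iterates over the keys, in no particular order.
local count = 0
for prefix in pairs(prefixes) do
  count = count + 1
end
print(count)  --> 3
```

The same consideration applies to the second loop in `Inlines`, which also walks `prefixes` with `ipairs` and would silently stop matching after the table is converted to a set.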
Hello Mr. Tarleb,
with your help, I have finished writing and testing a filter that introduces a non-breakable space before or after specific strings. If I prepared an informative `README.md` and added a `makefile` and data to run tests, would you be interested in adding this filter to this repository?
I tried to follow the Lua code style recommendations and also added comments that should sufficiently clarify what I am doing (or want to do).
The final code of the filter follows:
Looking forward to your reply.
Regards, Tomas