Open makew0rld opened 4 years ago
If you run pandoc -t native on the docx file, you'll see that pandoc parses it as follows:
[Para [Str "Test",Space,Str "some",Space,Str "text"]
,BulletList
 [[BlockQuote
    [Para [Str "Bullet",Space,Str "point"]]]
 ,[BlockQuote
    [Para [Str "another",Space,Str "bullet"]]]
 ,[BlockQuote
    [Para [Str "more",Space,Str "text"]]]]
,Para [Str "more",Space,Str "text",Space,Str "afterward"]]
This structure isn't covered by your filter. It looks like our docx reader parses this as blockquotes because of the left indentation.
Thanks for looking into this. There are actually more steps than I mentioned. The document was created in Google Docs using the bullet list button, and then downloaded as a docx file. I don't have Word installed, but when I open the document in LibreOffice it shows up correctly as a bullet list, and it's not possible to de-indent the bullet points at all.
Is this a bug in Pandoc's docx reader then?
Also, do you think you could help construct a filter that produces the desired result? Thank you.
Yes, I'd say this is a problem with the docx reader's detection of block quotes, especially if Google Docs defaults to adding indentation to lists. @jkr what do you think? You can try posting on pandoc-discuss for filter help.
> I'd say this is a problem with the docx reader
Okay, good to know, thank you.
> Google Docs defaults to adding indentation to lists
I can confirm this is a specific Google Docs issue. I have added more bullet points to the docx file, but this time using the LibreOffice bullet button. You can download the new file here.
Here is the markdown output using the second filter:
Test some text
- Bullet point
- another bullet
- more text
more text afterward
- bullet point
- in libreoffice
- testing again
And here is the native output (again using the second filter):
[Para [Str "Test",Space,Str "some",Space,Str "text"]
,BulletList
 [[BlockQuote
    [Para [Str "Bullet",Space,Str "point"]]]
 ,[BlockQuote
    [Para [Str "another",Space,Str "bullet"]]]
 ,[BlockQuote
    [Para [Str "more",Space,Str "text"]]]]
,Para [Str "more",Space,Str "text",Space,Str "afterward"]
,BulletList
 [[Plain [Str "bullet",Space,Str "point"]]
 ,[Plain [Str "in",Space,Str "libreoffice"]]
 ,[Plain [Str "testing",Space,Str "again"]]]]
It appears that LibreOffice is adding bullet points correctly, but Google Docs is putting them in a block quote.
By the way, here's the code that treats indentation as a block quote:
      | Just left <- indentation pPr >>= leftParIndent -> do
          let pPr' = pPr { indentation = Nothing }
              hang = fromMaybe 0 $ indentation pPr >>= hangingParIndent
          transform <- parStyleToTransform pPr'
          return $ if (left - hang) > 0
                     then blockQuote . transform
                     else transform
      | otherwise -> return id
This is in Text.Pandoc.Readers.Docx, at line 534. It would be quite easy to remove this clause, and I'm tempted to do that, but it would be good to hear from @jkr, who may have had good reasons for putting this there. Without this clause, BlockQuote would be triggered only by Quote or BlockQuote styles.
Thanks, would be happy to see this removed if it's appropriate. In the meantime I'll try and figure out a filter.
For anyone else who finds this thread, I've updated the second filter I mentioned in my original comment to handle this as well.
-- Source: https://stackoverflow.com/a/57943159/7361270
-- Modified by makeworld

-- Iterate over all blocks in an item, converting 'top-level'
-- Para into Plain blocks.
function compactifyItem (blocks)
  -- step through the list of blocks step-by-step, keeping track of the
  -- element's index in the list in variable `i`, and assign the current
  -- block to `blk`.
  --
  for i, blk in ipairs(blocks) do
    if blk.t == 'Para' then
      -- update in item's block list.
      blocks[i] = pandoc.Plain(blk.content)
    elseif blk.t == 'BlockQuote' then
      -- It's a Google Doc thing, where each bullet is in a blockquote
      -- https://github.com/jgm/pandoc/issues/6824
      blocks[i] = pandoc.Plain(blk.content[1].content)
    end
  end
  return blocks
end

function compactifyList (l)
  -- l.content is an instance of pandoc.List, so the following is equivalent
  -- to pandoc.List.map(l.content, compactifyItem)
  l.content = l.content:map(compactifyItem)
  return l
end

return {{
  BulletList = compactifyList,
  OrderedList = compactifyList
}}
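A note on the BlockQuote branch above: blk.content[1].content assumes every block quote inside a list item holds exactly one paragraph, so a quote with several blocks would silently lose everything after the first. Below is a minimal sketch of a more defensive variant, not a tested replacement. It unwraps a BlockQuote only when it contains a single Para or Plain and leaves anything else untouched. The helper names compactify_item and compactify_list are made up for this sketch. It also relies on pandoc collecting top-level functions named after element types when the script returns nothing, which is my understanding of the Lua filter documentation; returning a table as in the filter above should work equally well.

```lua
-- Sketch only: helper names are hypothetical, behaviour otherwise
-- mirrors the filter above.

-- Convert top-level Para blocks of a list item to Plain, and unwrap a
-- BlockQuote only if it holds exactly one Para or Plain block.
local function compactify_item(blocks)
  for i, blk in ipairs(blocks) do
    if blk.t == 'BlockQuote' and #blk.content == 1 then
      local inner = blk.content[1]
      if inner.t == 'Para' or inner.t == 'Plain' then
        blocks[i] = pandoc.Plain(inner.content)
      end
      -- any other single child (e.g. a nested list) is left as it is
    elseif blk.t == 'Para' then
      blocks[i] = pandoc.Plain(blk.content)
    end
  end
  return blocks
end

-- Apply compactify_item to every item and assign the result back, so the
-- change is kept even if l.content hands out a copy of the item list.
local function compactify_list(l)
  local items = l.content
  for i, item in ipairs(items) do
    items[i] = compactify_item(item)
  end
  l.content = items
  return l
end

-- Global functions named after element types; assumed to be picked up
-- by pandoc because this script returns no filter table.
function BulletList(l) return compactify_list(l) end
function OrderedList(l) return compactify_list(l) end
```

A block quote with more than one block stays a block quote in the output, which at least makes the unexpected structure visible instead of dropping text.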
I was previously using a Lua filter to make sure Pandoc always outputted compact lists (source):
That no longer works. I also tried this filter (source):
This did not work either. Is this a bug in Pandoc, or has the API changed? Help getting this to work would be appreciated.
Reproducing
Here is an input docx file. Here is the command I run and output:
The output I am expecting: