jgm / pandoc

Universal markup converter
https://pandoc.org

Filters that would compactify markdown list output no longer work #6824

Open makew0rld opened 4 years ago

makew0rld commented 4 years ago
pandoc 2.11
Compiled with pandoc-types 1.22, texmath 0.12.0.3, skylighting 0.10

I was previously using a Lua filter to make sure Pandoc always produced compact lists (source):

local List = require 'pandoc.List'

function compactifyItem (blocks)
  return (#blocks == 1 and blocks[1].t == 'Para')
    and {pandoc.Plain(blocks[1].content)}
    or blocks
end

function compactifyList (l)
  l.content = List.map(l.content, compactifyItem)
  return l
end

return {{
    BulletList = compactifyList,
    OrderedList = compactifyList
}}

That no longer works. I also tried this filter (source):

--- Iterate over all blocks in an item, converting 'top-level'
-- Para into Plain blocks.
function compactifyItem (blocks)
  -- step through the list of blocks step-by-step, keeping track of the
  -- element's index in the list in variable `i`, and assign the current
  -- block to `blk`.
  -- 
  for i, blk in ipairs(blocks) do
    if blk.t == 'Para' then
      -- update in item's block list.
      blocks[i] = pandoc.Plain(blk.content)
    end
  end
  return blocks
end

function compactifyList (l)
  -- l.content is an instance of pandoc.List, so the following is equivalent
  -- to pandoc.List.map(l.content, compactifyItem)
  l.content = l.content:map(compactifyItem)
  return l
end

return {{
    BulletList = compactifyList,
    OrderedList = compactifyList
}}

This did not work either. Is this a bug in Pandoc, or has the API changed? Help getting this to work would be appreciated.

Reproducing

Here is an input docx file. Here is the command I ran and its output:

➤ pandoc --lua-filter list_filter.lua file.docx -t markdown
Test some text

-   Bullet point

-   another bullet

-   more text

more text afterward

The output I am expecting:

Test some text

-   Bullet point
-   another bullet
-   more text

more text afterward

jgm commented 4 years ago

If you do pandoc -t native on the docx file, you'll see that pandoc parses it as follows:

[Para [Str "Test",Space,Str "some",Space,Str "text"]
,BulletList
 [[BlockQuote
   [Para [Str "Bullet",Space,Str "point"]]]
 ,[BlockQuote
   [Para [Str "another",Space,Str "bullet"]]]
 ,[BlockQuote
   [Para [Str "more",Space,Str "text"]]]]
,Para [Str "more",Space,Str "text",Space,Str "afterward"]]

This structure isn't covered by your filter. It looks like our docx reader parses this as blockquotes because of the left indentation.
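
One way to check which block types a filter actually receives is to log them per list item. The following is a minimal diagnostic sketch, not one of the filters from this thread; the helper name `itemTags` is hypothetical. It relies only on the `t` and `content` fields that the filters above already use.

```lua
-- Diagnostic sketch: report the block types found in each list item.
-- `itemTags` is a hypothetical helper name, not a pandoc API.
local function itemTags(list)
  local lines = {}
  for i, item in ipairs(list.content) do
    local tags = {}
    for _, blk in ipairs(item) do
      tags[#tags + 1] = blk.t
    end
    lines[#lines + 1] = ('item %d: %s'):format(i, table.concat(tags, ', '))
  end
  return lines
end

-- When the script returns no filter table, pandoc picks up global functions
-- named after element types. Returning nothing leaves the list unchanged.
function BulletList(list)
  for _, line in ipairs(itemTags(list)) do
    io.stderr:write(line, '\n')
  end
end
```

Run with `--lua-filter` on the Google Docs file, this should report a single BlockQuote per item, which is why a check for `blocks[1].t == 'Para'` never matches.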

makew0rld commented 4 years ago

Thanks for looking into this. There are actually more steps than I mentioned. The document was created in Google Docs using the bullet list button, and then downloaded as a docx file. I don't have Word installed, but when I open the document in LibreOffice it shows up correctly as a bullet list, and it's not possible to de-indent the bullet points at all.

Is this a bug in Pandoc's docx reader then?

Also, do you think you could help construct a filter that produces the desired result? Thank you.

jgm commented 4 years ago

Yes, I'd say this is a problem with the docx reader's detection of block quotes, especially if Google Docs defaults to adding indentation to lists. @jkr what do you think? You can try posting on pandoc-discuss for filter help.

makew0rld commented 4 years ago

> I'd say this is a problem with the docx reader

Okay, good to know, thank you.

> Google Docs defaults to adding indentation to lists

I can confirm this is specific to Google Docs. I have added more bullet points to the docx file, but this time using the LibreOffice bullet button. You can download the new file here.

Here is the markdown output using the second filter:

Test some text

-   Bullet point

-   another bullet

-   more text

more text afterward

-   bullet point
-   in libreoffice
-   testing again

And here is the native output (again using the second filter):

[Para [Str "Test",Space,Str "some",Space,Str "text"]
,BulletList
 [[BlockQuote
   [Para [Str "Bullet",Space,Str "point"]]]
 ,[BlockQuote
   [Para [Str "another",Space,Str "bullet"]]]
 ,[BlockQuote
   [Para [Str "more",Space,Str "text"]]]]
,Para [Str "more",Space,Str "text",Space,Str "afterward"]
,BulletList
 [[Plain [Str "bullet",Space,Str "point"]]
 ,[Plain [Str "in",Space,Str "libreoffice"]]
 ,[Plain [Str "testing",Space,Str "again"]]]]

It appears that LibreOffice is adding bullet points correctly, but Google Docs is putting them in a block quote.

jgm commented 4 years ago

By the way, here's the code that treats indentation as a block quote:

    | Just left <- indentation pPr >>= leftParIndent -> do
        let pPr' = pPr { indentation = Nothing }
            hang = fromMaybe 0 $ indentation pPr >>= hangingParIndent
        transform <- parStyleToTransform pPr'
        return $ if (left - hang) > 0
                 then blockQuote . transform
                 else transform
    | otherwise -> return id

This is in Text.Pandoc.Readers.Docx, at line 534. It would be quite easy to remove this clause, and I'm tempted to do that, but it would be good to hear from @jkr, who may have had good reasons for putting it there. Without this clause, BlockQuote would be triggered only by Quote or BlockQuote styles.
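
For readers less familiar with Haskell, the condition in that clause can be restated as a small Lua function. This is an illustrative sketch that models only the clause quoted above, not the rest of the reader; the function name and the indent values are hypothetical.

```lua
-- Sketch of the decision made by the quoted clause, for illustration only.
-- `left` and `hanging` stand for a paragraph's left and hanging indents;
-- nil means the value is not set.
local function indentTriggersBlockQuote(left, hanging)
  if left == nil then
    return false              -- no left indent: the clause does not apply
  end
  local hang = hanging or 0   -- mirrors `fromMaybe 0` in the Haskell code
  return (left - hang) > 0
end

-- Hypothetical indent values, chosen only to exercise each branch.
print(indentTriggersBlockQuote(720, 360))  -- true
print(indentTriggersBlockQuote(360, 360))  -- false
print(indentTriggersBlockQuote(720, nil))  -- true
print(indentTriggersBlockQuote(nil, 360))  -- false
```

Under this rule, any paragraph whose left indent exceeds its hanging indent is wrapped in a block quote, whether or not it belongs to a list.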

makew0rld commented 4 years ago

Thanks, would be happy to see this removed if it's appropriate. In the meantime I'll try and figure out a filter.

makew0rld commented 4 years ago

For anyone else who finds this thread, I've updated the second filter I mentioned in my original comment to handle this as well.

-- Source: https://stackoverflow.com/a/57943159/7361270
-- Modified by makeworld

-- Iterate over all blocks in an item, converting 'top-level'
-- Para into Plain blocks.
function compactifyItem (blocks)
  -- step through the list of blocks step-by-step, keeping track of the
  -- element's index in the list in variable `i`, and assign the current
  -- block to `blk`.
  -- 
  for i, blk in ipairs(blocks) do
    if blk.t == 'Para' then
      -- update in item's block list.
      blocks[i] = pandoc.Plain(blk.content)
    elseif blk.t == 'BlockQuote' then
      -- It's a Google Doc thing, where each bullet is in a blockquote
      -- https://github.com/jgm/pandoc/issues/6824
      blocks[i] = pandoc.Plain(blk.content[1].content)
    end
  end
  return blocks
end

function compactifyList (l)
  -- l.content is an instance of pandoc.List, so the following is equivalent
  -- to pandoc.List.map(l.content, compactifyItem)
  l.content = l.content:map(compactifyItem)
  return l
end

return {{
    BulletList = compactifyList,
    OrderedList = compactifyList
}}
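
The BlockQuote branch above assumes that every quote inside a list item holds one paragraph-like block, which matches the Google Docs output shown earlier. A slightly more defensive sketch unwraps a quote only when it contains exactly one Para or Plain, and leaves every other block untouched. The helper name `compactBlock` is hypothetical, and this variant has not been checked against the files in this thread.

```lua
-- Defensive variant (a sketch, not a tested replacement for the filter above).
-- `compactBlock` is a hypothetical helper name.
local function compactBlock(blk)
  if blk.t == 'Para' then
    return pandoc.Plain(blk.content)
  end
  if blk.t == 'BlockQuote' and #blk.content == 1 then
    local inner = blk.content[1]
    if inner.t == 'Para' or inner.t == 'Plain' then
      -- Lossless case: the quote wraps a single paragraph.
      return pandoc.Plain(inner.content)
    end
  end
  return blk  -- anything else is kept as-is
end

local function compactifyList(l)
  local items = {}
  for i, item in ipairs(l.content) do
    local blocks = {}
    for j, blk in ipairs(item) do
      blocks[j] = compactBlock(blk)
    end
    items[i] = blocks
  end
  l.content = items
  return l
end

-- Global functions named after element types act as the filter when the
-- script does not return a filter table.
BulletList = compactifyList
OrderedList = compactifyList
```

The trade-off is that a quote with several blocks, or with a non-paragraph block such as a code block, stays a block quote, so such a list may still come out loose instead of raising an error or silently dropping content.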