jgm / citeproc

CSL citation processing library in Haskell
BSD 2-Clause "Simplified" License

Multiple bibliographies #5

Open denismaier opened 3 years ago

denismaier commented 3 years ago

Now that the new citeproc library is more tightly integrated into pandoc proper, would that open the way to new features like multiple bibliographies? I'm thinking of having one bibliography with primary sources and another with secondary sources, or per-section bibliographies, etc.

jgm commented 3 years ago

We could think about this, though it's not a priority right now. How would you determine which item goes into which bibliography?

denismaier commented 3 years ago

Yes. Certainly not a top priority at the moment. But it's a much requested feature.

How would that work? At a most basic level perhaps some filters to exclude/include items:

:::refs include:type=book :::

These filters could also be defined in the metadata.

For per-section bibliographies you'll need a mechanism to define the segments, or reset the bibliography at certain points.

denismaier commented 3 years ago

Defining bibliographies in the metadata could look like this:

bibliography-filters:
  - id: bibliographyA 
    exclude: 
      - type: book
  - id: bibliographyB
    include: 
      - type: book

In the document, you might call this with:

::: {#refs .bibliographyB}
:::

Or:

::: {#refs filtered=bibliographyB}
:::
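
A minimal sketch of how such include/exclude filters might be applied, written as plain Lua with no pandoc dependency. The entry tables, the helper names `matches_condition` and `select_entries`, and the rule that an entry must match every `include` condition and no `exclude` condition are all assumptions made for illustration; nothing like this exists in pandoc or citeproc.

```lua
-- Hypothetical helper: true if `entry` has every field/value pair in `cond`.
local function matches_condition(entry, cond)
  for field, value in pairs(cond) do
    if entry[field] ~= value then
      return false
    end
  end
  return true
end

-- Hypothetical helper: keep the entries that match every `include`
-- condition and none of the `exclude` conditions of one filter.
local function select_entries(entries, filter)
  local selected = {}
  for _, entry in ipairs(entries) do
    local keep = true
    for _, cond in ipairs(filter.include or {}) do
      if not matches_condition(entry, cond) then keep = false end
    end
    for _, cond in ipairs(filter.exclude or {}) do
      if matches_condition(entry, cond) then keep = false end
    end
    if keep then
      selected[#selected + 1] = entry
    end
  end
  return selected
end

-- Made-up entries, filtered the way the metadata example describes.
local entries = {
  { id = "smith2001", type = "book" },
  { id = "jones2010", type = "article-journal" },
}
local bibliographyA = select_entries(entries, { exclude = { { type = "book" } } })
local bibliographyB = select_entries(entries, { include = { { type = "book" } } })
```

Here `bibliographyA` ends up with the non-book entry and `bibliographyB` with the book.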

jgm commented 3 years ago

I like the suggestion at https://github.com/jgm/pandoc-citeproc/issues/89 to use the existing keyword field. That would be simple and flexible. Maybe this could be done entirely in the body of the document:

::: {#refs1 .csl-bib-body keywords="primary,aristotle"}
:::

Instead of looking for #refs, pandoc could look for class csl-bib-body and apply the filter.

Maybe one could even allow boolean operators here:

keywords="ancient AND NOT (aristotle OR plato)"
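
A rough sketch of how a boolean keyword expression like the one above could be evaluated against an entry's keywords, in plain Lua. The function name `keywords_match`, the operator precedence (NOT binds tighter than AND, which binds tighter than OR), and the assumption that keywords contain no spaces or parentheses are illustrative choices, not anything pandoc or citeproc defines.

```lua
-- Hypothetical helper: evaluate `expr` against a list of keywords using
-- a small recursive-descent parser.
local function keywords_match(expr, keywords)
  local present = {}
  for _, keyword in ipairs(keywords) do present[keyword] = true end

  -- Split into tokens: parentheses, operators, and keyword names.
  local tokens, pos = {}, 1
  for token in expr:gsub("([()])", " %1 "):gmatch("%S+") do
    tokens[#tokens + 1] = token
  end

  local parse_or -- forward declaration for the recursive grammar

  local function parse_primary()
    local token = tokens[pos]
    pos = pos + 1
    if token == "(" then
      local value = parse_or()
      assert(tokens[pos] == ")", "expected ')'")
      pos = pos + 1
      return value
    end
    assert(token and token ~= ")", "expected a keyword")
    return present[token] == true
  end

  local function parse_not()
    if tokens[pos] == "NOT" then
      pos = pos + 1
      return not parse_not()
    end
    return parse_primary()
  end

  local function parse_and()
    local value = parse_not()
    while tokens[pos] == "AND" do
      pos = pos + 1
      local rhs = parse_not() -- always parse the right side
      value = value and rhs
    end
    return value
  end

  function parse_or()
    local value = parse_and()
    while tokens[pos] == "OR" do
      pos = pos + 1
      local rhs = parse_and()
      value = value or rhs
    end
    return value
  end

  local result = parse_or()
  assert(pos > #tokens, "unexpected trailing token")
  return result
end

local expr = "ancient AND NOT (aristotle OR plato)"
local keep = keywords_match(expr, { "ancient", "epicurus" })
local drop = keywords_match(expr, { "ancient", "plato" })
```

With these made-up keyword lists, `keep` is true and `drop` is false.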

jgm commented 3 years ago

Note: if this were to be done in pandoc itself, then pandoc-citeproc would have to return the keywords along with the bib entries. Alternatively, the sorting could be done in citeproc; in this case pandoc would have to pass it the keywords and bib ids.

denismaier commented 3 years ago

I like that suggestion. I just don't think you should limit it to the keywords field. Users might want to filter on arbitrary criteria: author x, books before 1900, and so on. Why not pass a criterion or filter to citeproc, where different filters could be specified? That would be extensible. keywords would be a good start.

denismaier commented 3 years ago

What do you think about this now a couple of months later?

nsheff commented 3 years ago

I have a use case where I want per-section bibliographies, split references into 1) those cited in the main text; and 2) those cited in supplemental text. The complete document is just the two parts put together: 1) main and 2) supplement. I want one references section for references cited in the main text, which shows up right after the main text, and one with references cited in the supplementary material, which shows up at the end of the document.

A simple workaround is to build the two documents separately and then merge them. This works, but it makes cross-referencing figures and tables harder, because the main text doesn't have access to the supplemental labels and vice versa. So it would be much better to have a solution that splits the bibliographies according to the section in which each reference is cited.

In this use case, I'd only have a single .bib source file, but then I have 2 different references sections -- I believe this is the use case @denismaier refers to as "per-section bibliographies", which to me is distinct from a 'primary and secondary' sources type, which could both be intermingled throughout sections.

It seems to me that the section-based bibliographies could be much simpler to implement; all you'd need to do is have some kind of "flag" that switches to the next bibliography; and as you parse the markdown, when you reach that flag, then from that point as you start to accumulate citations, they are added to the next reference list.
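
The flag idea can be sketched in plain Lua, independent of pandoc. The item tables (`cite`, `flag`, `text` fields) and the helper name `collect_by_flag` are hypothetical stand-ins for document elements; a real filter would walk pandoc's AST instead.

```lua
-- Hypothetical helper: walk the items in document order and collect
-- citation keys into one list per segment. Each flag closes the current
-- segment and starts the next one. A key is listed once per segment.
local function collect_by_flag(items)
  local segments = { {} }
  local seen = {}
  for _, item in ipairs(items) do
    if item.flag then
      segments[#segments + 1] = {}
      seen = {}
    elseif item.cite and not seen[item.cite] then
      seen[item.cite] = true
      local current = segments[#segments]
      current[#current + 1] = item.cite
    end
  end
  return segments
end

-- Made-up document: main text, a flag, then supplemental text.
local segments = collect_by_flag({
  { cite = "doe2019" }, { text = "main text" }, { cite = "doe2019" },
  { flag = true },
  { cite = "roe2020" }, { cite = "doe2019" },
})
```

In this example the first segment holds only `doe2019`, and the second holds `roe2020` followed by `doe2019`.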

I think this could be accomplished pretty easily with a Lua filter by someone with experience. I have started down that path a bit, but I'm just too unfamiliar with Lua syntax and pandoc objects to make it work.

denismaier commented 3 years ago

> I think this could be accomplished pretty easily with a lua filter for someone with experience. I have started down that path a bit but I'm just too unfamiliar with lua syntax and pandoc objects to make it work.

@nsheff You mean such as this one?

nsheff commented 3 years ago

> @nsheff You mean such as this one?

Close, but that filter forces a bibliography at the end of every section, so it didn't work for my purpose. But I was able to adapt the idea to a new filter I just posted here: https://github.com/databio/sciquill/tree/master/pandoc_filters/multi-refs

In my version, the user has control over where the bibliographies go -- and it uses a different approach that goes through the document accumulating citations until it reaches a user flag (<div id='multi-refs'></div>) at which point, it produces a bibliography of everything cited up to that point in the document. So, it's no longer tied to a particular section header level. I also made it so that the bibliographies retain the original numbering.

So, thanks for sharing, that was what I needed! Any pointers on my filter are appreciated.

aubertc commented 1 year ago

The multiple-bibliographies Lua filter can do some of that, but unfortunately the separation is made by files, not by keywords…

zmbc commented 3 months ago

In case it's helpful to anyone: the multi-refs filter above misbehaved for me in strange ways (for example, if citations were shared across bibliographies, they were no longer numbered consecutively within a bibliography). I noticed that section-bibliographies (a newer version of the bibliography-per-section filter referenced above) has since moved to running citeproc separately on each section, which seemed like a much better design.

Here's what I came up with, cobbling together some bits from the above. It requires that <div class='multi-refs'></div> be at the top level of the document (not within any divs), but that is how things were by default for me, using Pandoc to convert a Markdown document. YMMV. (I'm using this with a very old version of Pandoc.)

local utils = require 'pandoc.utils'
local run_json_filter = utils.run_json_filter

-- This works on newer Pandoc versions but doesn't on pandoc 2.2.3.2
-- local function run_citeproc (doc)
--   if PANDOC_VERSION >= '2.19.1' then
--     return pandoc.utils.citeproc(doc)
--   elseif PANDOC_VERSION >= '2.11' then
--     local args = {'--from=json', '--to=json', '--citeproc'}
--     return run_json_filter(doc, 'pandoc', args)
--   else
--     return run_json_filter(doc, 'pandoc-citeproc', {FORMAT, '-q'})
--   end
-- end

local function run_citeproc (doc)
  return run_json_filter(doc, 'pandoc-citeproc')
end

--- Filter to remove the references div and bibliography header added by
--- a previous pandoc-citeproc run.
local remove_pandoc_citeproc_results = {
  Header = function (header)
    return header.identifier == 'bibliography'
      and {}
      or nil
  end,
  Div = function (div)
    return div.identifier == 'refs'
      and {}
      or nil
  end
}

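-- Returns true if the list contains the element (adapted from Stack Overflow).
local function contains (list, element)
  for _, value in ipairs(list) do
    if value == element then
      return true
    end
  end
  return false
end

--- Runs citeproc separately on each run of top-level blocks that ends at a
--- div with class 'multi-refs', so that each run gets its own bibliography.
--- Blocks after the last such div are passed through without running citeproc.
local function create_bibliographies (doc)
  local blocks = {}      -- output blocks, already processed
  local new_blocks = {}  -- blocks collected since the last 'multi-refs' div
  for _, block_data in ipairs(doc.blocks) do
    if block_data.attr and block_data.attr.classes and contains(block_data.attr.classes, "multi-refs") then
      local tmp_doc = pandoc.Pandoc(new_blocks, doc.meta)
      local new_doc = run_citeproc(tmp_doc)
      for _, block_to_add in ipairs(new_doc.blocks) do
        blocks[#blocks+1] = block_to_add
      end
      new_blocks = {}
    else
      new_blocks[#new_blocks+1] = block_data
    end
  end
  for _, new_block in ipairs(new_blocks) do
    blocks[#blocks+1] = new_block
  end
  return pandoc.Pandoc(blocks, doc.meta)
end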

return {
  -- remove result of previous pandoc-citeproc run (for backwards
  -- compatibility)
  remove_pandoc_citeproc_results,
  {Pandoc = create_bibliographies},
}

nsheff commented 3 months ago

> the multi-refs filter above didn't work for me in strange ways (for example, if citations were shared across bibliographies, they would no longer be numbered in consecutive order within a bibliography)

Can you explain this in more detail? Did you use the multiref_no_duplicates:true flag?

zmbc commented 3 months ago

It turns out I was having several unrelated issues: two bugs in the multi-refs filter (https://github.com/databio/sciquill/issues/12, https://github.com/databio/sciquill/issues/11), and a deep misunderstanding of the expected behavior.

The multi-refs filter essentially splits a bibliography between sections. See for example the sample PDF output here: https://github.com/databio/sciquill/blob/master/pandoc_filters/multi-refs/sample.pdf. The second bibliography starts with reference number 4, since it is a continuation of the first. In a Word doc, it looks like:

What I wanted was an entirely separate bibliography, like this:

The simpler filter I posted above achieves this.
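
The difference between the two behaviors can be modeled in a few lines of plain Lua. This is only an illustration of the numbering described above, not code from either filter; the helper name `number_sections`, the `continue` parameter, and the sample keys are made up, and each section's key list is assumed to contain no repeats.

```lua
-- Hypothetical model: assign reference numbers to the keys cited in each
-- section.
--   continue = true:  one counter runs through the whole document and a key
--                     keeps the number it got on first use (split behavior).
--   continue = false: every section is numbered from 1 (separate behavior).
local function number_sections(sections, continue)
  local result = {}
  local assigned, counter = {}, 0
  for i, keys in ipairs(sections) do
    if not continue then
      assigned, counter = {}, 0
    end
    local numbers = {}
    for _, key in ipairs(keys) do
      if not assigned[key] then
        counter = counter + 1
        assigned[key] = counter
      end
      numbers[#numbers + 1] = { key = key, number = assigned[key] }
    end
    result[i] = numbers
  end
  return result
end

-- Three references in the first section; the second section cites one new
-- reference and one shared with the first.
local sections = { { "a", "b", "c" }, { "d", "a" } }
local split = number_sections(sections, true)
local separate = number_sections(sections, false)
```

In the split model the second list starts at 4 and the shared key `a` keeps number 1, so the list is not consecutive. In the separate model the second list is numbered 1 and 2.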