typst / biblatex

A Rust crate for parsing and writing BibTeX and BibLaTeX files.
Apache License 2.0

feat: add support for multiline author field #34

Closed yuxqiu closed 11 months ago

yuxqiu commented 1 year ago

Related Issue

Try to fix #28.

What are the changes/fixes in this PR?

This PR adds support for multiline author field.

Page 16 of the BibLaTeX manual states:

Name lists are parsed and split up into the individual items at the "and" delimiter.

However, the old implementation assumes that only <space>and<space> is a valid delimiter, which causes the problem in the referenced issue. The correct approach is to consider a split valid only if the character immediately before "and" and the character immediately after it are both whitespace (either ASCII or Unicode whitespace).

Leading or Trailing "and"

When "and" is encountered at the beginning or end of a person's name and no split is possible there (it lacks whitespace on one side), it should be considered part of the name.

Consecutive "and"

When multiple consecutive "and" keywords are separated by whitespace, we should treat the name between each pair of them as empty.
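The splitting rule and the two edge cases above can be sketched in Rust as follows. This is a simplified illustration, not the code in this PR: split_names is a hypothetical helper that works on a plain &str, whereas the crate operates on parsed chunks, and it ignores braces and verbatim content entirely.

```rust
/// Split a name list at every "and" that has whitespace on both sides.
/// Hypothetical helper for illustration; not part of the crate's API.
fn split_names(input: &str) -> Vec<String> {
    const KEYWORD: &str = "and";
    let mut names = Vec::new();
    let mut start = 0; // byte offset where the current name begins
    let mut search = 0; // byte offset where the next keyword search begins

    while let Some(offset) = input[search..].find(KEYWORD) {
        let pos = search + offset;
        let end = pos + KEYWORD.len();

        // A split is valid only if whitespace (ASCII or Unicode) sits
        // directly before and directly after the keyword.
        let ws_before = input[..pos]
            .chars()
            .next_back()
            .map_or(false, char::is_whitespace);
        let ws_after = input[end..]
            .chars()
            .next()
            .map_or(false, char::is_whitespace);

        if ws_before && ws_after {
            names.push(input[start..pos].trim().to_string());
            start = end;
        }
        // Otherwise the keyword stays part of the current name.
        search = end;
    }

    names.push(input[start..].trim().to_string());
    names
}
```

With this sketch, split_names("Gamma, Erich and and Helm, Richard") yields "Gamma, Erich", an empty name, and "Helm, Richard", while a leading "and" with nothing before it stays attached to the first name.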

Experiment

I tested the above claims in LaTeX with the following files, using the bib style file downloaded from here:

\documentclass[a4paper,10pt]{article}

\begin{document}

\cite{gamma1995design}

\cite{linebreak}

\cite{leadingand}

\cite{trailingand}

\cite{middleand}

\cite{andand}

\cite{andandand}

\bibliographystyle{IEEEtran}
\bibliography{cite}
\end{document}

@book{gamma1995design,
  title     = {Design patterns: elements of reusable object-oriented software},
  author    = {Gamma, Erich and Helm, Richard and Johnson, Ralph and Vlissides, John M.},
  year      = {1995},
}

@book{linebreak,
  title     = {Design patterns: elements of reusable object-oriented software},
  author    = {Gamma, Erich
               and
               Helm, Richard and
               Johnson, Ralph},
  year      = {1995},
}

@book{leadingand,
  title     = {Design patterns: elements of reusable object-oriented software},
  author    = {and Gamma, Erich and Helm, Richard and Johnson, Ralph and Vlissides, John M.},
  year      = {1995},
}

@book{trailingand,
  title     = {Design patterns: elements of reusable object-oriented software},
  author    = {Gamma, Erich and Helm, Richard and Johnson, Ralph and Vlissides, John M. and},
  year      = {1995},
}

@book{middleand,
  title     = {Design patterns: elements of reusable object-oriented software},
  author    = {Gamma, Erich and Hello and Helm, Richard and Johnson, Ralph and Vlissides, John M.},
  year      = {1995},
}

@book{andand,
  title     = {Design patterns: elements of reusable object-oriented software},
  author    = {Gamma, Erich and and Helm, Richard and Johnson, Ralph and Vlissides, John M.},
  year      = {1995},
}

@book{andandand,
  title     = {Design patterns: elements of reusable object-oriented software},
  author    = {Gamma, Erich and and and Helm, Richard and Johnson, Ralph and Vlissides, John M.},
  year      = {1995},
}

I get the following outputs:

[image: rendered bibliography output]

yuxqiu commented 1 year ago

Some things to decide

How should we handle strings such as A and {<some whitespace here>} and B?

I think it makes sense to treat this as undefined behavior, even though biblatex treats it as a name consisting of a blank character.

@book{emptystring,
  title     = {Design patterns: elements of reusable object-oriented software},
  author    = {Gamma, Erich and { } and Helm, Richard and Johnson, Ralph and Vlissides, John M.},
  year      = {1995},
}

Generated .bbl file:

\bibitem{emptystring}
E.~Gamma, { }, R.~Helm, R.~Johnson, and J.~M. Vlissides, \emph{Design patterns: elements of reusable object-oriented software}, 1995.

The current implementation of Person::parse treats it as an empty name.


(Not directly related to this PR) How should we handle multiple consecutive whitespace characters in a verbatim chunk?

The biblatex package treats consecutive blank characters as a single blank character. However, the current implementation of this crate retains all characters in a verbatim chunk. This is why the title is rendered incorrectly, as shown in https://github.com/typst/hayagriva/issues/47.
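A minimal sketch of the collapsing behavior described above, assuming that every run of whitespace should become one ASCII space. collapse_whitespace is a hypothetical helper, not an existing function in this crate, and whether leading and trailing runs should also be dropped is left open here.

```rust
/// Replace every run of whitespace characters with a single space.
/// Hypothetical helper for illustration; not part of the crate's API.
fn collapse_whitespace(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut prev_was_whitespace = false;

    for c in input.chars() {
        if c.is_whitespace() {
            // Emit one space for the first character of a run only.
            if !prev_was_whitespace {
                out.push(' ');
            }
            prev_was_whitespace = true;
        } else {
            out.push(c);
            prev_was_whitespace = false;
        }
    }

    out
}
```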

reknih commented 1 year ago

Thank you for your PR! I think it makes sense to treat verbatim blocks with whitespace between ands as UD (because it isn't defined anywhere afaik). The multiple whitespace characters should likely be collapsed if that's what biblatex/biber does.

yuxqiu commented 1 year ago

Thank you for sharing your opinion! I think the verbatim issue should be investigated in detail and fixed in another PR later. So, this PR is now ready for review.

yuxqiu commented 1 year ago

Mysteriously, CI keeps failing at date parsing (I haven't changed this part of the code). On my computer, all tests pass successfully.


Edit

This may be related to the new version of the chrono crate that was released a few hours ago. We should add Cargo.lock to the repository.

yuxqiu commented 1 year ago

I will try to fix a small error later, so I have converted this PR to a draft for now.

reknih commented 1 year ago

I skimmed your code and reworded a few comments. Looking forward to your fix!

yuxqiu commented 1 year ago

I chose to revert to the previous splitting strategy (splitting by the keyword and then checking the surrounding characters) instead of splitting by whitespace and then checking whether each token equals the keyword. The latter made the new verbatim test fail, as it could not cleanly handle the whitespace between verbatim chunks.

What I do now is keep all the whitespace between fields and trim only the beginning and end of the data in the latest vector before pushing it to the output (out). I think this makes the function more general-purpose, and it leaves the responsibility for handling non-leading and non-trailing whitespace to the call site.
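The trimming step might look roughly like the sketch below. The Chunk enum and trim_ends are hypothetical stand-ins defined only for this example, and it assumes that only normal (non-verbatim) chunks are trimmed; the actual types and behavior in the PR may differ.

```rust
/// Simplified stand-in for the crate's chunk type (assumption for this sketch).
#[derive(Debug, Clone, PartialEq)]
enum Chunk {
    Normal(String),
    Verbatim(String),
}

/// Trim whitespace only at the very start and very end of one split item.
/// Whitespace inside the item, including between chunks, is kept as-is.
fn trim_ends(mut chunks: Vec<Chunk>) -> Vec<Chunk> {
    // Leading whitespace can only live in the first chunk.
    if let Some(Chunk::Normal(s)) = chunks.first_mut() {
        *s = s.trim_start().to_string();
    }
    // Trailing whitespace can only live in the last chunk.
    if let Some(Chunk::Normal(s)) = chunks.last_mut() {
        *s = s.trim_end().to_string();
    }
    chunks
}
```

Because interior whitespace is untouched, a caller that wants it collapsed or normalized has to do so itself, which matches the division of responsibility described above.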