plk / biber

Backend processor for BibLaTeX
Artistic License 2.0

Years with fewer than four digits don't sort correctly #228

Closed moewew closed 6 years ago

moewew commented 6 years ago

In the following MWE, it seems that years are padded from the right for sorting rather than from the left:

\documentclass[american]{article}
\usepackage{babel}
\usepackage{filecontents}
\usepackage[
  backend = biber,
  style = authoryear,
]{biblatex}
\addbibresource{\jobname.bib}
\begin{filecontents}{\jobname.bib}
@book{de:re:publica:0060,
  author  = {Cicero},
  title      = {De re publica -- 60},
  date     = {0060},
}
@book{de:re:publica:0010,
  author  = {Cicero},
  title      = {De re publica -- 10},
  date     = {0010},
}
@book{de:re:publica:0300,
  author  = {Cicero},
  title      = {De re publica -- 300},
  date     = {0300},
}
@book{de:re:publica:0100,
  author  = {Cicero},
  title      = {De re publica -- 100},
  date     = {0100},
}
@book{de:re:duplica,
  author  = {Cicero},
  title      = {De re Duplica},
  date     = {2018},
}
\end{filecontents}

\begin{document}
\nocite{*}

\printbibliography
\end{document}

results in

Cicero (10). De re publica – 10.
— (100). De re publica – 100.
— (2018). De re Duplica.
— (300). De re publica – 300.
— (60). De re publica – 60.

I know that I could use

\DeclareSortingTemplate{nyt}{
  \sort{
    \field{presort}
  }
  \sort[final]{
    \field{sortkey}
  }
  \sort{
    \field{sortname}
    \field{author}
    \field{editor}
    \field{translator}
    \field{sorttitle}
    \field{title}
  }
  \sort{
    \field{sortyear}
    \field[padchar=0]{year}
  }
  \sort{
    \field{sorttitle}
    \field{title}
  }
  \sort{
    \field{volume}
    \literal{0}
  }
}

but somehow it feels weird that I would have to enable proper integer sorting for the year...

plk commented 6 years ago

Looks like a bug in the sorting key extractor code - seems to be sorting year as a string. Looking into it.

moewew commented 6 years ago

I see you have added a commit to fix this. Unfortunately, the change leads to unexpected results in some edge cases. sortyear should be given precedence over year; it should not live in a \sort section of its own.

\documentclass[british]{article}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{babel}
\usepackage{csquotes}

\usepackage[style=authoryear, backend=biber]{biblatex}

\usepackage{filecontents}
\begin{filecontents}{\jobname.bib}
@book{appleby,
  author   = {Humphrey Appleby},
  title    = {A Title},
  sortyear = {1980},
  date     = {1990},
}
@book{appleby:b,
  author   = {Humphrey Appleby},
  title    = {B Title},
  sortyear = {1980},
  date     = {1989},
}
\end{filecontents}

\addbibresource{\jobname.bib}

\iffalse
\DeclareSortingTemplate{nyt}{
  \sort{
    \field{presort}
  }
  \sort[final]{
    \field{sortkey}
  }
  \sort{
    \field{sortname}
    \field{author}
    \field{editor}
    \field{translator}
    \field{sorttitle}
    \field{title}
  }
  \sort{
    \field{sortyear}
  }
  \sort{
    \field{year}
  }
  \sort{
    \field{sorttitle}
    \field{title}
  }
  \sort{
    \field{volume}
    \literal{0}
  }
}
\fi

\begin{document}
\cite{appleby:b,appleby}
\printbibliography
\end{document}

As far as I can see, the problem is that sortyear is a literal and year an integer, so we can't meaningfully compare the two if both are around... Is that correct?

plk commented 6 years ago

It's not quite that. It's that the sorting data schema, which is needed to generate the internal data structures needed to construct the sortkey extraction and generation structures, is currently not generated per-key. It is very complicated to fix this in biber, as the particular field selected for a sort would need to be tracked per-entry to generate the correct sorting data schema. Thinking about it.

plk commented 6 years ago

It's surprising I didn't notice this before. However, it's really difficult to solve this. Sorting needs to know the datatypes of what it is comparing and this assumes that everything in a \sort is the same datatype (well, it really assumes that everything is either an integer or isn't). sortyear isn't necessarily an integer, by design. The construction of the sorting dataschema uses this assumption as a shortcut by detecting the data type of a \sort set by just looking at the first element and since sortyear is a literal, it sorts year as a literal too. This is more than a biber issue, it's a sorting algorithm problem. If sortyear were an integer datatype in the datamodel, it would be fine - what do you think about that solution?
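
A minimal sketch of what that shortcut implies for a template, assuming the behaviour is exactly as described above: if the year \sort contains only year, its first (and only) element is an integer field, so that set should be compared numerically. The template name nytintyear is made up for illustration, and the price is that sortyear is no longer consulted at all.

```latex
% Hypothetical variant of nyt in which the year \sort holds only an
% integer field, so datatype detection by first element sees an integer.
\DeclareSortingTemplate{nytintyear}{
  \sort{
    \field{presort}
  }
  \sort[final]{
    \field{sortkey}
  }
  \sort{
    \field{sortname}
    \field{author}
    \field{editor}
    \field{translator}
    \field{sorttitle}
    \field{title}
  }
  \sort{
    % sortyear is deliberately left out here
    \field{year}
  }
  \sort{
    \field{sorttitle}
    \field{title}
  }
  \sort{
    \field{volume}
    \literal{0}
  }
}
```

It would be selected with sorting=nytintyear in the biblatex package options.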

moewew commented 6 years ago

People (and in fact we - as in biblatex-examples.bib) use things like sortyear = {1984-0} all the time. That would have to continue to work, so it seems tricky to make sortyear an integer...

plk commented 6 years ago

True. We have to change something though as year is more important than sortyear. This is rather intractable - we have to compare either numerically or alphabetically for each \sort and currently neither option works and never really can unless we can guarantee the data type by the data model or the sorting spec ...

moewew commented 6 years ago

We could just pad the year with zeros automatically and hope for the best. This should work for positive years, not sure about negative years...
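
For reference, the padding idea can be written out with biblatex's explicit padding options. This is only a sketch of the year \sort block inside a \DeclareSortingTemplate; it relies on string comparison of the padded values, so it says nothing useful about negative years.

```latex
\sort{
  % sortyear is used as written; it is not padded here
  \field{sortyear}
  % Left-pad year to four characters with zeros: 60 -> 0060, 300 -> 0300,
  % so that string order matches numeric order for years 0 to 9999.
  \field[padside=left,padwidth=4,padchar=0]{year}
}
```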

plk commented 6 years ago

Really don't want to do that - that's what we used to do and I switched to a better sort algorithm because padding and string sort is awful with the expanded ISO date stuff we now support. It seems to me that the whole existence of sortyear is strange as a literal anyway. Probably should be an integer. I suspect that the majority of cases for this can be solved by simply putting another \sort in after the sortyear/year to discriminate further. This would be a somewhat breaking change but I think in the long term, it's better.

moewew commented 6 years ago

Conceptually that would be better, I agree. But I fear it would be too big a change to render sortyear unusable. With integers you simply can't get fine sorting like sortyear = {1984-1} vs sortyear = {1984-2}.

plk commented 6 years ago

True but that's really an abuse of the field anyway. It's essentially a way of making the correct semantic solution of having a following \sort macro into a hacky syntactic solution. If there is an ordering within a year (which is exactly what this syntax is designed to do), then there should be a further month or season or something like that.
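
A sketch of that "following \sort" idea, assuming the entries actually carry a month (for example via date = {1984-03}). Only the relevant part of a template body is shown, and whether it suits a given bibliography is a judgment call.

```latex
% Fragment of a \DeclareSortingTemplate body: order by year first,
% then break ties within a year by month instead of a hand-made
% sortyear suffix such as 1984-1 / 1984-2.
\sort{
  \field{year}
}
\sort{
  \field{month}
}
```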

moewew commented 6 years ago

Yeah, theoretically I agree. But practically it can happen that one needs to control the year sorting and does not have other semantic options available. Think of two @inbooks of the same author in the same book, where you want to sort the first before the second chapter, but sorting by title would give the opposite result. Sure I could add pages to the sorting scheme, but that would be ludicrous.

plk commented 6 years ago

I'm not sure adding pages would be ludicrous in those circumstances if semantically you want the paper earlier in the collection listed first as it's the pages that determine that ...

moewew commented 6 years ago

Mhh, I really hoped I could win you over with pages (that's why I did not go for a volume example as in knuth:ct:a etc.) ;-).

Again, in principle I agree. But sortyear is a really well established hack (even biblatex-examples.bib has 10 instances of it) and I'm really wary of getting rid of it.

plk commented 6 years ago

There is no way to make this work perfectly with hacked sortyears and the current situation is the worst I think. In general, sortX fields are the same datatype as the X field - sortyear is the exception and I think we need to fix that. I propose:

moewew commented 6 years ago

I can't think of any way that coerces sortyear to int while keeping the sorting of common sortyear idioms such as 1986-00, 1986-01 as expected.

plk commented 6 years ago

There isn't really a generalisable way but this current sortyear hacking is horrible and exactly the sort of thing that biblatex was designed to avoid. Since it is used mostly to sort collections before collection items etc., isn't volume really for this?

plk commented 6 years ago

For example, take the Nietzsche texts in the examples.bib. If you remove the sortyear from them, you get the same results because we already sort by volume after the year. So it's not clear that the hack is even needed in there?

moewew commented 6 years ago

For biblatex-examples.bib's examples volume sorting should indeed do the right thing. I assume in general people should be able to define a proper sort algorithm and with it should be able to write down a \DeclareSortingTemplate to sort their bibliography as expected without resorting to sortyear.

I still believe that sortyear hacking is a viable way to deal with some situations. The question really is how many users would be affected and how many things we are going to break badly with this. I have no idea how many people use sortyear. I'd have thought its use is not entirely unusual, but I may well suffer from sample bias.

plk commented 6 years ago

I honestly can't imagine that much would break as people using sortyear would naturally want it to compare stringwise with year. I'd rather break it and advise people on a per-case basis to use a proper sorting template. The current situation is much worse to my mind - year sorting is completely broken, it's just an accident of string sorting that it works for current millennium years.

moewew commented 6 years ago

Maybe we should at least start a short survey on comp.text.tex to inquire how widely used sortyear is.

plk commented 6 years ago

Ok - do you want to do that? I can prepare the changes in one commit in DEV so it can be tested and reverted.

moewew commented 6 years ago

OK, will do.

edit posted to c.t.t: https://groups.google.com/d/msg/comp.text.tex/CVSosV6gEiw/_C3sjunmAgAJ

u-fischer commented 6 years ago

Can't sortyear be a float? And 1984-01 interpreted as 1984.01?

(I personally never used sortyear, so simply changing it to int is fine for me too).

plk commented 6 years ago

Yes, there are some hacks like this that could be done but it won't help much as that's only one example of the possible formats. Also, if it was a float, year would need to be a float too and that slows down comparisons etc.

moewew commented 6 years ago

But it would give people a workaround to salvage their hacks. If floats are too slow, we could go with a fixed number of decimal places...

For the benefit of future me: The examples in biblatex-examples.bib still sort as expected without sortyear because of the sorttitle field. But at least the knuth:ct:... examples would still work as expected with nyvt and without sorttitle. In particular some things can be fixed with sorttitle if sortyear is not available any more.

plk commented 6 years ago

An issue raised by the Knuth works is that volume is a string but its default datatype is an int. I think I may parse int fields as we do ranges to convert them to numbers for sorting.

moewew commented 6 years ago

According to the docs that already happens...

The volume of a multi-volume book or a periodical. It is expected to be an integer, not necessarily in arabic numerals since biber will automatically [convert] from roman numerals or arabic letter to integers internally for sorting purposes.

plk commented 6 years ago

Ah, yes, I see I already did this ...

u-fischer commented 6 years ago

By coincidence I just got a question about this with a real example. The user wanted to manually sort a number of reports and had used year={2011a} and year={2011b}, which didn't work. Remembering this discussion I did not suggest sortyear={2011-a} or something like this, but considered a bit and now think that the suggestion of @moewew to use sorttitle is actually one of the logical solutions (inserting an extra field in the sort order would be another).
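
A minimal sketch of the sorttitle approach for such a case. The entries below are entirely hypothetical (keys, author, titles and institution are made up for illustration); the point is only that both share name and year, so under the nyt template quoted earlier the sorttitle field decides the order.

```latex
\begin{filecontents}{\jobname.bib}
% Hypothetical entries: same author and year, titles that would sort
% in the "wrong" order alphabetically (Budget before Zoning).
@report{report:first,
  author      = {Author, Anna},
  title       = {Zoning Report},
  sorttitle   = {A Zoning Report},
  type        = {techreport},
  institution = {Example Institute},
  date        = {2011},
}
@report{report:second,
  author      = {Author, Anna},
  title       = {Budget Report},
  sorttitle   = {B Budget Report},
  type        = {techreport},
  institution = {Example Institute},
  date        = {2011},
}
\end{filecontents}
```

With an author-year style the entries should then come out as report:first followed by report:second, with the year left as a plain 2011 in the data rather than a hand-written 2011a or 2011b.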