retorquere / zotero-better-bibtex

Make Zotero effective for us LaTeX holdouts
https://retorque.re/zotero-better-bibtex/
MIT License

Exporting the reference of a journal's double issues #925

Closed laspic closed 6 years ago

laspic commented 6 years ago

Exporting references

I noticed something strange when exporting the reference of a journal's double issue.

Here are two slightly different references:

Here is what ZBB gives when exporting them in .bib file, using Better BibLaTeX format:

@article{Bob1996,
  title = {A {{Very Nice Title}}},
  volume = {10},
  number = {2},
  journaltitle = {Annual Review of Nothing},
  date = {1996},
  author = {Bob, Bob}
}

@article{John1996,
  title = {A {{Nice Title}}},
  volume = {10},
  issue = {2-3},
  journaltitle = {Annual Review of Something},
  date = {1996},
  author = {John, John}
}

Why does the field name 'number' turn into 'issue' when its content is not strictly numerical? Is it a bug or a feature? When you customize a biblatex style, dealing with that causes some headaches! :-)

Report ID : 4DCD7RZU

retorquere commented 6 years ago

What field would it have to be in instead? @njbart, opinions?

laspic commented 6 years ago

In my opinion, it has to remain 'number', whatever its content. Is there a good reason to rename it when it is filled with something other than a number?

retorquere commented 6 years ago

Ah, like that. The biblatex manual section 2.2.2 says:

number: field (integer)

The number of a journal or the volume/number of a book in a series. See also issue as well as §§2.3.7 and 2.3.9. With @patent entries, this is the number or record token of a patent or patent request. It is expected to be an integer, not necessarily in arabic numerals since biber will automatically convert from roman numerals or arabic letter to integers internally for sorting purposes.

and 2-3 is not an integer.
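To make that strict reading concrete, here is a minimal sketch of a mapping that would produce the output in the first post. The name `strictField` is hypothetical and this is not BBT's actual code; it assumes only plain arabic integers count, and leaves out the roman numerals the manual also allows.

```typescript
// Hypothetical helper, not BBT's actual implementation.
// Strict reading of the manual: `number` only holds integers,
// everything else is treated as a literal and goes to `issue`.
function strictField(value: string): "number" | "issue" {
  return /^[0-9]+$/.test(value.trim()) ? "number" : "issue";
}

console.log(strictField("2"));   // number
console.log(strictField("2-3")); // issue
```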

laspic commented 6 years ago

I would further add (same source):

issue field (literal)

The issue of a journal. This field is intended for journals whose individual issues are identified by a designation such as ‘Spring’ or ‘Summer’ rather than the month or a number. Since the placement of issue is similar to month and number, this field may also be useful with double issues and other special cases. See also month, number, and § 2.3.9.

So, it's clear. This is not a bug, but a feature. Thanks!

njbart commented 6 years ago

IMHO, the vast majority of biblatex’s design decisions make a lot of sense, but this one (distinguishing numbers (field number) from ranges and other stuff (field issue)) does not:

It does not just create a minor inconsistency in the database, but standard biblatex even renders issue = {2–3} differently from number = {2} – arguably not quite what’s expected, at the very least not in the case of a number range:

[Screenshot of the rendered output, 2018-03-06]

I tested number = {2–3} with biblatex, biblatex-apa, and biblatex-chicago, and it is rendered as expected in all cases. (If you want a typographically correct en dash, you’ll have to enter it yourself, though.)

So I don’t think BBT would create any problems if it mapped ranges etc. to biblatex number.

As to the underlying biblatex issue, this is certainly something worth bringing up at the biblatex forum.

retorquere commented 6 years ago

I would normally shy away from deviating from what seems to be quite explicit in the manual, but I'll follow your recommendation here. Or we could wait for responses from the biblatex forum.

laspic commented 6 years ago

I've already posted a question concerning this issue on TeX.SX: https://tex.stackexchange.com/questions/418590/biblatex-doesnt-recognize-the-journals-issue-number-when-filled-by-a-non-numer

Feel free to join us!

moewew commented 6 years ago

The situation is a bit more complicated here. As far as the TeXnical side of biblatex is concerned ranges such as 2-3 should be fine in the number field. Biber does not complain either, not even with --validate-datamodel. So I think it would definitely be worth thinking about opening the number field a bit.

issue really is only good for things like 'Summer'/'Winter'. Everything else looks just weird...

retorquere commented 6 years ago

So does that mean biblatex accepts things that according to its own manual it should not?

moewew commented 6 years ago

Yes and no, but more yes than no. I'll open a ticket for biblatex to discuss this.

moewew commented 6 years ago

See https://github.com/plk/biblatex/issues/726

retorquere commented 6 years ago

Assuming the biblatex question is resolved this way, what would be the rule for issue vs number? To issue if it has letters? If it has less than half numbers? Something else?

njbart commented 6 years ago

Given that standard biblatex (I tested the authoryear style), biblatex-apa, and biblatex-chicago (again, authordate style) all seem to render any string they encounter in a number field (even strings such as “Summer”) quite sensibly (usually enclosed in parentheses), while biblatex-apa does not render issue fields (of @article items) at all, I’m tempted to say, for the time being anything found in a Zotero “Number” field should be mapped to biblatex’s number.

retorquere commented 6 years ago

Coming right up.

moewew commented 6 years ago

Do whatever works best for you. I have a dislike for issue, so my ideas reflect that. But if it is easier for you to detect what should go into number and drop everything else into issue, instead of detecting issue and dropping the rest into number, then that is what you should go for.

retorquere commented 6 years ago

The choices I make affect quite a few others -- it's not a matter of what works for me, it's a matter of what works for biblatex users. If I know what works for biblatex users, I can see whether that is feasible to implement.

moewew commented 6 years ago

This is about semantics more than about simple pattern recognition (I think). So I'm not sure you can even find a solution that works for everything.

The majority of your users should have a simple integer in the number field. Then there are a few with S1 and ranges. So for them it doesn't really matter what you plan on doing since it has emerged that whatever exactly you do, they will get things exported to number. While I know possible values for issue I have never actually seen one of these out in the wild, so I can't tell you if anybody would be affected.

retorquere commented 6 years ago

So does that collapse to "just stick it in number"? That's easy enough of course.

moewew commented 6 years ago

I don't know. As I say I have never seen an issue that could or should not have been a number. That is not to say that I can categorically say that everything has to go to number. It's just evidence that very few users would want issue.

moewew commented 6 years ago

The conservative approach from your side would be to export everything to issue, but export integer ranges with "S" or "Suppl.", "Supp.", "Supplement" to number. That way you only change the status quo for the cases mentioned in the original report here and the sensible "S1", "Suppl. 2", "Supplement 3" cases.
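A rough sketch of that conservative rule, under my own assumptions about the details: integers and integer ranges, each optionally prefixed with "S", "Suppl.", "Supp." or "Supplement", go to number, and everything else stays in issue. The names `supplPart`, `conservativeNumber` and `conservativeField` are made up for illustration, and accepting both hyphens and an en dash as range separators is an assumption.

```typescript
// Hypothetical sketch; names and separator handling are assumptions.
// One component: an integer with an optional supplement prefix.
const supplPart = "(?:(?:Supplement|Suppl\\.|Supp\\.|S)\\s*)?[0-9]+";

// Either a single component or a range of two components.
const conservativeNumber = new RegExp(
  "^" + supplPart + "(?:\\s*(?:-+|–)\\s*" + supplPart + ")?$",
  "i"
);

function conservativeField(value: string): "number" | "issue" {
  return conservativeNumber.test(value.trim()) ? "number" : "issue";
}

console.log(conservativeField("2-3"));          // number
console.log(conservativeField("Supplement 3")); // number
console.log(conservativeField("Summer"));       // issue
```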

retorquere commented 6 years ago

I'm not too worried about the status quo in this case, more about the field export being the best biblatex it can be given the input.

moewew commented 6 years ago

OK. I think I have already said all I could say. There is no simple way to decide between number and issue. And I think that whatever algorithmic solution one can come up with, there is always a way to find a pathological case that does the wrong thing (whether that case is actually useful is another question ...).

My preferred solution would be to only put divisions of the year (seasons, arbitrary term divisions, ... what have you) in issue and everything else in number (hopefully only integer ranges and a few special keywords like "S", "Suppl." etc., etc. - if an input doesn't match any of the two criteria it is probably garbage and needs to be reworked). But of course there is no exhaustive list of these divisions of the year to test for, so that is not workable...

retorquere commented 6 years ago

It doesn't need to be exhaustive, it's going to be configurable. I'll see what I can whip up.

retorquere commented 6 years ago

In Zotero, only articles have an issue field, and they don't have a number field; other reference types have number fields but never an issue field. I could just stuff both into number on output. @njbart?

njbart commented 6 years ago

I could just stuff [CSL number and issue] into [biblatex] number on output.

As I said before: I currently can’t think of very many reasons why you shouldn’t. Technically, CSL issue is a number variable, so people shouldn’t be entering stuff other than numbers here anyway, though neither the GUI nor the CSL styles seem to enforce this in any way.

The only thing I ever came across that looks a tiny bit odd is when standard biblatex outputs a volume and number as, e.g., “7.Summer” (with volume and issue, it’s the slightly more sensible “7 (Summer)”) – but if you want to keep this simple (and not engage in parsing field content, placing “1”, “1–2”, “S1”, “S1–S2”, “XVI”, … into number, and “Summer”, “Suppl.”, … into issue), I wouldn’t worry about this detail (which could also be dealt with by the biblatex maintainers, or the users at latex/biblatex run time), and indeed “stuff” everything into biblatex number.

retorquere commented 6 years ago

Detecting “1”, “1–2”, “S1”, “S1–S2”, “XVI”, ... wouldn't be too onerous. It looks to me to be:

  1. split by dash if present, and if there are at most two components
  2. of the components, verify that they're roman numerals, or an integer preceded by an optional "S"
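The two steps above could be sketched roughly as follows. This is an illustration, not the code in BBT: the names `romanV1` and `looksLikeNumberV1` are hypothetical, and treating runs of hyphens and the en dash as "dash", trimming the components, and accepting only uppercase roman numerals are all assumptions.

```typescript
// Hypothetical sketch of the two-step check; not BBT's actual code.
// Roman numerals up to 3999; the lookahead rejects the empty string.
const romanV1 = /^(?=[MDCLXVI])M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$/;

function looksLikeNumberV1(value: string): boolean {
  // Step 1: split on a dash and allow at most two components.
  const parts = value.split(/-+|–/).map((p) => p.trim());
  if (parts.length > 2) return false;
  // Step 2: every component is a roman numeral or an integer
  // preceded by an optional "S".
  return parts.every((p) => romanV1.test(p) || /^S?[0-9]+$/.test(p));
}

console.log(looksLikeNumberV1("S1–S2")); // true
console.log(looksLikeNumberV1("XVI"));   // true
console.log(looksLikeNumberV1("1,2"));   // false
```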

In my test suite, I find the following that do not match and would thus end up in issue:

"1,2" "1/2" "2003/32864" "2, (1988)" "2.3" "3/4" "363a" "93-06-03" "93-06-04" "93--09--06" "93-12-04" "A" "AI-TR-550" "BCRX0929142L" "BU-CEIS-94-30" "C3" "C6" "CMU-CS-79-124" "CMU-CS-94-123" "CS-90-37" "CSLI-85-32" "CSLI-86-63" "CS-TR-3239 and UMIACS-TR-94-31" "CS-TR-3981" "CS-TR-4604, UMIACS-TR-2004-46" "CVC TR-98-003" "DOE/ER/10825--1" "DOE/ET/23010--12" "DOE/ET/23010--17" "DOE/ET/23010--9" "DOE/ET/23010--T1" "DOE/ET/23010--T10" "Faglig rapport fra DMU nr. 814" "FIA-93-17" "IRCS--93--03" "KSL 92-71" "KSL 93-04" "LiTH-IDA-R-92-30" "Logic-92-1" "May" "MCCS-90-194" "MIT/LCS/TR--504" "NSF-AER-75-23453-4" "NSF-AER-75-23453-5" "October" "Part 1" "PRMX0908015L" "Research Report 171" "RT 2005-08-47" "Sagsref.: 1270919" "spécial anniversaire 1978-1988" "SU-CIS-90-02" "SU-CIS-90-07" "SU-CIS-90-15" "TR 215" "TR--93--027" "TR-93-065" "US2006/0256608A1" "Working Paper 303"

moewew commented 6 years ago

I don't know anything about the Zotero data model, but issue is only valid for @article entries on the biblatex end. Some of the cases in the test suite seem to be of a different type ("Working Paper 303" would probably be a @report, same goes for "Research Report 171", "Faglig rapport fra DMU nr. 814", "DOE/ER/10825--1" and probably many others), in that case the only appropriate field is number.

Additionally, I would think it acceptable to also let

go into number.

retorquere commented 6 years ago

Alright, those are easy to spot.

In zotero it's also only article types (magazine article and journal article) that can have issue, all the others have number, but what people put in either is unconstrained.

The 2, (1988) comes from a contributed journal article reference. Most of my test cases come from "the wild", only a minority of my test references are synthetic references meant to trigger a specific error. In this case it looks like the 1988 is there as an origdate of sorts, the reference also has a date, set to 2002 (the article referred to is this one).

moewew commented 6 years ago

You quite probably have more important things to attend to, but is there anything we can do here to help you?

The best short- (and maybe even long-)term solution would be to always use number and just ignore issue. issue is almost never what a user wants, except in the weird "Summer", "Winter" cases.

retorquere commented 6 years ago

Sorry, this slipped off my radar. So what of the list at https://github.com/retorquere/zotero-better-bibtex/issues/925#issuecomment-374395958 ? All to number?

moewew commented 6 years ago

The majority of these do not really make sense as number or issue of an @article. Many look as though they should be the number of a @report/@techreport or @patent. Since I can't say where they come from I can't really say anything about them.

The only values that I think could possibly make sense for @article are the following.

I don't know what your favoured approach towards this is at the moment. At some point you were hinting at a user-defined list for issue, I think that would make sense and you'd probably want to pre-populate it with the seasons and maybe month names.
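For what it's worth, a user-defined list could look something like the sketch below. It assumes number is the default and only values found in the list go to issue; the names `defaultIssueTerms` and `fieldFromList` are invented for this example, and the pre-populated English seasons and months are just one possible default, not an existing BBT preference.

```typescript
// Hypothetical default list; a real preference would be user-editable.
const defaultIssueTerms: string[] = [
  "spring", "summer", "autumn", "fall", "winter",
  "january", "february", "march", "april", "may", "june",
  "july", "august", "september", "october", "november", "december",
];

// Values on the list go to `issue`; everything else goes to `number`.
function fieldFromList(
  value: string,
  issueTerms: string[] = defaultIssueTerms
): "number" | "issue" {
  const normalized = value.trim().toLowerCase();
  return issueTerms.indexOf(normalized) !== -1 ? "issue" : "number";
}

console.log(fieldFromList("Summer")); // issue
console.log(fieldFromList("2-3"));    // number
```

A user whose journal uses some other designation could pass their own list, for example `fieldFromList("Part 1", ["part 1"])`.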

retorquere commented 6 years ago

This is where the data comes from for stuff that wasn't caught by the matching:

journalArticle.issue = "2, (1988)"
journalArticle.issue = "2.3"
journalArticle.issue = "A"
journalArticle.issue = "C3"
journalArticle.issue = "C6"
journalArticle.issue = "May"
journalArticle.issue = "October"
journalArticle.issue = "Part 1"
journalArticle.issue = "spécial anniversaire 1978-1988"
patent.number = "US2006/0256608A1"
report.number = "363a"
report.number = "93-06-03"
report.number = "93-06-04"
report.number = "93--09--06"
report.number = "93-12-04"
report.number = "AI-TR-550"
report.number = "BU-CEIS-94-30"
report.number = "CMU-CS-79-124"
report.number = "CMU-CS-94-123"
report.number = "CS-90-37"
report.number = "CSLI-85-32"
report.number = "CSLI-86-63"
report.number = "CS-TR-3239 and UMIACS-TR-94-31"
report.number = "CS-TR-3981"
report.number = "CS-TR-4604, UMIACS-TR-2004-46"
report.number = "CVC TR-98-003"
report.number = "DOE/ER/10825--1"
report.number = "DOE/ET/23010--12"
report.number = "DOE/ET/23010--17"
report.number = "DOE/ET/23010--9"
report.number = "DOE/ET/23010--T1"
report.number = "DOE/ET/23010--T10"
report.number = "Faglig rapport fra DMU nr. 814"
report.number = "FIA-93-17"
report.number = "IRCS--93--03"
report.number = "KSL 92-71"
report.number = "KSL 93-04"
report.number = "LiTH-IDA-R-92-30"
report.number = "Logic-92-1"
report.number = "MCCS-90-194"
report.number = "MIT/LCS/TR--504"
report.number = "NSF-AER-75-23453-4"
report.number = "NSF-AER-75-23453-5"
report.number = "Research Report 171"
report.number = "RT 2005-08-47"
report.number = "Sagsref.: 1270919"
report.number = "SU-CIS-90-02"
report.number = "SU-CIS-90-07"
report.number = "SU-CIS-90-15"
report.number = "TR 215"
report.number = "TR--93--027"
report.number = "TR-93-065"
report.number = "Working Paper 303"
statute.number = "BCRX0929142L"
statute.number = "PRMX0908015L"

moewew commented 6 years ago

Since we are talking about @articles (and @periodicals) only (these are the only types that can have an issue anyway), only journalArticle.issue seems relevant. All other fields must be exported to number anyway, there is no choice there.

You already explained journalArticle.issue = "2, (1988)" and I would say the input there is incorrect and a user cannot expect good output from that input no matter what you do. number will probably give better results here, though.

journalArticle.issue = "2.3" is a bit of a weird one, looks OK for number to me. It's essentially a number.

journalArticle.issue = "A"
journalArticle.issue = "C3"
journalArticle.issue = "C6"

are all short designators that would be OK in number

journalArticle.issue = "Part 1"

Would look weird in number in some styles, barely acceptable in others. Would look borderline-weird in issue as well.

journalArticle.issue = "May"
journalArticle.issue = "October"
journalArticle.issue = "spécial anniversaire 1978-1988"

see my last comment.

retorquere commented 6 years ago

In Zotero, only journalArticle and magazineArticle can have issue fields.

retorquere commented 6 years ago

So would detecting /^[A-Z]?[0-9]+(\.[0-9]+)?$/i as number be reasonable? That would leave

journalArticle.issue = "2, (1988)"
journalArticle.issue = "May"
journalArticle.issue = "October"
journalArticle.issue = "Part 1"
journalArticle.issue = "spécial anniversaire 1978-1988"
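As a quick check, the pattern can be run over the journalArticle values listed earlier in the thread. The variable names below are hypothetical. Note that the pattern as written requires at least one digit, so a bare "A" is also left unmatched.

```typescript
// The pattern exactly as proposed above.
const proposedNumberPattern = /^[A-Z]?[0-9]+(\.[0-9]+)?$/i;

// journalArticle.issue values from the unmatched list in this thread.
const journalIssueValues: string[] = [
  "2, (1988)", "2.3", "A", "C3", "C6", "May", "October", "Part 1",
  "spécial anniversaire 1978-1988",
];

// Keep only the values the pattern does not accept as a number.
const leftoverValues = journalIssueValues.filter(
  (v) => !proposedNumberPattern.test(v)
);

console.log(leftoverValues);
// "2.3", "C3" and "C6" match; the other six values, including "A", do not.
```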

moewew commented 6 years ago

I'm not great with Regex, but I think that would not match 3-4, 2,3, Suppl. 1. The first two of those are very clearly number, the latter not that clearly, but still number seems better than issue. You may also want to detect Roman numerals and should probably allow things like A, B, C without a number as well.

retorquere commented 6 years ago

3-4 and 2,3 get split up by the first stage, what I have now does:

  1. split by /-+|–|,|\//, and if there are at most two components
  2. of the components, verify that they're roman numerals, or a number preceded by an optional letter, or a sole letter

Anything that doesn't match both rules goes to issue. This only applies to journalArticle and magazineArticle; all the others go to number without any checks.

Suppl. 1 would not get detected on an article so would end up in issue currently.
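Put together, the rules read roughly like the sketch below. This is an illustration under assumptions, not the code that shipped: `exportFieldFor`, `romanFinal`, `letterNumber` and `soleLetter` are hypothetical names, components are trimmed before matching, roman numerals are uppercase only, and "a number preceded by an optional letter" is taken to allow a decimal part as in the earlier regex.

```typescript
// Hypothetical sketch of the final rules; not BBT's actual code.
const romanFinal = /^(?=[MDCLXVI])M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$/;
const letterNumber = /^[A-Z]?[0-9]+(\.[0-9]+)?$/i;
const soleLetter = /^[A-Z]$/i;

function exportFieldFor(itemType: string, value: string): "number" | "issue" {
  // Only article types are checked; all others always export to number.
  if (itemType !== "journalArticle" && itemType !== "magazineArticle") {
    return "number";
  }
  // Rule 1: split on the separators, at most two components.
  const parts = value.split(/-+|–|,|\//).map((p) => p.trim());
  // Rule 2: each component is a roman numeral, a number with an
  // optional leading letter, or a sole letter.
  const ok =
    parts.length <= 2 &&
    parts.every(
      (p) => romanFinal.test(p) || letterNumber.test(p) || soleLetter.test(p)
    );
  return ok ? "number" : "issue";
}

console.log(exportFieldFor("journalArticle", "2-3"));      // number
console.log(exportFieldFor("journalArticle", "Suppl. 1")); // issue
console.log(exportFieldFor("report", "CS-TR-3981"));       // number
```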

moewew commented 6 years ago

That's probably as good as it gets if one wants a simple algorithmic solution that uses issue as default.

blip-bloop commented 6 years ago

:robot: this is your friendly neighborhood build bot announcing test build 6116 ("Merge branch 'master' into 925").

retorquere commented 6 years ago

All my tests pass on the new rules. Could you guys give 6116 a go to see if it behaves as you want?

blip-bloop commented 6 years ago

:robot: this is your friendly neighborhood build bot announcing test build 5.0.110.6117.issue-925 ("ncu").

laspic commented 6 years ago

Everything works fine here (with the 5.0.111). Now, the example given in the first post gives:

@article{Bob1996,
  title = {A {{Very Nice Title}}},
  volume = {10},
  number = {2},
  journaltitle = {Annual Review of Nothing},
  date = {1996},
  author = {Bob, Bob}
}

@article{John1996,
  title = {A {{Nice Title}}},
  volume = {10},
  number = {2-3},
  journaltitle = {Annual Review of Something},
  date = {1996},
  author = {John, John}
}

Thanks a lot!

retorquere commented 6 years ago

Cool, thanks for the confirmation
