retorquere / zotero-better-bibtex

Make Zotero effective for us LaTeX holdouts
https://retorque.re/zotero-better-bibtex/
MIT License

Publisher Address of BibTeX Inproceedings Entries #1471

Closed qak closed 4 years ago

qak commented 4 years ago

BBT's BibTeX exporter doesn't seem to handle the Place field of Zotero items of type Conference Paper according to BibTeX's documentation. The documentation says:

The PROCEEDINGS and INPROCEEDINGS entry types now use the address field to tell where a conference was held, rather than to give the address of the publisher or organization. If you want to include the publisher’s or organization’s address, put it in the publisher or organization field.

The BibTeX exporter correctly maps Zotero's Event Place field to BibTeX's address field, but it discards Zotero's Place field. The documentation suggests that BBT should append the contents of Zotero's Place field to BibTeX's publisher field. BBT can't do the same with the organization field because Zotero doesn't support storing conference organisers.

The documentation doesn't make it clear what separator should be used between the publisher and address though. An obvious choice would be a comma, but this may not look the best in some BibTeX styles (for example when a semicolon is used to separate the entries surrounding the publisher). I still think that it'd be better for BBT to include the publisher's address with a possibly suboptimal separator rather than not including the information at all. Manually editing the exported BibTeX file to replace commas with another separator should be quicker than manually adding the addresses of publishers.

This is the debug-report ID of an example conference paper item: QRN56KSA-euc. I haven't tested this with proceedings items, and it isn't obvious to me how to add such an entry to Zotero. Note that this issue only pertains to the BibTeX exporter.

retorquere commented 4 years ago

I'm not entirely sure what to make of the docs you point out, but before I put Place anywhere I need to know what it means inside Zotero. What does Place mean for a Zotero Conference Paper if not Event Place?

I'm not a great fan of fields being put together with hard-wired separators. But if I do, I need to know what the semantics of Place are.

qak commented 4 years ago

Zotero's location fields are even more elaborate than I had originally thought. The documentation describes the relevant fields as follows:

Back to the quote from BibTeX's documentation:

The PROCEEDINGS and INPROCEEDINGS entry types now use the address field to tell where a conference was held, rather than to give the address of the publisher or organization. If you want to include the publisher’s or organization’s address, put it in the publisher or organization field.

It seems that for conference papers (inproceedings entries) BibTeX originally only supported recording the publisher's location in the address field. It seems that the developers later decided it would be better to support recording both the publisher's and the conference's location. The above quote from the documentation implies that conference papers (inproceedings entries) should instead record the conference's location in the address field and the publisher's location should be put in the publisher field. What exactly is meant by putting the publisher's location into the publisher field isn't discussed further, but it probably depends in part on the bibliography style used. I think I went too far in my initial comment by saying that the publisher's location should be appended to the publisher field separated with a comma. Assigning any of 'Publisher, Publisher Place', 'Publisher; Publisher Place', 'Publisher (Publisher Place)', 'Publisher Place: Publisher' to the BibTeX publisher field seems like a valid option depending on the bibliography style used. The fact that BibTeX's documentation states that the publisher's location can be placed into the publisher field is also discussed on Stack Exchange.

Maybe the best first step would be to make this functionality available from BBT's scripting interface. I tried to implement the described functionality using a postscript, but it seems that the values of Event Place and Publisher Place aren't made available the way I need. It looks like BBT currently assigns the values of fields Place, Event Place and Publisher Place all to the same variable item.place (so the interface only exposes the last value that got assigned). I had in mind something like

// Only act on conference papers that carry both places and a publisher.
if (Translator.BetterBibTeX && item.itemType === 'conferencePaper' && item.eventPlace && item.publisher && item.publisherPlace) {
    reference.add({name: 'address', value: item.eventPlace})
    reference.add({name: 'publisher', value: item.publisher + ', ' + item.publisherPlace})
}

whilst adjusting the call to reference.add depending on what looks the best in the current bibliography style. Here's the debug-report ID with an updated conference paper item: 887N52WF-euc.
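The per-style adjustment mentioned above could be isolated in one small helper, so that only a single argument changes between bibliography styles. This is a minimal sketch in plain JavaScript; `formatPublisher` and its style names are made up for illustration and are not part of BBT's postscript API.

```javascript
// Hypothetical helper: combine publisher and publisher place into one string.
// The style names are illustrative only; none of this exists in BBT itself.
function formatPublisher(publisher, place, style = 'comma') {
  if (!place) return publisher
  switch (style) {
    case 'comma': return `${publisher}, ${place}`
    case 'semicolon': return `${publisher}; ${place}`
    case 'parens': return `${publisher} (${place})`
    case 'place-first': return `${place}: ${publisher}`
    default: throw new Error(`unknown style: ${style}`)
  }
}

console.log(formatPublisher('Example Press', 'New York, NY, United States', 'parens'))
```

Inside a postscript, the result would then be passed as the `value` of the publisher field in place of the hard-wired `', '` concatenation.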

retorquere commented 4 years ago

Sorry I was gone so long -- Zotero 5.0.85 introduced some changes that required a bit of work on my end to address, but that's now done.

Where in the UI do you find Event Place and Publisher Place?

retorquere commented 4 years ago

Can you send me a new debug log? I forgot to pick up 887N52WF-euc and the log submissions get deleted after a week.

qak commented 4 years ago

Sorry I was gone so long -- Zotero 5.0.85 introduced some changes that required a bit of work on my end to address, but that's now done.

No problem, thanks for working on BBT =)

Where in the UI do you find Event Place and Publisher Place?

They don't have their own fields but they're entered into Zotero's Extra field.

Can you send me a new debug log? I forgot to pick up 887N52WF-euc and the log submissions get deleted after a week.

Here it is: CTCKDJ2V-euc.

blip-bloop commented 4 years ago

:robot: this is your friendly neighborhood build bot announcing test build 5.2.20.6294 ("adjust tests for #1471")

Install in Zotero by downloading test build 5.2.20.6294, opening the Zotero "Tools" menu, selecting "Add-ons", opening the gear menu in the top right, and selecting "Install Add-on From File...".

retorquere commented 4 years ago

Sponsor: is actually a stanza that has special meaning in the extra field -- it's recognized as "cheater syntax" and removed (by me) from the extra field. I just stuck it in collaborator because I didn't know what sponsors meant for bibtex, but according to BibTeXing:

organization The organization that sponsors a conference or that publishes a manual.

So 6294 should get that right without requiring a postscript.

qak commented 4 years ago

So 6294 should get that right without requiring a postscript.

Thanks, I've tried it and Sponsor is correctly assigned to organization without needing a postscript.

~Should the issue be otherwise closed though? What about the handling of Place, Event Place and Publisher Place which I mentioned previously?~

Edit: Oh, the issue gets automatically reopened.

retorquere commented 4 years ago

Yep, if you comment on a closed issue, the bot reopens it as a reminder for me.

retorquere commented 4 years ago

We can still talk about those other fields if you want.

qak commented 4 years ago

I originally opened this issue because of how publishing addresses are handled in Zotero's Conference Paper items. In the previous comments I've described how Zotero supports fields Place, Event Place and Publisher Place. Using these fields from BBT isn't fully supported yet. I tried to implement the missing functionality using a postscript, but it seems that BBT treats the various place fields as identical. I've further described this in my previous comments here and here.

The issue you've fixed (assigning the value of Sponsor to organization) is in fact something that I, until you brought it up, haven't mentioned in any of my previous comments! That was a separate issue I worked around by using a postscript; I was planning to report the issue later on. I think that because one of my earlier comments references a postscript and because the debug-report contained another unrelated postscript (which was used to deal with Sponsor), you continued with the Sponsor problem and forgot that this issue addresses publishing locations instead :).

retorquere commented 4 years ago

Sorry about that. Your extra field has:

| extra | available as |
| --- | --- |
| Event Date: 2015-06-08/2015-06-11 | item.extra |
| Event Place: Newport Beach, California, USA | address |
| Publisher Place: New York, NY, United States | item.extra |
| Sponsor: ACM SIGARCH | organization |

qak commented 4 years ago

Yes, and as I mentioned previously, I couldn't seem to access the values of Event Place and Publisher Place separately because BBT seems to write both of these fields to the same variable accessible through the scripting API. This means that the value of the variable depended on the order of Event Place and Publisher Place in Zotero's Extra field.
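The order dependence described here can be reproduced with a toy parser. The sketch below is not BBT's code; it only assumes, as described in this comment, that every place-like line in Extra is written to one shared variable. `parsePlaceNaively` is a hypothetical name.

```javascript
// Illustrative only: funnel every "... Place" line in Extra into a single
// variable, so whichever line comes last wins.
function parsePlaceNaively(extra) {
  const item = {}
  for (const line of extra.split('\n')) {
    const m = line.match(/^([A-Za-z ]+):\s*(.*)$/)
    if (!m) continue
    const label = m[1].trim().toLowerCase()
    if (label === 'place' || label === 'event place' || label === 'publisher place') {
      item.place = m[2] // overwrites any earlier place value
    }
  }
  return item
}

const eventFirst = parsePlaceNaively(
  'Event Place: Newport Beach, California, USA\nPublisher Place: New York, NY, United States')
const publisherFirst = parsePlaceNaively(
  'Publisher Place: New York, NY, United States\nEvent Place: Newport Beach, California, USA')

console.log(eventFirst.place)     // New York, NY, United States
console.log(publisherFirst.place) // Newport Beach, California, USA
```

Swapping the two lines in Extra changes the exported value, which is why a postscript cannot recover both places from the single variable.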

retorquere commented 4 years ago

I'm currently following the Zotero rules for parsing the extra field (to the best of my abilities), and it turns out the place discussion is a long-standing problem. I'll see how I can deviate from that without running into problems when that gets fixed.

retorquere commented 4 years ago

I have a concept for a solution, but it takes a lot longer than I had hoped to implement this.

bwiernik commented 4 years ago

@retorquere Did you figure out which types should have Place mapped to event-place and which to publisher-place?

retorquere commented 4 years ago

I think I'm going to split the behavior between mapping for CSL and for other Zotero concerns. For "Place", which doesn't have an unambiguous mapping to CSL, I'm considering leaving it in the extra fields where it can be picked up by a postscript. If you want event-place you could use that and it would map unambiguously to Place for CSL->Zotero mappings.

retorquere commented 4 years ago

Figuring out a consistent mapping has been a challenge however. I feel like I'm closing in, but I thought that a week ago too, and I spent most of my waking hours last week trying to get it done.

bwiernik commented 4 years ago

The current Zotero behavior is to map Place to both event-place and publisher-place. Doing that would make pandoc citations mirror Zotero Word processor citations. “Event place” and “Publisher place” obviously should just map to the one.

retorquere commented 4 years ago

“Event place” and “Publisher place” obviously should just map to the one.

That's not what @qak is looking for though. The problem is exactly that these two map to one field, where he wants to distinguish between them. 1-to-many is less problematic than many-to-one.

bwiernik commented 4 years ago

No, the problem is that Place (the Zotero field) is currently only getting mapped to one, but should be mapped to both to match Zotero's (problematic) behavior. The CSL variables Event Place (event-place) and Publisher Place (publisher-place) should only be mapped to the actual CSL variable.

retorquere commented 4 years ago

Right, I'm testing that setup right now.

bwiernik commented 4 years ago

All "Place" fields in Zotero are mapped to both event-place and publisher-place because that field predates Zotero adopting CSL, and there was only one "Place" field for all item types. It's currently not possible for Zotero to map "Place" to one or the other across different items, hence the annoying one-to-many mapping.

retorquere commented 4 years ago

I have a deterministic mapping derived from the schemas.

retorquere commented 4 years ago

Still -- that deterministic mapping leaves me with a bit of a problem. If I'm exporting to CSL, and there's place in extra, if I copy that to both event-place and publisher-place, a postscript can't distinguish between there having been place: in the extra field, or two distinct event place: and publisher place: lines. Postscripters would have to check whether the values happen to be the same. Ugh. That's what I'm going to do though. I've spent more than enough time on this.
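The ambiguity can be illustrated with a short sketch. It assumes the parsed extra fields arrive as a plain object keyed by variable name, and that explicit `event-place` / `publisher-place` values take precedence over a generic `place`; both are assumptions made for illustration, and `expandPlace` and `looksLikeGenericPlace` are hypothetical names rather than BBT functions.

```javascript
// Hypothetical: copy a generic "place" to both CSL place variables,
// leaving any explicit event-place / publisher-place value untouched.
function expandPlace(fields) {
  const out = { ...fields }
  if (fields.place !== undefined) {
    if (out['event-place'] === undefined) out['event-place'] = fields.place
    if (out['publisher-place'] === undefined) out['publisher-place'] = fields.place
    delete out.place
  }
  return out
}

// After expansion, all a postscript can do is compare the two values.
function looksLikeGenericPlace(csl) {
  return csl['event-place'] !== undefined && csl['event-place'] === csl['publisher-place']
}

const fromGeneric = expandPlace({ place: 'Newport Beach, California, USA' })
const fromExplicit = expandPlace({
  'event-place': 'Newport Beach, California, USA',
  'publisher-place': 'New York, NY, United States',
})
console.log(looksLikeGenericPlace(fromGeneric), looksLikeGenericPlace(fromExplicit))
```

Two explicit lines that happen to carry the same value would be indistinguishable from a single generic `place:` line, which is exactly the limitation described above.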

qak commented 4 years ago

Postscripters would have to check whether the values happen to be the same.

I think that this limitation shouldn't be much of a problem. The ability to programmatically extract both Event Place and Publisher Place would resolve the issue in any case. Thanks for taking the time to look into this!

retorquere commented 4 years ago

I'm finally in the final stages of running my tests, and most of it is looking good, but I'm now hitting an (old) issue that I don't quite know what to do with.

A visualization of the current mapping can be found here. I'm using yEd to open it. Quick rundown:

  1. white nodes are the labels you could expect to enter in the extra field (they're lower case in the graph, but casing doesn't matter when you enter them in the extra field).
  2. darker green nodes are zotero fields
  3. light green fields are CSL fields
  4. black arrows represent a directed mapping between CSL and Zotero fields as I find them in the Zotero/Juris-M schema
  5. grey dashed arrows are direct mappings in the schema files which I've removed because they're ambiguous overwrites. The number beside the arrow tells which edges contributed to this. An example is place: both the event place and publisher place keywords would "write" to the place zotero field, meaning potential data loss. event-place and publisher-place would remain available to postscripts, but in an export to zotero fields (which fuels the bibtex and biblatex output), neither would show up unless explicitly acted upon (which BBT does in places)
  6. blue dashed lines are safe inferences from a variable mapping; an example is the place label which can safely write to the CSL fields event-place and publisher-place. The numbers beside the arrows show what route led to the inference.
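The "ambiguous overwrite" case from point 5 can be sketched as a search for targets that have more than one incoming edge. The edge list below is a toy example built from the place case described above, not the actual schema-derived graph, and `ambiguousTargets` is a hypothetical name.

```javascript
// Toy edge list (label -> Zotero field); illustrative, not the real schema.
const edges = [
  { from: 'event place', to: 'place' },
  { from: 'publisher place', to: 'place' },
  { from: 'publication title', to: 'publicationTitle' },
]

// Group sources by target and keep only targets that several sources write to.
function ambiguousTargets(edgeList) {
  const sources = new Map()
  for (const { from, to } of edgeList) {
    if (!sources.has(to)) sources.set(to, [])
    sources.get(to).push(from)
  }
  return new Map([...sources].filter(([, froms]) => froms.length > 1))
}

console.log(ambiguousTargets(edges)) // only "place" is reported
```

Edges into a reported target are the candidates for the grey dashed treatment, since writing all of them would mean one value silently replacing another.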

I know this looks a little complicated, but this was the easiest way to visualize the rat's nest of field mappings.

I'm still not really happy with mappings like place for CSL. I have a sample with place: <something> in the extra field, and it looks weird to me to have that show up in both the event-place and the publisher-place field.

retorquere commented 4 years ago

Another option for point 5 is to allow overwrites (with potentially arbitrary outcomes) but leave the extra fields available for postscripts to correct the situation. I'm no great fan of "arbitrary" though.

retorquere commented 4 years ago

I had totally forgotten how much fun it was to work with graphs.

retorquere commented 4 years ago

The current Zotero behavior is to map Place to both event-place and publisher-place. Doing that would make pandoc citations mirror Zotero Word processor citations. “Event place” and “Publisher place” obviously should just map to the one.

That's what it does now, but if at all possible, now's the chance to not just implement the existing workaround. It is, after all, supposedly better CSL JSON.

retorquere commented 4 years ago

I'm hitting an issue now about container. container is not a currently mapped field, so I have to make a choice myself. The CSL var spec says its type is date, with a less than helpful description of ?.

bwiernik commented 4 years ago

No one knows what container is supposed to be used for, and it is marked for deprecation. I honestly would just ignore it.

If you want, I can put together a suggestion for which place field Place should go to for each Zotero type. Mostly it would be publisher-place except for Presentation (CSL speech). Conference Paper (CSL paper-conference) should still be publisher-place because that field is intended for the place of the publisher of the published proceedings, not the location of the conference (cf. the Proceedings Title vs the Conference Name fields).

retorquere commented 4 years ago

Oof... per-item-type mapping should be technically possible but I'd have to rethink the architecture for the mapping I have. Hmm, I can maybe add tags to the graph... let's give it a go and see how complicated things would get. Can't commit to it though.

bwiernik commented 4 years ago

Organized by Zotero type or CSL type?

retorquere commented 4 years ago

Errrr.... conceptually it'd be best to do this by csl type I think?

retorquere commented 4 years ago

And I think I have an idea on how I could do this... hmm...

bwiernik commented 4 years ago

Regarding (5) above, this is a very niche issue. In most cases, users will be entering place information into the proper Zotero Place field. If something is in Extra, it is usually to force Zotero to only map to either event-place or publisher-place or to provide separate values for the two (e.g., to give a publisher location and an event location for paper-conference).

For this latter case, I'm not sure how a flow that first pushes into Zotero fields, then back out into BibTeX fields would work. Could the place fields be directly mapped to their correct Bib(La)TeX fields, rather than first into the Zotero schema?

There is also the issue that BibTeX's usage of fields is bizarre here. In my experience, styles that include the place for a published proceedings item want the publisher location, not the conference location, so BibTeX's inconsistent use of address across types is a problem (cf. here). BibLaTeX is obviously better here with separate location and venue fields for the publisher and event locations, respectively.

All that said, here is what a generic Place field should map to for each CSL/CSLm type. Basically, everything should map to publisher-place except speech. Some types, such as interview, hearing, personal_communication, and paper-conference might be expected to have both types of places, but the primary one used in citations would be publisher-place.
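The rule of thumb stated in this comment (everything to publisher-place except speech) is small enough to write down directly. This is only a sketch of that one sentence; `placeVariableFor` is a hypothetical name and the function does not attempt to cover types that might carry both kinds of place.

```javascript
// Sketch of the stated rule: a generic Place goes to publisher-place
// for every CSL type except speech, which gets event-place.
function placeVariableFor(cslType) {
  return cslType === 'speech' ? 'event-place' : 'publisher-place'
}

console.log(placeVariableFor('paper-conference')) // publisher-place
console.log(placeVariableFor('speech'))           // event-place
```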

retorquere commented 4 years ago

For this latter case, I'm not sure how a flow that first pushes into Zotero fields, then back out into BibTeX fields would work. Could the place fields be directly mapped to their correct Bib(La)TeX fields, rather than first into the Zotero schema?

Yep. That is what I now do; when extracting for bibtex I extract variables "zotero-oriented", but do not write event-place to place because that could cause data loss. event-place and things like original-date, which doesn't have a zotero equivalent at all, stay in csl format, available to the BBT translators, and I decide on them in code.

All that said, here is what a generic Place field should map to for each CSL/CSLm type. ... Basically, everything should go to publisher-place except speech. Some types, such as interview, hearing, personal_communication, and paper-conference might be expected to have both types of places, but the primary one used in citations would be publisher-place.

That's just for place though. There are more multiple mappings: dimensions, container title and references, for example. Try looking at the graph, it's pretty.

bwiernik commented 4 years ago

(I'm not really following the numbers next to the dashed lines. How do I know what "13" means?)

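Are the collisions you are worried about if the user supplies a value in Extra but there is already a value for the variable in a proper Zotero field? If that's the case, these are the resolution rules used by citeproc-js:

* name variables can have multiple entries
* For other variables, the last encountered value in Extra wins
* Values in Extra don't override proper Zotero fields except:

  * date variables
  * item type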

In terms of the specific gray collision lines in your graph, I don't think most of them are a problem. These are just type-specific "localizations" or labels for the generic term. Multiple variables from one such class don't occur within a single item type. That these don't all collapse to the same internal Zotero database variable reflects the history of the database structure, which predates the adoption of CSL (or else, runningTime and artworkSize would be internally mapped to a common dimensions variable in the same way bookTitle is mapped to publicationTitle).

  1. dimensions you shouldn't ever encounter a collision because no Zotero type has both runningTime and artworkSize. Those are just the type-specific "localizations" of the general dimensions variable
  2. Same with references. history and references are just the type-specific "localizations" of the references variable for patent (references) and other legal types (history)
  3. Same with container-title. code and reporter are just the type-specific "localizations" for publicationTitle for cases and legislation.
  4. Same with authority. These are just the type-specific wordings of the general authority variable.
  5. Same with volume. codeNumber is just a specific wording of volume for legislation
  6. Same with call-number. applicationNumber is just the wording for patent. You shouldn't ever have a collision for those.
  7. Same with medium. system is a type-specific label for medium for computerProgram items (though this is a pretty stupid mapping in my opinion that I've recommended be dropped).

In contrast,

  1. series and seriesTitle is a bit of a mess. Even the Zotero variables series, seriesTitle, and seriesText are a jumble with unclear distinction. Here, seriesTitle is the rarely used variable, so if there is a collision, I would suggest prioritizing series in the map to collection-title.

bwiernik commented 4 years ago

The "last encountered" behavior is also what Zotero CSL JSON and Better CSL JSON currently do if both series and seriesTitle are supplied in proper fields.

blip-bloop commented 4 years ago

:robot: this is your friendly neighborhood build bot announcing test build 5.2.22.6531 ("test cases for new mapping")

Install in Zotero by downloading test build 5.2.22.6531, opening the Zotero "Tools" menu, selecting "Add-ons", opening the gear menu in the top right, and selecting "Install Add-on From File...".

retorquere commented 4 years ago

(I'm not really following the numbers next to the dashed lines. How do I know what "13" means?)

The numbers don't mean anything, it's just that edges that have the same number do something together. In the case of a grey-dashed edge, there will be black edges with the same number that explain why it's removed.

Are the collisions you are worried about if the user supplies a value in Extra but there is already a value for the variable in a proper Zotero field? If that's the case, these are the resolution rules used by citeproc-js:

* name variables can have multiple entries

I have that

* For other variables, the last encountered value in Extra wins

I have that too

* Values in Extra don't override proper Zotero fields except:

  * date variables
  * item type

I'll have to change this. This surprises me. Wouldn't something entered in the extra field be considered to be more deliberate by the user?
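The three rules quoted in this comment can be put together in a small sketch. It is not citeproc-js or BBT code: `mergeExtra` is a hypothetical name, the sets of name and date variables are illustrative subsets, and appending Extra names to existing ones is an assumption the rules do not spell out.

```javascript
// Illustrative subsets only; the real variable lists are longer.
const NAME_VARS = new Set(['author', 'editor'])
const DATE_VARS = new Set(['issued', 'original-date'])

// fields: values from proper Zotero fields; extraEntries: [name, value]
// pairs in the order they appear in Extra.
function mergeExtra(fields, extraEntries) {
  const fromExtra = {}
  for (const [name, value] of extraEntries) {
    if (NAME_VARS.has(name)) {
      if (!fromExtra[name]) fromExtra[name] = []
      fromExtra[name].push(value) // name variables can have multiple entries
    } else {
      fromExtra[name] = value // last encountered value in Extra wins
    }
  }
  const out = { ...fields }
  for (const [name, value] of Object.entries(fromExtra)) {
    if (NAME_VARS.has(name)) {
      out[name] = [...(out[name] || []), ...value] // assumption: append
    } else if (out[name] === undefined || DATE_VARS.has(name) || name === 'type') {
      out[name] = value // only dates and item type override proper fields
    }
  }
  return out
}

const merged = mergeExtra(
  { title: 'From field', issued: '2015' },
  [['title', 'From extra'], ['issued', '2015-06-08'], ['volume', '1'], ['volume', '2']])
console.log(merged)
```

Here `title` keeps the field value, `issued` takes the Extra value, and `volume` ends up as the last Extra value.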

1. `dimensions` you shouldn't ever encounter a collision because no Zotero type has both runningTime and artworkSize. Those are just the type-specific "localizations" of the general `dimensions` variable

But a user could enter both in the extra field. This goes for all these.

In contrast,

1. series and seriesTitle is a bit of a mess. Even the Zotero variables series, seriesTitle, and seriesText are a jumble with unclear distinction. Here, seriesTitle is the rarely used variable, so if there is a collision, I would suggest prioritizing series in the map to `collection-title`.

Alright, that's doable I think.

retorquere commented 4 years ago

dimensions you shouldn't ever encounter a collision because no Zotero type has both runningTime and artworkSize. Those are just the type-specific "localizations" of the general dimensions variable

In such a case I would have expected a baseField mapping to exist for those. If that existed, all of these double mappings would probably disappear. But since there isn't, if you change item type from film to artwork, runningTime is lost. Which isn't too strange -- a runningTime of 2 hours doesn't have a sensible translation to an artworkSize for a painting.

I can add a mapping-specific baseField-like mapping for these fields. That may well resolve the lot.

bwiernik commented 4 years ago

I don't really care for users entering Zotero labels rather than CSL variable names in Extra, but I get that's a possibility. Still though, I think the type-specificity of most of these labels makes collisions a rare possibility. If they do occur, I think the last-encountered rule is a reasonable behavior.

Regarding Zotero fields vs Extra getting priority, I think there were two arguments. First, Frank was leery about being too aggressive with the Extra "cheater" syntax. The date and type overrides are there really to overcome major limitations of the Zotero object model (missing CSL types and a relatively inflexible date parser). Second, Extra will get preserved if an item is duplicated, has the type changed, etc., so there is reasonable possibility of bad data in Extra that might escape users' notice more than proper fields.

bwiernik commented 4 years ago

I think if Zotero's object model were built today, it would have baseField mappings for many of these, but changing item types and fields hasn't been possible until recently because of the syncing architecture.

retorquere commented 4 years ago

I'll just do the faux basefields and the conservative read from extra and see where that gets me. If that works, it would keep things simpler.

bwiernik commented 4 years ago

I think the only things that would remain to be addressed in that case are the handling of Place and series/seriesTitle.

retorquere commented 4 years ago

I don't really care for users entering Zotero labels rather than CSL variable names in Extra, but I get that's a possibility.

The graph hides those for readability, but what happens is if there is a [a-zA-Z_-]+: .* line in the extra, I pick up the part before the : and transform it using

label.replace(/[-_]/g, ' ').replace(/([a-z])([A-Z])/g, '$1 $2').toLowerCase()

and then test whether it's any of those white labels. This means event-place and event place (and EvEnt-PlaCe, but we obviously would not encourage this) all end up as event place in the matching process. Inside the translators, these would always show up as event-place in the parsed extra fields. They wouldn't automatically show up in non-CSL-based translators, but there I pick them out individually as I write out fields from Zotero or "extra-CSL" to bib(la)tex.

Zotero labels get the same treatment so you can enter either. AAMOF the old-style cheater syntax {:publicationTitle:stuff} will also work, although I wouldn't expect people to use that.
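The normalization quoted above can be tried in isolation. The sketch below simply wraps that one expression in a function; `normalizeLabel` is a name chosen here for illustration, and the surrounding matching logic in BBT is not reproduced.

```javascript
// The expression quoted in this comment, wrapped for experimentation.
function normalizeLabel(label) {
  return label
    .replace(/[-_]/g, ' ')                // hyphens and underscores become spaces
    .replace(/([a-z])([A-Z])/g, '$1 $2')  // split camelCase at lower-to-upper boundaries
    .toLowerCase()
}

for (const label of ['event-place', 'event_place', 'eventPlace', 'Event Place']) {
  console.log(label, '->', normalizeLabel(label)) // each prints "event place"
}
```

Taken literally, the second replace splits at every lowercase-to-uppercase boundary, so a mixed-case label such as `EvEnt-PlaCe` comes out as `ev ent pla ce` under this exact expression; the actual matching code may well differ from the one-liner quoted here.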

retorquere commented 4 years ago

Wait, the faux-basefields don't solve anything. If I find dimensions in the extra field I will just treat those as both artworkSize and runningTime, and they're not even written out now, so that's not at all interesting right now. More interesting is call number. I do write out both callNumber and applicationNumber and they mean different things. And right now, you can't say "no, just fill callNumber, not applicationNumber" (where you can say "fill applicationNumber but not callNumber").

retorquere commented 4 years ago

Wait, that's an error anyhow. If there's a direct edge between a label and a domain (zotero/csl) var, it should not also infer a longer route to a var in the same domain. That's just an error.
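The rule stated here (a direct edge into a domain suppresses inferred routes into that same domain) might look roughly like the following. The graph is a toy built from the call number example in this thread, each edge carries the domain of its target, and `targetsFor` is a hypothetical name; none of this is taken from BBT's source.

```javascript
// Toy graph: node -> outgoing edges, each tagged with the target's domain.
const graph = {
  'call number': [{ to: 'callNumber', domain: 'zotero' }],
  callNumber: [{ to: 'call-number', domain: 'csl' }],
  'call-number': [{ to: 'applicationNumber', domain: 'zotero' }],
}

// Collect direct targets of a label, then follow longer routes, skipping
// inferred targets in any domain the label already reaches directly.
function targetsFor(label, g) {
  const direct = g[label] || []
  const directDomains = new Set(direct.map(edge => edge.domain))
  const result = direct.map(edge => edge.to)
  const seen = new Set([label, ...result])
  const queue = [...result]
  while (queue.length > 0) {
    const node = queue.shift()
    for (const edge of g[node] || []) {
      if (seen.has(edge.to)) continue
      seen.add(edge.to)
      queue.push(edge.to)
      if (!directDomains.has(edge.domain)) result.push(edge.to)
    }
  }
  return result
}

console.log(targetsFor('call number', graph)) // [ 'callNumber', 'call-number' ]
```

With this rule, the call number label fills callNumber and the CSL call-number, but no longer reaches applicationNumber through the longer route.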