Closed bdarcus closed 3 years ago
Why is parsing restricted to the buffer?
Well, parsebib used to be part of Ebib. Since Ebib uses the low-level API (because I want to be able to report errors and continue parsing), I only spun off the buffer-parsing part.
Is it impractical to allow direct file parsing?
No, it would just involve writing a wrapper function that takes a file path, creates a temp buffer, inserts the file with insert-file-contents and then parses the buffer. It could even take multiple file paths and insert all files into a single buffer before parsing. (Using a single buffer would allow the function to resolve @Strings and crossrefs in one go. It would probably also involve changing the way parsebib deals with errors, BTW.)
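A minimal sketch of the wrapper idea described here. The function name is hypothetical, and the call to parsebib-parse-bib-buffer assumes parsebib's buffer-parsing entry point can be used with default arguments:

;; Hypothetical sketch only: `my-parsebib-parse-files' is not part of parsebib.
(require 'parsebib)

(defun my-parsebib-parse-files (files)
  "Parse FILES, a file name or a list of file names, in one temp buffer."
  (with-temp-buffer
    (dolist (file (if (listp files) files (list files)))
      (goto-char (point-max))        ; append each file after the previous one
      (insert-file-contents file))
    (goto-char (point-min))
    ;; Hand the whole buffer to parsebib's existing buffer parser.
    (parsebib-parse-bib-buffer)))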
OK, cool.
Consider this a "would be nice to add at some point if you get to it" (or if someone submits a PR) feature request then :-)
Done. :-)
The wip/csl branch now has such a function, parsebib-parse. It takes a file name or a list of files and returns all entries in them in a single hash table.
It accepts a mix of .bib and .json files and is mainly meant for packages that want to display the contents of the entries to an end user, such as bibtex-completion and the packages based on it. It doesn't return the @Preamble or @Comments in a .bib file, but I assume that's not your use case anyway. (If it is, it can easily be added, though.)
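A minimal usage sketch of parsebib-parse as described above (the file paths are placeholders, not taken from the thread):

(require 'parsebib)
;; Parse a mix of .bib and CSL-JSON files into one hash table keyed by entry key.
(parsebib-parse '("~/bib/references.bib" "~/bib/references.json"))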
Awesome!
I'll copy @tmalsburg, as it seems this could simplify the bibtex-completion code and also add the CSL json import.
Thanks. Joost and I have already discussed how this could be used. Indeed there's a great potential for simplification in bibtex-completion.
I just tested a json file and bib file together.
How are you thinking about dealing with the different key names in csl-json vs bib(la)tex @joostkremers?
Aside from the issue you've already raised about strings vs symbols, the other obvious one is things like "journal" vs "container-title".
Just say that's not parsebib's responsibility (which would be totally reasonable)?
And what have you decided about the strings vs symbols issue?
PS - I did just notice a bug; somehow tags got appended to the doi on the bib import:
(("doi" . "10.1080/13602004.2019.1575022tagsme,student")
Is that possible? I don't think that's my error, as the bib file just has the doi.
I just tested a json file and bib file together.
How are you thinking about dealing with the different key names in csl-json vs bib(la)tex @joostkremers?
Which keys are we talking about exactly?
Aside from the issue you've already raised about strings vs symbols, the other obvious one is things like "journal" vs "container-title".
Just say that's not parsebib's responsibility (which would be totally reasonable)?
For the moment, I decided to leave that aside, yes. It could eventually make sense to offer some sort of unified structure, but for me, supporting CSL-JSON in Ebib currently has priority.
Note also that any conversion that parsebib undertakes on the data slows down parsing. So in general it might be better to do something like:
(or (assoc-string "year" entry 'case-fold)
    (assoc-string 'issued entry))
Although that might start to get ugly really quickly...
I am open to suggestions to handle it better. :slightly_smiling_face:
And what have you decided about the strings vs symbols issue?
Entries are represented as alists, and luckily enough Elisp has assoc-string, which also accepts symbols, but converts them to strings before comparing. It can also take a case-fold argument. So:
(assoc-string 'author '(("Author" . "Jane Doe") ("Title" . "Some Title")) 'case-fold)
returns ("Author" . "Jane Doe")
, and similarly:
(assoc-string "Author" '((author . "Jane Doe") (title . "Some Title")) 'case-fold)
returns (author . "Jane Doe").
So while it would still be necessary to make sure you're asking for the right fields, at least you don't have to worry about passing in the correct type.
... luckily enough Elisp has assoc-string, which also accepts symbols, but converts them to strings before comparing. It can also take a case-fold argument ...
Great; that will go a long way.
I am open to suggestions to handle it better.
Edit: I could imagine a helper function (maybe an adapted bibtex-completion-get-value?) that wraps assoc-string and does the mapping when needed, so that one could do (bibtex-completion-get-value "author") or (bibtex-completion-get-value "issued") and it would return the right string regardless of source.
Or something similar in parsebib where, now that I think about it, it seems to make more sense?
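For illustration, a sketch of the kind of helper described here; the function name and the alias table are hypothetical examples, not part of bibtex-completion or parsebib:

;; Hypothetical helper: try the requested field, then an aliased field name.
(defvar my-field-aliases
  '(("issued" . "year") ("year" . "issued")
    ("container-title" . "journal") ("journal" . "container-title"))
  "Example alist of equivalent field names across CSL-JSON and Bib(La)TeX.")

(defun my-get-value (field entry)
  "Return the value of FIELD in ENTRY, falling back on an aliased field name."
  (or (cdr (assoc-string field entry 'case-fold))
      (let ((alias (cdr (assoc-string field my-field-aliases 'case-fold))))
        (and alias (cdr (assoc-string alias entry 'case-fold))))))

;; (my-get-value "journal" csl-entry) then returns the "container-title" value,
;; and (my-get-value "issued" bib-entry) returns the "year" value.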
PS - I did just notice a bug; somehow tags got appended to the doi on the bib import:
(("doi" . "10.1080/13602004.2019.1575022tagsme,student")
Is that possible? I don't think that's my error, as the bib file just has the doi.
That's weird. Are the tags anywhere in the bib file? Would it be possible to send me the file or the entry where this happens?
Would it be possible to send me the file or the entry where this happens?
Actually, in the process of narrowing this down, I realized there was a missing comma; syntax error.
So my fault ;-)
:relieved:
Edit: I could imagine a helper function (maybe an adapted bibtex-completion-get-value?) that wraps assoc-string and does the mapping when needed, so that one could do (bibtex-completion-get-value "author") or (bibtex-completion-get-value "issued") and it would return the right string regardless of source. Or something similar in parsebib where, now that I think about it, it seems to make more sense?
I thought about this, but I haven't made up my mind yet.
There's a related question: Denis explained to me that Zotero uses its own set of fields, which are mapped internally to CSL-JSON fields. So the JSON fields are never visible to the user. A similar strategy might make sense for Ebib, because the JSON fields aren't always that descriptive (cf. container-title vs. journal, which was already mentioned above).
So we'd end up with two mappings: one from biblatex fields to csl-json fields, to allow packages using parsebib to access field values without having to know the format of the underlying file, and another from UI fields to csl-json fields. Defining these in parsebib would make it easy to unify them across the Emacs ecosystem. Then again, I'm not sure if it would even be useful to have a set of UI fields for bibtex-completion.
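Purely to illustrate the two mappings being discussed, a sketch with example field pairs; the variable names and the specific pairs are illustrative, not a finished schema:

;; Example data only.
(defvar my-biblatex-to-csl
  '((journaltitle . container-title)
    (date         . issued)
    (location     . publisher-place))
  "Example mapping from biblatex field names to CSL-JSON field names.")

(defvar my-ui-to-csl
  '(("Journal" . container-title)
    ("Year"    . issued)
    ("Place"   . publisher-place))
  "Example mapping from user-facing (UI) field names to CSL-JSON field names.")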
BTW, if parsebib would define an access function to get the value of a field, one could even go a step further and have that function do @String expansion and cross-reference resolution as well. Then this wouldn't have to be done during parsing.
Here, too, though, I'm not sure if it would make sense for bibtex-completion. For Ebib, it would; in fact I already do this (at least for cross-references; @Strings aren't expanded).
Yeah, the CSL model is a bit more abstract, and also aimed at output formatting. Hence names like "container-title".
The bibtex-completion front-ends don't actually include any labels in the UIs, so the names are also less important there I think.
The exception is a user wanting to configure the display using the templates. There one uses the field names directly.
With that, there is arguably some advantage with, for example, "container-title" (because it effectively means "journal" or "incollection").
So we'd end up with two mappings: one from biblatex fields to csl-json fields, to allow packages using parsebib to access field values without having to know the format of the underlying file, and another from UI fields to csl-json fields. Defining these in parsebib would make it easy to unify them across the Emacs ecosystem. Then again, I'm not sure if it would even be useful to have a set of UI fields for bibtex-completion.
I guess I'd have to see the details to know for sure.
You would need such a UI mapping regardless for ebib, right? So just a question of where to put it?
If yes, you could start with it here and see what feedback you get?
Up to you, but do you want to open a new issue for this? You already closed this narrow request :-)
The exception is a user wanting to configure the display using the templates. There one uses the field names directly.
But I guess in that case you want to see the actual field names and you don't want someone's idea of a useful UI to get in the way, right?
I guess I'd have to see the details to know for sure. You would need such a UI mapping regardless for ebib, right?
Yes.
So just a question of where to put it? If yes, you could start with it here and see what feedback you get?
The idea I had for Ebib was to basically copy Zotero's field names and mappings. It has the advantage of not having to come up with a mapping myself, plus people may be familiar with it.
Up to you, but do you want to open a new issue for this?
I probably should. :slightly_smiling_face: But ATM it seems that UI mappings are only going to be used in Ebib, so I'm leaning towards including the mapping there.
Hi both, I wasn't aware that the field names are different for CSL json. This complicates matters a bit. I was hoping that people would be able to use bibtex and json sources side by side but the current design of bibtex-completion assumes one relevant set of field names. In the past this has already been a problem for people who're using the biblatex format (e.g. date instead of year). I think bibtex-completion will need a complete redesign in order to support all three formats simultaneously.
It would be relatively easy to support just CSL json or just biblatex, but I doubt that this is going to be a satisfying solution for anyone. For instance, even people who are personally using biblatex sometimes need to work with bibtex because many journals require it. Hm ...
Perhaps we'll need separate biblatex-completion and csljson-completion. Users of Helm can easily fuse these together at the UI level. Not sure it will be possible with ivy and completing-read UIs.
I do indeed think ideally one can mix-and-match sources without hassle, for users or developers alike.
That year/date issue has bitten me. With CSL, you also get issued etc.
So we somehow need a mechanism to do this mapping, in an easy and performant way.
So we somehow need a mechanism to do this mapping, in an easy and performant way.
One stupid simple solution would be to convert the CSL json to BibTeX (on disk) and just use that for bibtex-completion purposes. This is what we currently do with org-bibtex. Not pretty but it works and keeps things manageable. I guess any user of CSL json probably wants a BibTeX version anyway for LaTeX authoring?
So we somehow need a mechanism to do this mapping, in an easy and performant way.
One stupid simple solution would be to convert the CSL json to BibTeX (on disk) and just use that for bibtex-completion purposes. This is what we currently do with org-bibtex.
What do you use for conversion? Or am I misunderstanding and isn't it the case that org-bibtex can convert CSL-JSON to BibTeX?
Not pretty but it works and keeps things manageable. I guess any user of CSL json probably wants a BibTeX version anyway for LaTeX authoring?
I wouldn't assume that. Pandoc makes it possible to author publications without going through LaTeX (LaTeX isn't even needed for PDF output, though it's still an option), and with org-cite, Org mode will, as well.
@tmalsburg - correct me if I'm wrong, but the key places where you use the field names to pull data are with the bibtex-completion-get-value calls?
So what if he added a parsebib-get-value analog, which handled that mapping?
It might need to settle on one set of field names (edit: at least as fallback?), say biblatex, but then you could do:
(parsebib-get-value "date")
... and it would pull a "year" value from bibtex, or an issued from csljson.
So we somehow need a mechanism to do this mapping, in an easy and performant way.
Suggestions? The thing is, I'm not even sure how bibtex-completion works, exactly. Personally, I see two ways to convert CSL-JSON fields to BibTeX / biblatex: (1) convert the field names in the entry alists themselves, or (2) translate field names on the fly when a field's value is requested.
Option 1 means that all CSL-JSON fields are converted (optionally only the ones explicitly requested) and their new field names stored back in the individual entry alists. I suspect that would be a highly expensive operation, performance-wise.
Option 2 can probably be implemented more economically, even if it needs to be done for the entire database, because the alists themselves do not need to be modified. But it depends on how bibtex-completion accesses the data, which, as I mentioned, I do not know...
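A rough sketch of what Option 1 could look like, under a hypothetical one-way field mapping (Option 2 would look much like the assoc-string-based lookup sketched earlier in the thread):

;; Hypothetical sketch of Option 1: rewrite CSL-JSON field names in an entry
;; alist so downstream code can use Bib(La)TeX names. Doing this for every
;; entry in a large database is the performance concern mentioned above.
(defvar my-csl-to-bibtex
  '(("issued" . "year") ("container-title" . "journal"))
  "Example one-way mapping from CSL-JSON to Bib(La)TeX field names.")

(defun my-rename-fields (entry)
  "Return a copy of ENTRY with CSL-JSON field names replaced by Bib(La)TeX ones."
  (mapcar (lambda (field)
            (let ((new (cdr (assoc-string (car field) my-csl-to-bibtex 'case-fold))))
              (if new (cons new (cdr field)) field)))
          entry))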
@tmalsburg - correct me if I'm wrong, but the key places where you use the field names to pull data are with the bibtex-completion-get-value calls? So what if he added a parsebib-get-value analog, which handled that mapping? It might need to settle on one set of field names, say biblatex, but then you could do:
(parsebib-get-value "date")
... and it would pull a "year" value from bibtex, or an "issued" from csljson.
Yes, that's basically what I mean. :slightly_smiling_face:
What do you use for conversion? Or am I misunderstanding and isn't it the case that org-bibtex can convert CSL-JSON to BibTeX?
The conversion is the responsibility of the user. bibtex-completion just assumes that there is an analogous .bib for every .org. How exactly users convert to BibTeX I don't know since I'm not using org-bibtex myself.
I wouldn't assume that. Pandoc makes it possible to author publications without going through LaTeX
Sorry, I should have said "most users". Even though it's cool that org is gaining citation capabilities that don't rely on LaTeX, my suspicion is that LaTeX is going to remain the primary way to handle citations for most. I may be totally wrong of course but LaTeX is pretty deeply engrained in the academic publishing system.
So what if he added a parsebib-get-value analog, which handled that mapping?
The problem is that the mapping may not be 1-to-1. An example is biblatex date vs. BibTeX day, month, year. Similar issues may arise with CSL (which I'm not familiar with yet).
It's true data like names and dates are more complicated than simple strings. But it shouldn't be hard to address those.
It's true data like names and dates are more complicated than simple strings. But it shouldn't be hard to address those.
Are there other fields where conversion would be problematic? My impression is that there aren't. And I agree it shouldn't be hard to come up with something.
I could come up with a parsebib-get-value that assumes biblatex fields and translates them when necessary. I'd start small, with the most important fields, and then add more mappings when they come up.
Are there other fields where conversion would be problematic?
I don't know. You're probably in a better position to tell since you know CSL and I don't.
I could come up with a parsebib-get-value that assumes biblatex fields and translates them when necessary.
But wouldn't it be confusing for a user with a CSL bibliography if they had to specify formatting strings using bibtex field names? Equally, I'd find it annoying if I had to learn biblatex terminology even though my bibliography is in bibtex. It may be the best solution anyway, but it's not pretty.
By the way, I would use BibTeX field names as the default, not biblatex. My experience is that the majority of users are using the BibTeX format (e.g. year instead of date) even if they're using biber/biblatex in their LaTeX workflow. Plus, Crossref and basically all journals export classic BibTeX.
Option 2. can probably be implemented more economically, even if it needs to be done for the entire database, because the alists themselves do not need to be modified. But it depends on how bibtex-completion accesses the data, which, as I mentioned, I do not know...
Performance doesn't just depend on bibtex-completion, but also on the UI frontend. Helm is pretty clever in only formatting entries that actually show up on the screen, whereas Ivy (I think) formats all entries (last time I checked). Not sure about the completing-read UI.
Are there other fields where conversion would be problematic?
I don't know. You're probably in a better position to tell since you know CSL and I don't.
In the future, probably titles.
I could come up with a parsebib-get-value that assumes biblatex fields and translates them when necessary.
But wouldn't it be confusing for a user with a CSL bibliography if they had to specify formatting strings using bibtex field names? Equally, I'd find it annoying if I had to learn biblatex terminology even though my bibliography is in bibtex. It may be the best solution anyway, but it's not pretty.
By the way, I would use BibTeX field names as the default, not biblatex. My experience is that the majority of users are using the BibTeX format (e.g. year instead of date) even if they're using biber/biblatex in their LaTeX workflow. Plus, Crossref and basically all journals export classic BibTeX.
This is what I was hinting at above with my note on "fallback".
Bibtex is the older, more limited, format.
But this mechanism could include both. I guess performance could become an issue, depending on the details ...
Performance doesn't just depend on bibtex-completion, but also on the UI frontend. Helm is pretty clever in only formatting entries that actually show up on the screen, whereas Ivy (I think) formats all entries (last time I checked). Not sure about the completing-read UI.
In bibtex-actions, I can't use bibtex-completion-candidates without sacrificing things that users want (like match highlighting), so I have to recreate my own pre-formatted candidates from that.
It might in theory be better for me to just use parsebib-parse, etc for this directly, but I do need to format the full candidate list upfront, so this would likely be a bottleneck.
Are there other fields where conversion would be problematic?
I don't know. You're probably in a better position to tell since you know CSL and I don't.
Well, I've looked at the spec, but that's about it. :slightly_smiling_face:
But wouldn't it be confusing for a user with a CSL bibliography if they had to specify formatting strings using bibtex field names?
It would still be possible to access the data using the json field names, at least if conversion is done on the fly, not in the database itself. So you could do (parsebib-get-value 'year <json-entry>) and get the value of the issued field, but you could also do (parsebib-get-value 'issued <json-entry>). The only thing that would not be possible is (parsebib-get-value 'issued <bib-entry>). (Unless I also add a mapping from CSL-JSON fields to biblatex fields, of course.)
Equally, I'd find it annoying if I had to learn biblatex terminology even though my bibliography is in bibtex. It may be the best solution anyway, but it's not pretty.
From what I can tell by looking at the helm-bibtex readme, users don't normally have to deal with the field names at all, except when they want to customise the search display. And in that case it's probably safe to assume they know the underlying format well enough.
It would mainly be a convenience for you, so that you don't have to write things like:
(or (bibtex-completion-get-value 'year entry)
    (bibtex-completion-get-value 'date entry)
    (bibtex-completion-get-value 'issued entry))
Instead, you could write
(parsebib-get-value 'year entry)
and parsebib would make sure you get the right value, regardless of the format of entry.
By the way, I would use BibTeX field names as the default, not biblatex.
Since biblatex is the more expressive format, I would prefer those, because it should in theory be easier to go from biblatex field to BibTeX field than vice versa. Though in practice it might not matter that much.
You do have the stringify functions, where you can handle the more complex fields.
How would that interact with this; say if you wanted to specify a year or month for a date, main title from a title, last names for authors, etc.?
It would still be possible to access the data using the json field names,
Wouldn't this create room for ambiguity? Say BibTeX has field A that maps to field B in CSL, but CSL also has a field A, then it's not clear which A is being requested. Not sure whether such a scenario will arise, perhaps not, but it's at least technically possible.
True. I don't think there are many occasions, but at least the type field comes to mind, which biblatex uses to record subtypes of certain entry types (e.g., Thesis with type = "PhD Thesis"), while CSL-JSON uses it to record the entry type.
Perhaps it'd be possible to check for which fields this risk arises and handle them specially.
It's kinda up to you if you want such a mapping or not. :slightly_smiling_face: In Ebib, I already distinguish between BibTeX and biblatex files with their different sets of entry types and fields, so it won't be much of a problem to add a third database format. So ATM I don't think I'd be using this mapping.
You do have the stringify functions, where you can handle the more complex fields. How would that interact with this; say if you wanted to specify a year or month for a date, main title from a title, last names for authors, etc.?
Not sure what you mean... What scenario do you have in mind, exactly?
Not sure what you mean... What scenario do you have in mind, exactly?
Well, bottom line, these are the two templates I have, with their defaults.
So of note, in this line:
'((t . "${author:20} ${title:48} ${year:4}"))
... "author" is actually formatted with some bibtex-completion
function that prints a list of author last names, while "year" obviously will pull bibtex "year", but also (via some other bibtex-completion
code) biblatex "date", and so the "4" just pulls the first four characters.
So bibtex-completion already does some mapping and data formatting.
I was just wondering how something similar could work with a parsebib-get-value function.
Perhaps "year" would become "date", but the template would otherwise stay the same?
I was just wondering how something similar could work with a parsebib-get-value function. Perhaps "year" would become "date", but the template would otherwise stay the same?
My thinking now is that parsebib-get-value would take an optional argument that controls this behaviour. If nil, it would just call assoc-string and return the value. If non-nil, it would also try alternative fields, based on some schema.
I would probably keep year, but if that field doesn't exist, check date as well. If that doesn't yield anything, issued would be tried next.
Note that this wouldn't necessarily just be to accommodate the different formats. If author yields nil, one may well want to get the editor field instead.
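A hedged sketch of what that optional-argument behaviour might look like; the fallback schema and the names below are illustrative, not parsebib's actual implementation:

;; Illustrative only; the eventual parsebib-get-value may look quite different.
(require 'cl-lib)

(defvar my-field-fallbacks
  '((year   . (date issued))
    (author . (editor)))
  "Example schema mapping a field to the alternative fields to try, in order.")

(defun my-get-value-with-alternatives (field entry &optional alternatives)
  "Return the value of FIELD in ENTRY.
If ALTERNATIVES is non-nil and FIELD is absent, try the fields listed
for FIELD in `my-field-fallbacks', in order."
  (or (cdr (assoc-string field entry 'case-fold))
      (and alternatives
           (cl-some (lambda (alt) (cdr (assoc-string alt entry 'case-fold)))
                    (cdr (assoc-string field my-field-fallbacks 'case-fold))))))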
I was just wondering how something similar could work with a parsebib-get-value function. Perhaps "year" would become "date", but the template would otherwise stay the same?
My thinking now is that parsebib-get-value would take an optional argument that controls this behaviour. If nil, it would just call assoc-string and return the value. If non-nil, it would also try alternative fields, based on some schema.
Yeah, was thinking the same. Something like:
(parsebib-get-value 'author entry 'short)
... so we can get rendering like:
Edit: could also do:
(parsebib-get-value 'title entry 'short)
... which now could pull csl-json title-short if available, or in the future the main title; could split a full title, etc.
Templates could be adapted to support that with something like:
{author:15/short}
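For what a short display variant of the author field might involve, a rough, hypothetical sketch (it assumes "Family, Given and Family, Given" input; real name parsing is considerably more involved):

;; Hypothetical: keep family names only.
(defun my-author-short (authors)
  "Return a comma-separated list of family names extracted from AUTHORS."
  (mapconcat (lambda (name) (car (split-string name ", *")))
             (split-string authors " +and +")
             ", "))

;; (my-author-short "Doe, Jane and Smith, John") => "Doe, Smith"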
I would probably keep year, but if that field doesn't exist, check date as well. If that doesn't yield anything, issued would be tried next.
Note that this wouldn't necessarily just be to accommodate the different formats. If author yields nil, one may well want to get the editor field instead.
Right!
Hi both. I'm moving later this week. Lots of boxes to pack. I will catch up with this thread next week.
FYI, @joostkremers, I've opened a linked issue for how I'd adapt bibtex-actions to this.
Happy to experiment once you have parsebib-get-value working.
I'd think a similar approach would work for bibtex-completion.
Caveat: some bibtex-completion functions, which bibtex-actions depends on, currently depend on its parsing code. See, for example, bibtex-completion-show-entry. But it looks like those should be easy enough to adapt to use parsebib-get-value there.
I also retitled this issue.
Good luck with the moving @tmalsburg!
FYI, @joostkremers, I've opened a linked issue for how I'd adapt bibtex-actions to this.
Cool. I've subscribed so I'll be kept up-to-date.
Happy to experiment once you have parsebib-get-value working.
About that: I'm not sure we're exactly on the same page here... :slightly_smiling_face: You mention being able to do:
(parsebib-get-value 'author entry 'short)
But I'm not sure what short should mean. My idea was to have something like:
(parsebib-get-value 'author entry 'alternatives)
where alternatives indicates that if author doesn't exist in entry, it should try editor next; and if you pass year:
(parsebib-get-value 'year entry 'alternatives)
you'll get the value of date if year does not exist, and of issued in case date also does not exist.
If you need parsebib-get-value to do more, feel free to let me know.
Good luck with the moving @tmalsburg!
Hear, hear!
I misread/got ahead of you earlier.
To clarify, the 'alternatives idea is a good one, and useful.
What I was referring to there is obviously different, if related, but hopefully self-explanatory.
The author string you have currently for "display", for example, includes the full names, and so something like that example would just ask for the short names; a display variant, if you will.
Not sure it's needed, but are you thinking to have those alternatives configurable somehow?
The author string you have currently for "display", for example, includes the full names, and so something like that example would just ask for the short names; a display variant, if you will.
That would be possible, of course. It could also be done during parsing, BTW. I don't know which option would be better.
Not sure it's needed, but are you thinking to have those alternatives configurable somehow?
The way JSON name fields are stringified during parsing is configurable with the variable parsebib-json-name-field-template. That could be generalised.
The way JSON name fields are stringified during parsing is configurable with the variable parsebib-json-name-field-template.
That's perfect, and is the main thing I need for this.
That could be generalised.
So basically just add a new defvar as needed? You have one for names and another for date, which is really all we need ATM I think.
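A hedged usage sketch of that kind of variable: it assumes the name-field template can be let-bound around a parse and that the placeholders use CSL-JSON name-part keys such as family (both assumptions; the template string shown is not the default, and the file path is a placeholder):

;; Assumption-laden sketch; check parsebib's docs for the actual template keys.
(require 'parsebib)
(let ((parsebib-json-name-field-template "{family}"))
  (parsebib-parse "~/bib/references.json"))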
Now reading the docs again, and this section in particular, is the idea that you are normalizing on EDTF for dates? That sentence that begins "Date fields (as defined by parsebib--json-date-fields) are converted" is a little confusing to me.
Now reading the docs again, and this section in particular, is the idea that you are normalizing on EDTF for dates? That sentence that begins "Date fields (as defined by parsebib--json-date-fields) are converted" is a little confusing to me.
The code might be clearer... :worried: The variable parsebib--json-date-field holds a list of fields that are date fields. If such a date field's value is a string, it is not modified. If it is an object, it is converted to a string using the template "{circa }{season }{start-date}{/end-date}{literal}{raw}". Unlike name fields, however, that template isn't let-bindable, because it doesn't apply to the fields in the object directly.
The details are in the function parsebib--json-stringify-date-field, but basically, if a date field just contains a date or a year, the resulting string has the form "2021-4-22" or "2021". If season or circa are present, it may also be "Summer 2012" or "ca. 2000", etc.
parsebib--json-stringify-date-field has an extra argument short, which, if t, returns just the year, which I guess is what you need.
is the idea that you are normalizing on EDTF for dates?
If it is an object, it is converted to a string using the template "{circa }{season }{start-date}{/end-date}{literal}{raw}".
So that template is similar to EDTF.
Any estimate of when you can get back to and merge this @joostkremers?
With org-cite now merged, would be great to get json support in bibtex-completion et al.
With org-cite now merged, would be great to get json support in bibtex-completion et al.
I'm not sure that bibtex-completion can accommodate CSL. CSL seems to differ in too many respects and breaks too many assumptions we're making in bibtex-completion. bibtex-completion has difficulties accommodating even the biblatex dialect, which is not terribly different from bibtex. If we force biblatex and CSL into bibtex-completion, my worry is that the code becomes buggy and impossible to understand and maintain. My impression is that we may need separate csl-completion and biblatex-completion modules that can be plugged in elsewhere. For compatibility, their interfaces should mirror the API of bibtex-completion and there is perhaps also some code that can be shared. I think this would give us much better support for each individual format, more flexibility, and more reliable / correct code.