joostkremers / parsebib

Elisp library for reading .bib files

String definitions are always returned braced, even when they are unbraced in the file #16

Closed Hugo-Heagren closed 2 years ago

Hugo-Heagren commented 2 years ago

I have the following string definitions:

@String{up = {University Press}}
@String{cup = "Cambridge " # up}
@String{oup = "Oxford " # up}

BibTeX and biblatex both understand such recursive strings. Ebib does not (yet), so I was working on a PR to fix this. Importantly, given the above configuration, cup will render as Cambridge University Press, but given this config it won't:

@String{up = {University Press}}
@String{cup = {"Cambridge " # up}}

In this case, Bib(La)TeX recognises that the value is braced and therefore 'protected'. I found I couldn't distinguish the cases in ebib, because for both configurations it returns the abbrev's value/definition as braced (that is, in both cases it returns the string {"Cambridge " # up}). I had to unbrace this to get the expansion to work, but that misses the situations where there are intentional braces to protect the values.

Tracing the code, ebib just returns whatever is in its database, and that is determined by parsebib. Honestly the code is a bit dense for me, but I'm fairly sure the problem is in parsebib-read-string.

(As a side note, looking at parsebib's docstrings, it seems that it can expand more complex strings in field values like "Some " # up. I was writing a PR which modified ebib-get-string to split such strings and process them recursively. Is this worth it, or is there an easier way with parsebib?)

joostkremers commented 2 years ago

(that is, in both cases it returns the string {"Cambridge " # up}.) I had to unbrace this to get the expansion to work, but that misses the situations where there are intentional braces to protect the values.

I'm not sure I understand you correctly. You say that biblatex doesn't recognise @String{cup = {"Cambridge " # up}} as a recursive abbrev, so isn't it correct that Ebib doesn't, either?

I mean, if I put the entire expansion in braces, including the # up part, as in:

@String{cup = {"Cambridge " # up}}

above, I wouldn't expect up to be expanded, right?

(Perhaps you meant @String{cup = {Cambridge } # up}? Or perhaps I misunderstand what you're trying to say.)

Tracing the code, ebib just returns whatever is in its database, and that is determined by parsebib.

Yes, but that's by design. Parsebib can either return the contents of the .bib file as is, or return it in a form that's suitable for human consumption. In the latter case, however, it's not possible to reconstruct the original contents from the data that parsebib returns. Since Ebib must be able to write the data back to the .bib file (and do so in a manner that is suitable for version control), Ebib has parsebib read the data as-is.

If asked to expand @Strings, parsebib does handle such expansions correctly, but not in a way such that you could simply plug in some parsebib function and have it work, because parsebib expands @String abbrevs while it is reading the .bib file and stores the result of the expansion. So suppose you have:

@String{up = {University Press}}
@String{cup = {Cambridge } # up}
@String{ncup = {New } # cup}

When parsebib reads the second @String, it stores the value of cup as "Cambridge University Press", not as "{Cambridge } # up".

As a result, when parsebib then reads the third @String and expands the value {New } # cup, it doesn't need to do a recursive lookup for cup, because cup is already associated with the value Cambridge University Press in the hash table.
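The read-time strategy described above can be sketched as follows. This is only an illustration under stated assumptions, not parsebib's actual code: the `my/` functions are hypothetical, literal parts are assumed to be braced, and parts are assumed to be separated by " # ".

```elisp
;; Hypothetical sketch of "expand while reading"; not parsebib's real code.

(defun my/expand-eagerly (raw table)
  "Expand RAW using the already-expanded values stored in hash TABLE.
RAW is split on \" # \".  Braced parts lose their braces; any other
part is looked up in TABLE and kept as-is when it has no entry."
  (mapconcat
   (lambda (part)
     (if (and (> (length part) 1)
              (string-prefix-p "{" part)
              (string-suffix-p "}" part))
         (substring part 1 -1)
       (gethash part table part)))
   (split-string raw " # ")
   ""))

(defun my/read-strings (definitions)
  "Return a hash table with the expansions of DEFINITIONS.
DEFINITIONS is a list of (ABBREV . RAW) pairs in file order.  Each
value is expanded before it is stored, so later definitions find
fully expanded values and need no recursive lookup."
  (let ((table (make-hash-table :test #'equal)))
    (dolist (def definitions table)
      (puthash (car def) (my/expand-eagerly (cdr def) table) table))))
```

With the three definitions above passed in file order, `(gethash "ncup" table)` gives "New Cambridge University Press", but the raw text "{New } # cup" is no longer available, which is why this approach does not suit a program that has to write the file back.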

So this won't help you solve the problem in Ebib. In Ebib, you'd need to do the expansion recursively. Something like (untested):

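(defun ebib--expand-@strings-in-value (value)
  (cond
   ;; VALUE is an abbrev: return its stored definition.
   ((ebib-db-get-string value ebib--cur-db 'noerror))
   ;; VALUE is braced or quoted: strip the delimiters.
   ((not (ebib-unbraced-p value)) (ebib-unbrace value))
   ;; Nothing left to split: return VALUE as-is (avoids endless recursion).
   ((not (string-match-p " # " value)) value)
   (t (mapconcat #'ebib--expand-@strings-in-value (split-string value " # ") ""))))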

Mind you, this assumes that the # character is always surrounded by spaces. Parsebib also makes this assumption, though I honestly don't know if it's valid. (So far, no-one's ever complained about it, though. :smile: ) And yes, this function will try to find an expansion for such strings as "{Cambridge } # up", but it's simpler to just let that fail than to first filter out strings that we can be sure won't have an expansion.
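If the assumption about spaces around # ever turns out to be too strict, one alternative is to split only on # characters that sit outside braces and quotes, and to trim whitespace afterwards. The following is a minimal sketch of that idea; `my/split-concatenation` is a hypothetical name, not a function of Ebib or parsebib, and escaped braces are not handled.

```elisp
(require 'subr-x) ; for `string-trim' on older Emacs versions

(defun my/split-concatenation (value)
  "Split VALUE on `#' characters that are outside braces and quotes.
Return the list of parts with surrounding whitespace trimmed.
Hypothetical helper for illustration only."
  (let ((depth 0) (in-quote nil) (start 0) (i 0) (parts nil))
    (while (< i (length value))
      (let ((ch (aref value i)))
        (cond
         ((eq ch ?\{) (setq depth (1+ depth)))
         ((eq ch ?\}) (setq depth (max 0 (1- depth))))
         ;; A double quote only acts as a delimiter at brace depth 0.
         ((and (eq ch ?\") (= depth 0)) (setq in-quote (not in-quote)))
         ;; A top-level `#' ends the current part.
         ((and (eq ch ?#) (= depth 0) (not in-quote))
          (push (string-trim (substring value start i)) parts)
          (setq start (1+ i)))))
      (setq i (1+ i)))
    (push (string-trim (substring value start)) parts)
    (nreverse parts)))
```

A value that is braced as a whole, such as {"Cambridge " # up}, comes back as a single part, so a # inside protective braces is never treated as a concatenation operator.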

BTW, perhaps you also want to reduce sequences of whitespace characters to a single space, as parsebib does. See parsebib--expand-strings for details. (Note that parsebib cannot use ebib--unbrace, in case you're wondering about the string-match in that function. The string-match would of course get things like {Cambridge} # u # {Press} wrong, but such strings never reach parsebib--expand-strings, because they're split on # before that.)

Hugo-Heagren commented 2 years ago

Thanks for the detailed answer!

I'm not sure I understand you correctly. You say that biblatex doesn't recognise @String{cup = {"Cambridge " # up}} as a recursive abbrev, so isn't it correct that Ebib doesn't, either?

Sorry, I'll try to be a bit more clear. First, that's right: biblatex doesn't recognise @String{cup = {"Cambridge " # up}} as a recursive definition, so neither should ebib. It doesn't at the moment, and that's good.

The first problem is that at the moment, ebib doesn't recognise any recursive strings. Biblatex would recognise @String{cup = "Cambridge " # up} as recursive and expand it to Cambridge University Press, but ebib doesn't. I was trying to solve this (by writing something similar to what you suggested). In doing so, I had to distinguish between these two cases:

@String{cup = {"Cambridge " # up}} shouldn't be expanded
@String{cup = "Cambridge " # up} should be expanded

The difference is whether they are stored with {braces}, so I needed to test that. But whichever definition I have in the file, (ebib-get-string "cup" ebib--cur-db) returns the same thing: the string {"Cambridge" # up}. This is the second problem. Because they both return the same thing, I can't test whether the definition is braced or not, so I can't distinguish the two cases. And this functionality seemed to come from parsebib, not ebib.
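Once the stored value matches what is in the file, the two cases could be told apart with a check like the one below. It is a sketch only: `my/wholly-braced-p` is a hypothetical helper, not part of Ebib or parsebib, and it ignores escaped braces.

```elisp
(defun my/wholly-braced-p (value)
  "Return non-nil if VALUE is a single brace group, e.g. \"{foo # bar}\".
The brace that opens VALUE must be closed by its very last character.
Hypothetical helper for illustration only."
  (let ((len (length value))
        (depth 0)
        (i 0)
        (result nil)
        (done nil))
    (when (and (> len 1) (eq (aref value 0) ?\{))
      (while (and (< i len) (not done))
        (let ((ch (aref value i)))
          (cond ((eq ch ?\{) (setq depth (1+ depth)))
                ((eq ch ?\}) (setq depth (1- depth)))))
        ;; Depth 0 means the brace opened at position 0 closes here.
        (when (= depth 0)
          (setq result (= i (1- len))
                done t))
        (setq i (1+ i))))
    result))
```

The check returns non-nil for {"Cambridge " # up} and nil for "Cambridge " # up. It also returns nil for {Cambridge } # up, which starts with a brace but is a concatenation rather than one protected group.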

(Perhaps you meant @String{cup = {Cambridge } # up}? Or perhaps I misunderstand what you're trying to say.)

In fact, I think cases like this can be handled already. Having split such a string on "#" (and accounted for whitespace, as you mention below), some of the resulting elements will be braced and some will not (similarly, some might be "quoted", and some might not). I can test for this in the normal way when expanding each element. The code I was working on already accounted for "quoted" strings in this way and it worked in all the cases I came across.

Tracing the code, ebib just returns whatever is in its database, and that is determined by parsebib.

So from what's above I hope my original comment is a bit clearer. What I meant was: whether the in-file definition is braced or not, ebib-get-string always returns a braced string. I couldn't find a reason for this in ebib itself, but I did find that what is returned depends on a parsebib function, which I got lost in, so I assumed that this behaviour was due to parsebib.

Yes, but that's by design.

I think you are referring to something slightly different: the fact that parsebib does not supply expanded versions of strings' definitions to ebib when queried. This makes sense to me, otherwise ebib would write Cambridge University Press as a simple string (without the "#"s or other string uses) as the value of cup when saving strings, which would be wrong. But that doesn't immediately explain to me why strings' definitions are always returned braced, whether they are originally braced or not.

So this won't help you solve the problem in Ebib. In Ebib, you'd need to do the expansion recursively. Something like (untested):

Yep, I had something very similar to your code.

Mind you, this assumes that the # character is always surrounded by spaces. Parsebib also makes this assumption, though I honestly don't know if it's valid.

I did consider diving into the bib(la)tex code and seeing what actual limitations are put on these strings but that might be a bit much for me. I'm not really a consummate TeXnician.

BTW, perhaps you also want to reduce sequences of whitespace characters to a single space, as parsebib does. See parsebib--expand-strings for details.

Good tip! My version used (split-string str "#" t "[[:blank:]]"), which trims whitespace from the ends of each part after splitting, which has a similar effect. But I'll have a look at your version, thanks!

Hope that clarifies things.

joostkremers commented 2 years ago

Thanks for the thorough explanation, things are becoming clearer now.

Because they both return the same thing, I can't test whether the definition is braced or not, so I can't distinguish the two cases. And this functionality seemed to come from parsebib, not ebib.

I see what you mean now, and looking at the code, it's definitely Ebib that does this, not parsebib. And it's probably a bug. In fact, it's the "it's a feature, not a bug" kind of bug. :smile:

Ebib stores strings with the function ebib-set-string, which adds braces around the value if they're not there yet. Although I don't remember exactly, I'm pretty sure the idea behind this was that (a) @String values need to be in braces; and (b) a user shouldn't be forced to type these when entering a @String value in Ebib. After all, you don't need to type braces when you type a field value either.

This works fine, as long as you don't want to have @String values that contain @String abbreviations themselves, i.e., the kind of values that you are trying to use. I'm pretty sure that I didn't know that was even possible when I wrote this code.

So that's a genuine bug that needs to be fixed first, before recursive @Strings can be handled correctly. It should be as simple as removing ebib-(set|get)-string and updating the manual to state clearly that @String values must be entered with braces or quotes.

So this won't help you solve the problem in Ebib. In Ebib, you'd need to do the expansion recursively. Something like (untested):

Actually, I don't think it'll work, not even once the bug above is fixed. The first clause of the cond is gonna return "Cambridge" # up when you pass in cup and that'll be it.

I did consider diving into the bib(la)tex code and seeing what actual limitations are put on these strings but that might be a bit much for me. I'm not really a consummate TeXnician.

Let's not go there. :smile:

BTW, perhaps you also want to reduce sequences of whitespace characters to a single space, as parsebib does. See parsebib--expand-strings for details.

Good tip! My version used (split-string str "#" t "[[:blank:]]"), which trims whitespace from the ends of each part after splitting, which has a similar effect.

That will only remove whitespace around #. In parsebib--expand-strings I reduce all whitespace, even inside braces/quotes, because they may mess up the display, esp. if there are newlines. The same consideration applies to Ebib, because if you display the expansion of a string in the entry buffer, you need to be careful that there aren't any newlines in the value that is displayed. (It is possible to display values with newlines, but these need to be carefully prepared. If that sounds brittle, that's probably because it is. :worried: )
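The difference between trimming while splitting and collapsing all whitespace can be illustrated with a small sketch. `my/collapse-whitespace` is a hypothetical helper, and its regular expression is an assumption rather than the one parsebib uses.

```elisp
(defun my/collapse-whitespace (value)
  "Replace each run of whitespace in VALUE, newlines included, by one space.
Hypothetical helper for illustration only."
  (replace-regexp-in-string "[ \t\n\r\f]+" " " value))

;; Trimming while splitting only cleans up the ends of each part; the
;; newline and indentation inside the braces are left untouched.
(defvar my/example-parts
  (split-string "{Cambridge\n   University}  #  up" "#" t "[[:blank:]]+"))

;; Collapsing afterwards yields a value that fits on one display line.
(defvar my/example-collapsed
  (mapcar #'my/collapse-whitespace my/example-parts))
```

Here `my/example-parts` still contains the line break inside the first part, while `my/example-collapsed` holds "{Cambridge University}" and "up".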

Hugo-Heagren commented 2 years ago

probably a bug. In fact, it's the "it's a feature, not a bug" kind of bug. :smile:

The best kind of bug :smile:

(b) a user shouldn't be forced to type these when entering a @string value in Ebib. After all, you don't need to type braces when you type a field value either. .... It should be as simple as removing ebib-(set|get)-string and updating the manual to state clearly that @string values must be entered with braces or quotes.

Yes, I've been thinking about this too. Users like me will be in the tiny minority though, and even I mostly use strings with braced values. I think the ideal would be if strings had the same editing experience as field values: braced by default when entered, with those braces concealed in editing (again, like a field), and it should be possible to make them unbraced/'raw' by pressing r. There is currently no prefix argument used by ebib-add-string, so perhaps for users like me, that could be used for signalling that the string about to be entered needs to be unbraced from the off.

Furthermore, if we are going to support recursive strings, there may be a difference between the entered text and the expanded value of a given string (and in fact, for users like me this might be quite common). It might be an idea to add an extra column to the strings display buffer, so that the string's name (e.g. cup), unexpanded definition ("Cambridge " # up) and full expansion (Cambridge University Press) can all be displayed at once. The expansion could be displayed in ebib-abbrev-face. Asterisks to signal unbraced definitions (again, like for fields) would also probably be a good idea.

Actually, I don't think it'll work, not even once the bug above is fixed. The first clause of the cond is gonna return "Cambridge" # up when you pass in cup and that'll be it.

Well, for what it's worth, here's what I had. I've been using it locally for the last two days and it works for most of my Strings pretty well. I'm sure there are problems with it, but it's probably a place to start. I'll have a closer look at parsebib once I can and get it just right:

(defun ebib-expand-string (str db)
  "Recursively expand string abbrev STR in DB.
This accounts for constructions which concatenate \"quoted
strings\" and # concatenation, as well as expanding @string
definitions."
  (let ((list (split-string (ebib-unbrace str) "#" t "[[:blank:]]")))
    (mapconcat
     (lambda (str)
       (cond ((string-match "^\\\"\\(.+\\)\\\"$" str) (match-string 1 str))
             ((not (ebib-unbraced-p str)) (ebib-unbrace str))
             ((ebib-get-string str db 'noerror))
             (t str)))
     list "")))

(defun ebib-get-string (abbr db &optional noerror unbraced)
  "Return the value of @String definition ABBR in database DB.
NOERROR functions as in `ebib-db-get-string', which this
function calls to get the actual value.  The braces around the
value are removed if UNBRACED is non-nil."
  (if-let ((def (ebib-db-get-string abbr db noerror)))
      (let ((value (ebib-expand-string def db)))
        (if unbraced
            (ebib-unbrace value)
          value))))

(one useful improvement I think would be to add a NORECUR argument to ebib-get-string, with which it wouldn't expand anything, and would just return the definition (like "Cambridge " # up). This would be useful for the extra column in the strings buffer I suggested above.)

This might be a good time to mention why I'm actually shaving this yak. I noticed that, especially for people like me who use a lot of strings, it can be quite confusing that a string's definition is what shows up in the minibuffer for ebib-insert-abbreviation-current-field or when completing while editing a field. This is good behaviour, because after all it is the definition which will be inserted into the field. But I thought it would be useful to see the expanded rendering too, so I was looking at annotating these commands with expanded strings. This is actually not particularly complex, given the right functions for expanding the definitions. So here I am, shaving that particular yak.

joostkremers commented 2 years ago

[ideas for improving the strings buffer]

These all sound like very good ideas. It certainly makes sense to treat string values similar to field values, and showing the expansion in the strings buffer would also be very nice to have. It might be better to only do that for strings that are unbraced, though, because for string values that are braced, the value and the expansion are identical.

Well, for what it's worth, here's what I had. I've been using it locally for the last two days and it works for most of my Strings pretty well. I'm sure there are problems with it, but it's probably a place to start. I'll have a closer look at parsebib once I can and get it just right:

I think your code won't work for cases with more than one level of nesting. For example, suppose you have this:

@String{u = {University}}
@String{up = u # {Press}}
@String{cup = {Cambridge} # up}

Expanding cup, your code would produce "Cambridge{u # {Press}}" or "Cambridgeu # {Press}" if we can store "raw" string values.

You're using basically the same strategy that is used in parsebib (see parsebib--expand-strings; as does my code above, BTW). However, in parsebib, this strategy works because the result of the expansion is stored in the hash table (replacing whatever was read from the file). So in the above example, the value for up is expanded to "University Press" and this expansion is then stored in the hash table, so that when cup is expanded, the value that is retrieved for up is the expanded version.

In Ebib, this won't work, because the value that your ebib-expand-string retrieves for up is not expanded. So in Ebib, the function needs to be recursive.
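A recursive variant for a table that holds unexpanded definitions might look like the sketch below. The assumptions are: `my/expand-recursively` and `my/literal-p` are hypothetical names, the table maps abbrevs to their raw definitions, parts are separated by " # ", and no literal part contains " # " itself.

```elisp
(defun my/literal-p (part)
  "Return non-nil if PART is enclosed in braces or in double quotes."
  (and (> (length part) 1)
       (let ((open (aref part 0))
             (close (aref part (1- (length part)))))
         (or (and (eq open ?\{) (eq close ?\}))
             (and (eq open ?\") (eq close ?\"))))))

(defun my/expand-recursively (raw table &optional seen)
  "Expand RAW against hash TABLE, which maps abbrevs to raw definitions.
SEEN is the list of abbrevs currently being expanded; it stops the
recursion when definitions refer to each other in a circle."
  (mapconcat
   (lambda (part)
     (cond
      ;; Braced or quoted literal: strip one pair of delimiters.
      ((my/literal-p part) (substring part 1 -1))
      ;; Known abbrev that is not already being expanded: recurse.
      ((and (gethash part table) (not (member part seen)))
       (my/expand-recursively (gethash part table) table (cons part seen)))
      ;; Numbers, unknown abbrevs and circular references stay as-is.
      (t part)))
   (split-string raw " # ")
   ""))
```

With the raw values {University}, u # {Press} and {Cambridge} # up stored under u, up and cup, expanding "cup" gives "CambridgeUniversityPress"; there are no spaces because the example definitions contain none.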

(one useful improvement I think would be to add a NORECUR argument to ebib-get-string, with which it wouldn't expand anything, and would just return the definition (like "Cambridge " # up). This would be useful for the extra column in the strings buffer I suggested above.)

I'd prefer to do it the other way around: retrieving the value from the database as-is, without alterations, should be the default, as is the case with ebib-get-field-value and indeed the unbraced argument of the current ebib-get-string. Altering the retrieved value is only done if certain optional arguments are non-nil.

This might be a good time to mention why I'm actually shaving this yak. I noticed that, especially for people like me who use a lot of strings, it can be quite confusing that a string's definition is what shows up in the minibuffer for ebib-insert-abbreviation-current-field or when completing while editing a field. This is good behaviour, because after all it is the definition which will be inserted into the field. But I thought it would be useful to see the expanded rendering too, so I was looking at annotating these commands with expanded strings. This is actually not particularly complex, given the right functions for expanding the definitions. So here I am, shaving that particular yak.

I agree this would be useful. (In fact, with marginalia, it would even be possible to add the expansion to the choices in the minibuffer when selecting a string abbrev to insert into the current field.)

Hugo-Heagren commented 2 years ago

So that's a genuine bug that needs to be fixed first, before recursive @Strings can be handled correctly. It should be as simple as removing ebib-(set|get)-string and updating the manual to state clearly that @string values must be entered with braces or quotes.

So I thought I would have a look at this today. I'm not sure why you suggest removing the ebib-(get|set)-string functions entirely? They are useful functions and I'm not sure what would replace them. Given my suggestions about the strings buffer and handling strings like field values, wouldn't it make more sense to:

add a nobrace argument to ebib-set-string, to allow for the same raw/braced handling behaviour as with field values
modify ebib-get-string (if it needs modifying) to be sensitive to the braced/unbraced distinction
add an argument to ebib-get-string or ebib-get-field-value (or both?) to expand string definitions (otherwise they are returned unexpanded, as you suggest above)

joostkremers commented 2 years ago

So I thought I would have a look at this today. I'm not sure why you suggest removing the ebib-(get|set)-string functions entirely?

That suggestion was made before the bulk of the discussion, before I realised that the problem is more complex than I thought at that point. So you are right, they should not be removed.

add a nobrace argument to ebib-set-string, to allow for the same raw/braced handling behaviour as with field values

Yes.

modify ebib-get-string (if it needs modifying) to be sensitive to the braced/unbraced distinction

Yes.

Add an argument to ebib-get-string or ebib-get-field-value (or both?) to expand string definitions (otherwise they are returned unexpanded, as you suggest above)

To both, I'd say. When retrieving a field value, you want to be able to have strings expanded, so you need such an argument in ebib-get-field-value, but the actual expansion can probably be done in ebib-get-string, because that's what ebib-get-field-value uses to get the string value. So the argument needs to be passed on to ebib-get-string.

Hugo-Heagren commented 2 years ago

That suggestion was made before the bulk of the discussion, before I realised that the problem is more complex than I thought at that point. So you are right, they should not be removed.

Ah right, I understand.

When retrieving a field value, you want to be able to have strings expanded, so you need such an argument in ebib-get-field-value, but the actual expansion can probably be done in ebib-get-string, because that's what ebib-get-field-value uses to get the string value. So the argument needs to be passed on to ebib-get-string.

Yes, this sounds like a sensible way to do it. I'll look at implementing this if I have time soon. In the meantime, it might be a good idea to transfer this issue to the ebib repo? The original issue, the actual code changes and most of the discussion are all concerned with it, not parsebib (which is my fault for opening the issue in the wrong place, sorry!)

joostkremers commented 2 years ago

Yes, I guess it's more an Ebib issue now. Do you want to open an issue there?