joostkremers / parsebib

Elisp library for reading .bib files
BSD 3-Clause "New" or "Revised" License
35 stars 9 forks source link

Optional `strings` parameter for string replacements in `read-*` functions #6

Closed malb closed 7 years ago

malb commented 7 years ago

If an optional strings parameter is supplied, this is now used to perform string replacements, cf #5.

Tested with this file:

@string{springer =              "Springer, Heidelberg"}
@string{cryptoname =            "CRYPTO"}
@string{cryptopub =             springer}
@string{cryptomonth =           aug}
@string{cryptoaddr =            ""}
@string{crypto81 =              "C81"}
@string{crypto81key =           "CRYPTO 1981"}
@string{crypto81name =          cryptoname # "'81"}
@string{crypto81ed =            "Allen Gersho"}
@string{crypto81vol =           "ECE Report 82-04"}
@string{crypto81month =         ""}

@InProceedings{C:Shamir81,
  author =       "Adi Shamir",
  title =        "The Generation of Cryptographically Strong Pseudo-Random Sequences",
  pages =        {1},
  crossref =     crypto81,
}

@Proceedings{C81,
  title =        crypto81name,
  editor =       crypto81ed,
  booktitle =    crypto81name,
  volume =       crypto81vol,
  address =      cryptoaddr,
  month =        crypto81month,
  publisher =    ucsb-ece,
  series =       "",
  year =         1981,
  key =          crypto81key,
}

and this function

(defun test-parsebib ()
  (interactive)
  (goto-char (point-min))
  (let* ((ht-strings (make-hash-table :test #'equal :size 2048))
         (entry-type (parsebib-find-next-item))
         (strings ())
         (entries ()))

    (while entry-type
      (cond
       ((string= (downcase entry-type) "string")
        (let ((entry (parsebib-read-string (point) ht-strings))) 
          (if entry
              (progn (push entry strings)
                     (puthash (car entry) (cdr entry) ht-strings)))))
       ((not (member-ignore-case entry-type '("preamble" "string" "comment")))
        (add-to-list 'entries (parsebib-read-entry entry-type (point) ht-strings)))) 
      (setq entry-type (parsebib-find-next-item)))
    (message "%s" entries)
    entries))
malb commented 7 years ago

Note that this changes the behaviour of parsebib-read-string and parsebib-read-entry somewhat:

  1. parsebib-read-string it is now more liberal in what it accepts and
  2. both functions strip "\"" from the strings/fields. I couldn't make BibTeX string concatenation work otherwise.

Both behaviours might be undesirable (?)

joostkremers commented 7 years ago

This looks great, thanks!

  1. parsebib-read-string it is now more liberal in what it accepts and

How, exactly? I'm looking at the code but I can't figure it out. :smile: I noticed you changed some of the regexes, but that would seem to make parsebib-read-string less liberal.

  1. both functions strip "\"" from the strings/fields. I couldn't make BibTeX string concatenation work otherwise.

I don't see that as a problem.

malb commented 7 years ago

Consider

@string{acisp15month =          jun # "~/~" # jul}

currently, parsebib would return ("acisp15month" . "\"~/~\"") with this PR it will return ("acisp15month" . "jun~/~jul") (possibly with replacements applied if they are provided as a dict) I believe this to be correct, but this is what I meant by "more liberal"

malb commented 7 years ago

I'll look at that conflict, too

joostkremers commented 7 years ago

I'll look at that conflict, too

No need, it's just a name clash (I renamed a variable). It's already fixed in my local repo, but I have a few more changes that I'd like to test first before pushing them to github.

joostkremers commented 7 years ago

Consider

@string{acisp15month =          jun # "~/~" # jul}

currently, parsebib would return ("acisp15month" . "\"~/~\"") with this PR it will return ("acisp15month" . > "jun~/~jul") (possibly with replacements applied if they are provided as a dict) I believe this to be correct, > but this is what I meant by "more liberal"

Thanks. I think that would actually qualify as "more correct" rather than just "more liberal". I never realised that string definitions could contain other abbrevs. (Yeah, I know, your example bib file has it, so I could have known. :wink: )

Thanks for the code. It's already in my local repo's master and I'll push it to Github in the next few days. Then @tmalsburg can update helm-bibtex.

malb commented 7 years ago

Thanks!

joostkremers commented 7 years ago

Sorry, I should have realised right away that the double quotes are important. That is, in your example file above, the line:

@string{cryptopub =             springer}

should be parsed as ("cryptopub" . "springer"), but the line:

@string{cryptoname =            "CRYPTO"}

should be parsed as ("cryptoname" . "\"CRYPTO\"").

Furthermore, your code doesn't seem to handle braces {} correctly. Note that @strings can also be defined as:

@string{cryptoname =            {CRYPTO}}

which is equivalent to using "CRYPTO". Parsing this with your code yields ("cryptoname" . "CRYPTO}"). What it should yield is ("cryptoname" . "{CRYPTO}").

What's also not handled correctly is a field value like this:

publisher =    ucsb-ece # {, Germany},

When expanded, this should yield the string "Springer, Heidelberg, Germany", but instead it yields "ucsb-ece{, Germany}".

The reason parsebib needs to be this precise is that Ebib is not just a front-end for displaying the contents of bib files (as helm-bibtex is), but also for editing bib files. So it needs to know whether a field value is simply a string value or whether it contains abbreviations.

Another thing: it occurred to me that it's not safe to split a string/field value on #. On the one hand, the white space is optional, so BibTeX (well, Biber, since that is what I tested with, but I suspect BibTeX behaves the same way) accepts the following just fine:

note = {Note: }#SLL

SLL is expanded just fine here. Furthermore, if we'd change the code to split on # without the spaces, we run the risk of wrongly splitting field values that contain a # as part of normal text. (And note that in something like \verb+#+ the # sign can appear in e.g., a title without having to be written as \#).

I'll take a closer look at the code in the next few days and see what I can do to improve it.

malb commented 7 years ago

Thanks, could you push your merged branch somewhere so I could work off that in my attempts to fix these?

malb commented 7 years ago

I think the right thing™ to do is to separate out the two use cases. In ebib you do not want any automatic replacements to be applied, since it's also an editor. In helm-bibtex I'd argue those are desired. Thus, the default should be to return the raw string with no replacement/modification whatsoever. When a hash table is provided, though, a fixed version of my replacements should be applied. Does that sound right?

malb commented 7 years ago

I've just pushed some chances. Running

(progn 
  (message "W/O %s" (parsebib-read-string 203))
  (message "W/O %s" (parsebib-read-string 246))
  (message "W/O %s" (parsebib-read-string 288))
  (message "W/O %s" (parsebib-read-string 330))
  (message "W/O %s" (parsebib-read-string 379))
  (message "W/O %s" (parsebib-read-string 424))

  (setq test-hash (make-hash-table :test #'equal))
  (puthash "ucsb-ece" "Springer, Heidelberg" test-hash)

  (message "W %s" (parsebib-read-string 203 test-hash))
  (message "W %s" (parsebib-read-string 246 test-hash))
  (message "W %s" (parsebib-read-string 288 test-hash))
  (message "W %s" (parsebib-read-string 337 test-hash))
  (message "W %s" (parsebib-read-string 379 test-hash))
  (message "W %s" (parsebib-read-string 424 test-hash)))

on

@string{ucsb-ece = "Springer, Heidelberg"}
@string{cryptopub =             ucsb-ece}
@string{cryptoname =            "CRYPTO" # "81"}
@string{cryptoname =            {CRYPTO}}
@string{publisher =    ucsb-ece#{, Germany}}
@string{note = {Note: }#SLL}

outputs

W/O (ucsb-ece . "Springer, Heidelberg")
W/O (cryptopub . ucsb-ece)
W/O (cryptoname . "CRYPTO" # "81")
W/O (cryptoname . {CRYPTO})
W/O (publisher . ucsb-ece#{, Germany})
W/O (note . {Note: }#SLL)
W (ucsb-ece . "Springer, Heidelberg")
W (cryptopub . "ucsb-ece")
W (cryptoname . "CRYPTO81")
W (cryptoname . "CRYPTO")
W (publisher . "ucsb-ece, Germany")
W (note . "Note: SLL")

This seems correct (?) In related news, as you can tell from my test code, my elisp sucks. Also, perhaps overloading the strings parameter with meaning like this will lead to confusion down the line?

joostkremers commented 7 years ago

the default should be to return the raw string with no replacement/modification whatsoever. When a hash table is provided, though, a fixed version of my replacements should be applied. Does that sound right?

Yup, I was thinking the same thing.

joostkremers commented 7 years ago

I pushed my changes to a new branch malb-master (now ineptly named, because there is much more than just your changes, but well). I have a few hours to kill, so I'll see what I can come up with.

joostkremers commented 7 years ago

Note, I meant these:

@string{publisher =    ucsb-ece#{, Germany}}
@string{note = {Note: }#SLL}

not as @strings but as field values in some entry. (Though it won't make that much of a difference when it's just for testing).

joostkremers commented 7 years ago

Also, perhaps overloading the strings parameter with meaning like this will lead to confusion down the line?

You mean using it to pass the string replacements and to signal the values can be expanded? I don't think so, because it doesn't make sense to pass a strings hash without wanting to do the expansions.

joostkremers commented 7 years ago

TBH, I don't think simply removing all braces is the way to go. It's possible to have e.g. \{ \} in a field value, and those should be retained, even in the expansion, I think. Your current code would remove them, but leave the backslashes. I admit that such cases will be rare, but I'd rather get it right.

Ebib has code to detect whether a given field value contains abbrevs, which I think is fairly reliable (it's in ebib-db.el, specifically ebib-db-unbraced-p), but doesn't split the different parts so that the abbrevs can be expanded.

joostkremers commented 7 years ago

I just pushed an update to malb-master that hopefully does the right thing. Seems to work fine when testing on your test file. Could you check it out as well and see if the results you get are what you expect? If yes, I can move the code to master and helm-bibtex can start making use of it.

malb commented 7 years ago

This is great. I've pushed one small bug fix to my clone which falls back to returning the string as is if all else fails. Otherwise, I was having trouble reading some fields such as authors in:

@InProceedings{ACISP:LMLLGY16,
  author =       "Min Cherng Lee and
                  Robin Mitra and
                  Emmanuel Lazaridis and
                  An Chow Lai and
                  Yong Kheng Goh and
                  Wun-She Yap",
  title =        "Statistical Disclosure Control for Data Privacy Using Sequence of Generalised Linear Models",
  pages =        "77--93",
  doi =          "10.1007/978-3-319-40253-6_5",
  crossref =     acisp16-1,
}

Note the line breaks in the authors field.

joostkremers commented 7 years ago

I've pushed one small bug fix to my clone which falls back to returning the string as is if all else fails.

Hmm. I missed that.

Note the line breaks in the authors field.

Yes, thanks for pointing that out. I changed it so that when string abbrevs are expanded, newlines in field values are removed and multiple spaces are reduced to one. That seems to make sense, because the field values are meant for display.

joostkremers commented 7 years ago

I merged the branch into master, so in a few hours it should be available for helm-bibtex to make use of.