plk / biber

Backend processor for BibLaTeX
Artistic License 2.0

Match-Replace Dictionary in Sourcemapping #375

Closed dollodart closed 2 years ago

dollodart commented 3 years ago

Suppose that, for sorting purposes, you wish to map some entry field to integer values, and there is a large number of possible inputs. The match-and-replace method would require something like the following in a sourcemap:

\step[fieldsource=sfield]
\step[fieldset=ifield, origfieldval] % copy string field to integer field
\step[fieldsource=ifield,match={string1},replace=1]
\step[fieldsource=ifield,match={string2},replace=1]
\step[fieldsource=ifield,match={string3},replace=2]
\step[fieldsource=ifield,match={string4},replace=2]
\step[fieldsource=ifield,match={string5},replace=2]
\step[fieldsource=ifield,match={string6},replace=3]

This is at least inelegant, if not inefficient: depending on how Biber implements it, it could test equality against every match pattern, which is O(n), rather than using a hash lookup, which is O(1). The API in biblatex could be replaced by

\step[fieldsource=sfield]
\step[fieldset=ifield, origfieldval] % copy string field to integer field
\step[fieldsource=ifield, matchkeys={string1,string2,string3,string4,string5,string6},replacevalues={1,1,2,2,2,3}]

Although this changes the biblatex API, the dynamic data modification is done by Biber, so this issue is filed in the Biber repo. The efficiency gain applies only when the match keys are exact, since if they are regular expression patterns each pattern has to be tested against the input string in turn. In many applications string1, string2, and so on may not be literal strings but locale strings, e.g., string1 -> "String 1" or "Saite 1" or so on. This is only relevant if the strings are expanded before the sourcemapping is applied.
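To illustrate the complexity argument outside of Biber (this is a sketch, not Biber's code, and it assumes the match keys are exact rather than regex patterns): one step per key means a linear scan of every pattern, while a dictionary resolves each value in a single lookup.

```python
import re

# One (pattern, replacement) pair per \step, scanned in order: O(n) tests per value.
steps = [("string1", "1"), ("string2", "1"), ("string3", "2"),
         ("string4", "2"), ("string5", "2"), ("string6", "3")]

def replace_linear(value):
    for pattern, repl in steps:
        if re.fullmatch(pattern, value):
            return repl
    return value  # unchanged if nothing matches

# The proposed dictionary form: O(1) average-case lookup per value.
table = dict(steps)

def replace_hashed(value):
    return table.get(value, value)
```

The dictionary only wins when the keys are exact strings; regex patterns still force the linear scan.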

plk commented 3 years ago

I do see your point here, but this seems to be quite a niche requirement: you would also have to change the datamodel to make sure that the resulting field is sorted as an integer and not as a string. Do you have a specific use case for this sort of thing? I ask to see whether there is some way to do it without implementing new functionality for a very specific case.

dollodart commented 3 years ago

My application is as follows. In legal citations it's common to sort by court supremacy. Richard Posner, in The Bluebook Blues, quotes the citation guide: "The usual citation order is Supreme Court, Seventh Circuit, other circuits, state supreme courts, other state appellate courts, federal and state trial courts." In this case the map would be something like supreme->1, 7circuit->2, 1circuit->3, 2circuit->3, 3circuit->3, 4circuit->3, 5circuit->3, 6circuit->3, 8circuit->3, 9circuit->3, ..., delawaresupreme->4, pennsylvaniasupreme->4, .... You can look at the Google Scholar case law homepage to see how many courts there are, and those are only the highest appeals courts in the state and federal jurisdictions. The usual citation order can be overridden on a per-entry basis simply by assigning the integer field in the entry and setting overwrite=false in the sourcemap.
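With current biblatex syntax this could be written as a preamble sourcemap, one step per court. This is a sketch: the field names court and courtrank are hypothetical, and courtrank would also need to be declared as an integer field in the datamodel to sort numerically.

```latex
\DeclareSourcemap{
  \maps[datatype=bibtex]{
    \map[overwrite=false]{% a courtrank set in the entry takes precedence
      \step[fieldsource=court]
      \step[fieldset=courtrank, origfieldval]% copy court key into rank field
      \step[fieldsource=courtrank, match={\regexp{\Asupreme\z}},  replace={1}]
      \step[fieldsource=courtrank, match={\regexp{\A7circuit\z}}, replace={2}]
      \step[fieldsource=courtrank, match={\regexp{\A1circuit\z}}, replace={3}]
      % ... one step per remaining circuit and state court
    }
  }
}
```

The \regexp wrapper with \A...\z anchors keeps each match exact, so a court key already rewritten to "1" is not accidentally matched by a later step.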

However, I don't think this is a niche application. The question is how to sort on an arbitrary total order, that is, on something other than the string's own character or numeric properties, such as alphabetical comparison (a > b > c ...) or numeric comparison (1 < 2 < 3 ...). Someone might want to sort their articles by journal prestige, for example: nature->1, science->1, journalamericanchemicalsociety->2, physicalreviewsb->2, and so on.

plk commented 3 years ago

This is implemented - please try biblatex 3.17 and biber 2.17 from the development folders on Sourceforge. The syntax is:

\step[fieldsource=ifield, matches={string1,string2,string3,string4,string5,string6},replace={1,1,2,2,2,3}]

This is documented in both the biblatex and biber manuals. There is a case-insensitive variant, matchesi. The match and replace lists (comma-separated values) must be the same length.
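For completeness, here is a sketch of the new syntax embedded in a full sourcemap (the field names sfield and ifield are illustrative, as in the earlier examples, not actual biblatex fields):

```latex
\DeclareSourcemap{
  \maps[datatype=bibtex]{
    \map{
      \step[fieldsource=sfield]
      \step[fieldset=ifield, origfieldval]% copy string field to integer field
      \step[fieldsource=ifield,
            matches={string1,string2,string3,string4,string5,string6},
            replace={1,1,2,2,2,3}]
    }
  }
}
```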

dollodart commented 3 years ago

You have a for loop that iterates through the matches in lib/Biber/Input/file/biblatexml.pm. A Perl hash, as described at https://perlmaven.com/perl-hashes, would in principle be better, even if in practice the O(n) -> O(1) access would change Biber's running time negligibly. However, at the point where the for loop runs there is only one fieldval, from the current etarget (entry) being processed, which means the hash would have to be built outside the loop to pay off (constructing the hash table is itself O(n), from n O(1) insertions). So given the existing program structure the loop may be optimal.

I tried to make an MWE, but on Sourceforge biber-linux_x86_64.tar.gz is 24.6 MB, whereas the TeX Live biber (installed at /usr/bin/biber) is 36 KB. The uncompressed Sourceforge development biber is 17,371,510 characters, versus 33,557 characters for /usr/bin/biber. I can't even run biber --help on the development version of biber.

plk commented 3 years ago

lib/Biber/Input/file/biblatexml.pm isn't relevant to your use case - that's for BibLaTeXML sources. After compilation, the for loop you mention is equivalent to a hash, and we're talking about tight loops over a few items - a negligible performance difference at best, and the loop makes it more obvious what's happening.

The size difference you are seeing is because /usr/bin/biber is not really the executable - it's just a PAR wrapper that calls the real executable from a cache location, where it is unpacked on first run. You will be fine if you just drop the dev biber over /usr/bin/biber.