cboulanger opened this issue 2 years ago

In my target literature, there are many references to court cases, regulations and laws. AnyStyle does not support these well, as there are no features, normalizers or formatters that cover the citation practices around such references. Is this something that should go into the AnyStyle core, or rather a case for writing a custom parser? In any case, could you give some suggestions on how to extend the parser with such functionality? If I understand the CSL specification correctly, I would need to output the "authority" (such as a court) and the "references" (for the case number etc.) CSL fields. To this end, I assume additional features should be added, such as weighting "X v. Y" highly as an indicator of a court case (in Germany, it would be a list of court abbreviations).
At what point are you running into problems specifically? I'm guessing you have a setup of marked-up references with the fields "authority" and "reference" tagged. Are the equivalent fields in the test dataset still not getting recognised?
I did some work with AnyStyle to recognise similar kinds of non-science documents (regulations, standards). It worked pretty well. Some things you can look at doing are making your own type normaliser (adding rules to recognise, for example, legislation) and/or adding features. For example, you might want to recognise commonly used abbreviations in the relevant field, similar to how the "journal name" feature recognises many common journals. If your material is legal and German, say, you might want to have a feature that recognises "BVerfG" for decisions of the Bundesverfassungsgericht.
@a-fent Thanks! Would you have some code examples of how you did it? My problem is always that I have to learn Ruby as I go, and even though I have a general grasp of how AnyStyle/Wapiti works, implementing a specific solution is still a challenge for me. Some snippets (or links to published code) showing how to implement and use my own features & normalizers would be incredibly helpful.
I would also like to do this for signal phrases such as "see also", "cf.", "on XXXX, see", "for ..., see", etc., which need to be discarded for reference parsing but could also contain useful information (agreement/disagreement) for later analysis.
To understand the code better: how/where in the code are features and labels connected? How would I set up a Feature and tell Wapiti that if I encounter "BVerfG" there is a high chance that this is an "authority"?
I guess what I want to say is that it would be great to have a hands-on documentation/tutorial on how to extend the current feature -> label -> normalizer workflow...
I'll have a go at explaining; if it helps, we could perhaps turn it into a FAQ. It's kind of advanced usage, tbh, partly because it's not very well documented in the original wapiti code that ruby-wapiti is based on.
AnyStyle labels each token (word separated by whitespace) based on its features. A feature is something like "how is this word capitalised?" or "is this word part of a known journal name?". Position in the whole string and the labelling of surrounding tokens are also taken into account.
Each feature of each token is observed - that is, it is assigned a particular value. An easy example is observing the capitalisation: https://github.com/inukshuk/anystyle/blob/master/lib/anystyle/feature/caps.rb . This feature of a token can be, for example, "initial" (capitalised first letter), all caps, lowercase, or something else.
A basic "court authority" feature might be something like:
```ruby
class IsCourt < AnyStyle::Feature::Dictionary
  def observe(token, alpha:, **opts)
    if token == "BVerfG"
      :court
    else
      :_ # not a court
    end
  end
end
```
You also need to tell Wapiti about this feature so that it's included in its model estimation when training. This is the bit that's not well documented - refer to AnyStyle's default pattern: https://github.com/inukshuk/anystyle/blob/master/lib/anystyle/support/parser.txt
I think of this (though I could be wrong) as reserving spaces for the feature values and setting how they are used in the CRF estimation. So for the IsCourt feature you might add lines at the end for the two possible values, :court and something else.
```
U:Crt-1 X=%x[ 0,20]
U:Crt-1 C=%x[ 0,20]/%x[ 0, 8]
U:Crt-2 X=%x[ 0,21]
U:Crt-2 C=%x[ 0,21]/%x[ 0, 8]
```
Save this in a new file, e.g. "my_pattern.txt". I'm afraid if I ever knew exactly how this worked, I've forgotten.
You need a training file that has examples showing the token "BVerfG" linked to the label "authority". When the model is trained, this will link the feature :court to the label authority. Note there is no particular reason you have to restrict yourself to the fields/labels used in CSL, BibTeX or whatever, other than if you want to use your parsed data in a particular way later.
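For illustration, a tagged reference in AnyStyle's XML training format might look like this (the particular reference, its segmentation, and the label names are made up; the element names are simply the labels you choose to train):

```xml
<dataset>
  <sequence>
    <authority>BVerfG,</authority>
    <date>Urteil vom 24.03.2021,</date>
    <number>2 BvR 1845/18</number>
  </sequence>
</dataset>
```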
You lastly need your own parser class. Some of this is just boilerplate:
```ruby
class MyParser < AnyStyle::Parser
  @formats = superclass.formats
  @defaults = superclass.defaults.update({
    pattern: "path/to/my_pattern.txt"
  })

  def initialize(*args)
    super
    # Add your new feature
    @features.push IsCourt.new
    # Then, if you want to use your own type classifier (see below):
    normalizers.delete_if { |norm| norm.kind_of? AnyStyle::Normalizer::Type }
    normalizers.push(Classifier.new)
  end
end
```
You can then use this parser as you would AnyStyle.parser (call #train, #parse etc).
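A hypothetical usage sketch (the file name is a placeholder; loading the training file via `Wapiti::Dataset.open` is my assumption based on the wapiti-ruby API):

```ruby
require 'anystyle'

parser = MyParser.new
# Train on your own annotated references
parser.train(Wapiti::Dataset.open('my-training-data.xml'))
# Then parse unseen references as usual
pp parser.parse('BVerfG, Urteil vom 24.03.2021, 2 BvR 1845/18')
```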
The assignment of a citation to a particular type of document (e.g. journal article, book, PhD thesis) is done after labelling. It is done heuristically based on the presence or absence of particular fields (e.g. a journal name, a publisher) and the values of fields: https://github.com/inukshuk/anystyle/blob/master/lib/anystyle/normalizer/type.rb
You might, for example, want to recognise additional types, such as "statute" or "case" (those look to be relevant types in Zotero). You might say that anything with a "court" label is a case and anything with a "statute name" field is a law. Your custom type normaliser would then have lines like:
```ruby
when keys.include? :court # this item has a token labelled "court"
  'case'
when keys.include? :'statute-name'
  'statute'
```
You'll need to tell your customised Parser class to use your own type normaliser instead of AnyStyle's default - see above.
@a-fent Thank you so much, this is incredibly helpful! Maybe @inukshuk can weigh in on the wapiti pattern issue in the .txt file? I am not sure I fully understand the information that needs to be entered there.
There is some information about patterns in the original wapiti codebase here: https://github.com/Jekub/Wapiti/blob/master/src/pattern.c . This may or may not be illuminating for you.
It also bears mentioning that the fact that such customisations are possible and relatively easy with AnyStyle is down to the clever and elegant design of the software by @inukshuk
I totally agree that AnyStyle has an excellent design - I come from another library which was really hard to work with, and I really appreciate it! So the only missing piece of the puzzle is how to translate what I want to add in Ruby into the Wapiti pattern, since from looking at parser.txt it isn't clear to me at all how the feature selection maps onto this pattern. Maybe Sylvester can tell us more about the details. Does the order or the naming of the patterns matter? What do the different parts of the pattern mean? I.e., in `U:Tok-1 X=%x[ 0,0]`, what does "U" signify vs. "*"? How do I determine the choice of "X" vs. something else, or the number of the "column" 0?
As they say, if everything else fails, look at the manual. Here's something from the Wapiti docs (somewhat reformatted):
Pattern files are almost compatible with CRF++ templates. Empty lines as well as all characters appearing after a `#` are discarded. The remaining lines are interpreted as patterns.

The first char of a pattern must be either `u`, `b` or `*` (in upper or lower case). This indicates the type of features that will be generated from this pattern: respectively unigrams, bigrams, and both.

The remaining part of the pattern is used to build an observation string. Each marker of the kind `%x[off,col]` is replaced by the token in the column `col` from the data file at the current position plus the offset `off`.

The `off` value can be prefixed with an `@` to make it an absolute position from the start of the sequence (if positive) or from the end (if negative). An offset of `@1` will thus refer to the first symbol of the current sequence and `@-1` to the last one.

For example, if your data is:

```
a1 b1 c1
a2 b2 c2
a3 b3 c3
```

the pattern `u:%x[-1,0]/%x[+1,2]` applied at position 2 in the sequence will produce the observation `u:a1/c3`.

Note that sequences are implicitly padded with special tokens such as `_X-1` or `_X+2` in order to allow markers with arbitrary offsets at any position in the sequence. This means, for instance, that `_X-1` denotes the left context of the first token in a sequence.

Wapiti also supports a simple kind of matching, which can be useful, for example, in natural-language-processing applications. This is done using two other commands of the form `%m[off,col,"regexp"]` and `%t[off,col,"regexp"]`. Both commands get data the same way as the `%x` command, using the `col` and `off` values, but apply a regular expression to it before substituting it. The `%t` command will replace the data by `true` or `false` depending on whether the expression matches the data or not. The `%m` command replaces the data by the substring matched by the expression.

The regular expressions implemented are just a subset of the classical regular expressions found on classical Unix systems, but are generally enough for most tasks. The recognized subset is quite simple. First, for matching characters:

```
.  -> match any character
\x -> match a character class (in uppercase, match the complement)
      \d : digit      \a : alpha      \w : alpha + digit
      \l : lowercase  \u : uppercase  \p : punctuation
      \s : space
      or escape a character
x  -> any other character matches itself
```

And the constructs:

```
^ -> at the beginning of the regexp, anchor it at start of string
$ -> at the end of the regexp, anchor it at end of string
* -> match any number of repetitions of the previous character
? -> optionally match the previous character
```

So, for example, the regexp `^.?.?.?.?` will match a prefix of at most four characters, and `^\u\u*$` will match only on data composed solely of uppercase characters.

For the commands `%x`, `%t`, and `%m`, if the command name is given in uppercase, the case is removed from the string before being added to the observation.
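To make that concrete, here's a made-up, untested pattern line combining the commands above; it would generate a unigram feature whose observation is `true` exactly when the current token consists solely of uppercase letters:

```
u:allcaps=%t[0,0,"^\u\u*$"]
```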
So I gather that AnyStyle only seems to use token extraction (`%x`) and mostly unigrams. But the rest of how to compose the pattern is still a mystery to me.
Thanks @a-fent for the write-up above! If I remember correctly, when we re-designed the parser the last time, we tried to make it possible to use it with different patterns/features even without sub-classing -- though I'm not completely sure we succeeded in this. In any case, I think you could even use the parser as is and remove, add, or manipulate the default features and normalizers. At least that was my intention for cases where you want to make some small adjustments. For bigger changes, sub-classing is obviously still the best option.
I think there is a lot that can still be done to improve individual normalizers; it's also fairly trivial to add more normalizers to AnyStyle -- either to the default configuration or as optional ones. Adding more labels or even features is more problematic, because some care must be taken that doing so does not yield worse parse results for the current set of supported references. But thanks to the gold set I feel like we have a fairly good setup in place to protect us from bad regressions.
Wapiti's pattern files are a little cryptic, I agree. If I remember correctly, I decided to use only `%x` because classifying text is so much easier in Ruby than in C/wapiti's other pattern commands. So my approach was to do all the classification or pattern-recognition work in the `Feature` classes in Ruby, in order to keep the pattern files simpler to write.
@cboulanger I suggest looking at some simpler pattern files to get a better understanding of them. You can find some examples here. At a very high level, what you need to understand is only that wapiti takes a kind of tabular input. Each line is a token (the first word) followed by a fixed number of 'feature words' and a final label (this label is used for training; later on it is the thing that will be predicted by the model). The pattern file is a way to give wapiti instructions on how to interpret this input (the feature words). You can use the pattern file to extract a lot of information even from very simple inputs -- e.g. the most simple input would be just the token word itself. AnyStyle's approach is to analyze the tokens in Ruby; it's basically a pre-processor to compile the tabular input for wapiti.
While each token is a line, successive lines form a sequence. I basically wrote wapiti-ruby to make working with these inputs easier (you have `Token` and `Sequence` classes and also `DataSet`, which is a set of sequences -- if I remember correctly that's our own abstraction and not used by wapiti).
> AnyStyle's approach is to analyze the tokens in Ruby; it's basically a pre-processor to compile the tabular input for wapiti.
Are you saying that the `pattern.txt` file is generated by AnyStyle, so that I don't have to actually understand how to write one? That would be ideal, of course! However, I understood Alex to say that one needs to compose it manually to correspond to the Ruby feature classes used.
No, the pattern file isn't generated, but you need to write it by hand only if you want to add a new feature to the model. I would think that something like `BVerfG` should be very easily trainable using the current model. If you have a long list of court abbreviations it might make sense to add a dictionary for them, but otherwise I don't think you profit a lot from adding a new feature for it. The reason is that the word itself is also a feature, so if you have some occurrences of `BVerfG` in your dataset and consistently label it as 'court', I would think that the model will gain a very robust understanding of that link.
That said, suppose you don't have a dictionary of known court abbreviations, but you know that it is extremely common for them to use abbreviations with mixed capitalisation. Using our caps feature, they would all get classified as 'other' -- maybe that feature could be extended with additional patterns that would help distinguish such court abbreviations. Then again, some training material that links these abbreviated court names to the court label should be enough.
The way I would try to think about it is this: when you look at the reference, what information lets you know how to classify a given word? Then print out the set of feature information that AnyStyle currently creates for that word. If the salient information can be inferred from those features, then the model should easily learn it if you feed it some consistently labelled data.
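For example, using the parser's `prepare` method to dump the feature observations for a reference (the sample string here is made up):

```ruby
require 'anystyle'

# Print the feature observations AnyStyle generates for each token
puts AnyStyle.parser.prepare('BVerfG, Urteil vom 24.03.2021, 2 BvR 1845/18.').to_s
```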
I don't mean to discourage adding new features; I just think, in general, that the model is easier to understand and reason about if there are fewer features (and also labels) -- in fact, I suspect we already have more features than necessary, though I have no hard evidence to back up this suspicion.
Thanks for that. Actually, the simpler the solution, the better, and if I don't need to add a feature I'll gladly omit that step! I was simply thinking that this was the thing to do. So in order to catch all courts, I can also generate a list of courts with the label and use this synthetic training material to tell AnyStyle about them. I could also generate synthetic training material for court decisions, legal codes, and law journals, and just include it in the training material. So all that would be left to do is to add a categorizer to translate the labels into CSL fields/types. Is this correct?
Well, @a-fent's assessment will be more on point here, since he has already worked with similar data. I'd definitely start with the current set of features; it might be necessary to add one or two new labels (like authority -- I think we don't use that one yet), but you can do this simply by supplying training data.
Normalizers should be easy to modify or add so that the end result includes the necessary CSL fields -- if they are general purpose we can add them here, but they're easy to add to your own setup as well. Similarly, the type classifier can be amended quite easily.
I'd explore adding new features only if the results from the labeling phase are inferior even when supplying sufficient training data.
Yes, definitely look to use training before messing around with features. Train with real, or at least realistic, data - i.e. full citations, not lists of words. Make a test set of marked-up citations that the Parser isn't trained on so you can track regressions.
If the set of entities you're interested in (e.g. courts, laws, cases) is fairly small and they have distinctive identities (e.g. `BVerfG`), they will be picked up quickly by the word-literal feature @inukshuk mentions.
Adding a feature was worth it for me because I had (1) a large set of relevant entities with (2) names that were prone to confusion with other labels and (3) messy data with mixed data types and inconsistent citation formats. A dictionary-type feature (like journal, place) is probably only worth it if you have hundreds or more entities.
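For what it's worth, here's a minimal sketch of what such a dictionary-type feature could look like; the class name, word list, and return values are all made up, and the `observe` signature simply mirrors the `IsCourt` example above:

```ruby
class CourtList < AnyStyle::Feature
  # Tiny sample; a real list would have hundreds of entries
  COURTS = %w[BVerfG BGH BVerwG BFH BAG BSG].freeze

  def observe(token, alpha:, **opts)
    # alpha is assumed to be the token stripped of punctuation
    COURTS.any? { |c| c.casecmp?(alpha) } ? :court : :none
  end
end
```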
Since the technical questions have been discussed in this issue, I am continuing on from the issue on signal words here - this is not so much about legal citations and authorities as such (because I haven't gotten to this part yet), but about training for recognizing the introductory signal words and phrases mentioned in the other issue.
To recap, what I want to achieve is that AnyStyle recognizes these phrases (see examples) and labels them so that they won't be labelled as part of the reference, and so that they can also serve as an indicator of where two references in the same line can be separated. I am inclined to think that a custom dictionary feature performs better here, because these phrases are almost always a very strong indication of the label, and training (with synthetic data) hasn't been successful so far. Of course, since it is not about single words only, but about whole phrases (`<signal>For a detailed account of ..., see, for example,</signal><author>John Doe</author> ...`), training with a lot of examples is still necessary.
If I want to test whether training using the existing features or a new custom dictionary feature performs better, there are two things left that I haven't fully understood yet.
The first one is just a clarification: I assume it is not possible to add a generic feature that would allow associating a list of words with a particular label, since each feature (=> label) requires its own column for Wapiti.
If this is so, I don't understand yet how the column number in a new pattern that I add to `my-custom-parser.txt` is connected with my custom `Feature` subclass. For instance, in your example you use columns 20 and 21 - where do these numbers come from (and what does "8" refer to)? And does the name "Crt" matter? I don't see any mention of these names in the Feature classes - i.e., how do the Ruby `Feature`s know which Wapiti values to use, and vice versa?
```
U:Crt-1 X=%x[ 0,20]
U:Crt-1 C=%x[ 0,20]/%x[ 0, 8]
U:Crt-2 X=%x[ 0,21]
U:Crt-2 C=%x[ 0,21]/%x[ 0, 8]
```
@a-fent I don't know if your code is open source and published, but if so, it would probably be easiest if you could just point me to it.
I think you're putting too much hope in a dictionary feature for this. It's helpful to look at the data that AnyStyle prepares for wapiti -- this is also what the columns in the pattern refer to.
For example:
```ruby
require 'anystyle'
AnyStyle.parser.prepare('Vgl. John Doe, 2022')
```
This returns the dataset including all the feature observations. You can inspect it; e.g., to look at the `Vgl.` token specifically, you could access it with array syntax `[0][0]` (first sequence, first token). But you can also print the entire dataset:
```ruby
puts AnyStyle.parser.prepare('Vgl. John Doe, 2022.').to_s
```

```
Vgl. vgl Lu P V Vg . l. initial none F F F F none first period none strong F
John john Lu Ll J Jo n hn initial none T T T T none 3 none none none F
Doe, doe Lu P D Do , e, initial none T F F F none 5 other none weak F
2022. 2022 N P 2 20 . 2. other year F F F F none last period none strong F
```
Adding a dictionary for some of the words you're concerned with here would add one more column, which you can then reference in the pattern file (see the sketch below). One benefit of the pattern file is that you can relate multiple observations of the same token and, importantly, also neighbouring tokens. Still, I don't really see how adding a dictionary for these signal words would be that helpful.
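For illustration (the column number and names are made up), pattern lines referencing a hypothetical new column 20 for both the current token and its left neighbour might look like:

```
U:Sig-1 X=%x[ 0,20]
U:Sig-2 X=%x[-1,20]/%x[ 0,20]
```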
Hi, thanks. Of course I trust your judgement on that. Maybe I just need more manually annotated material. The synthetic one simply has a random sample of the signal words at the beginning and between references, so that might be the problem.
Ok, I put some more love into the parser annotations. You can process some particularly nasty footnotes here.
If you select "Model" -> "footnotes", then "Parse/Segment", then "Parse/Segment" -> "Auto-tag ...", you get:
Of course, this is the result of training on itself, so it is not new, unseen data - which has performed much worse. But I hope it will get better with more annotations.