cboulanger opened 2 years ago
Trying to train a model with this gold, I am getting
INFO [2022-08-11 18:37:56 +0200] wapiti: load patterns
INFO [2022-08-11 18:37:57 +0200] wapiti: initialize model
INFO [2022-08-11 18:37:57 +0200] wapiti: nb train: 1865
INFO [2022-08-11 18:37:57 +0200] wapiti: nb labels: 13
INFO [2022-08-11 18:37:57 +0200] wapiti: nb blocks: 97424
INFO [2022-08-11 18:37:57 +0200] wapiti: nb features: 1274624
INFO [2022-08-11 18:37:57 +0200] wapiti: training model with l-bfgs
ruby: vmath.c:281: xvm_expma: Assertion `r != NULL && ((uintptr_t)r % 16) == 0' failed.
Another question: for training, where should the token "in: " go, as in:
<sequence>
<author>N. Dimmel: </author>
<title>Armutspotential zwischen Nichtinanspruchnahmeund Repression, </title>
<editor>in: R. Teichmann (Hrsg.): </editor>
<container-title>Sozialhilfe in Österreich, Wien </container-title>
<date>1989</date>
</sequence>
<sequence>
<author>V. Gessner: </author>
<title>Rechtssoziologie und Rechtspraxis. Zur Rezeption empirischer Rechtsforschung, </title>
<journal>in: Soziale Welt </journal>
<volume>35 (</volume>
<date>1984)</date>
</sequence>
I assume it belongs in <editor> and <journal>, and not as a suffix to the <title> - but please let me know if that's a wrong assumption. Will it be removed by the normalizers?
I posted the current version (cleanup is still ongoing) to a gist: https://gist.github.com/cboulanger/9417648552d775d523d6961d575bc555
Yes, 'in' should definitely go with editors (it's a good marker!). The editor normalizer will strip it off. I'm not sure I've seen it often in the context of journals but we'd obviously follow the same approach there (would have to check if the journal normalizer already strips it though).
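For a quick sanity check of that normalizer behavior you could parse one of the footnotes above with a trained model and inspect the output. This is only a minimal sketch, assuming the default parser has been trained on your data:

require 'anystyle'

# One of the footnotes from above; if the editor normalizer strips the
# marker as described, "in:" should not show up in the parsed editor field.
ref = 'N. Dimmel: Armutspotential zwischen Nichtinanspruchnahme und Repression, ' \
      'in: R. Teichmann (Hrsg.): Sozialhilfe in Österreich, Wien 1989'

pp AnyStyle.parse(ref)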
Any idea about the "ruby: vmath.c:281: xvm_expma: Assertion 'r != NULL && ((uintptr_t)r % 16) == 0' failed." error?
Maybe an empty tag somewhere?
Is there a chance you could try to train a parser model with https://gist.github.com/cboulanger/9417648552d775d523d6961d575bc555 to see if you get the error as well or if it is just my setup?
Trying to train a model with this gold, I am getting
INFO [2022-08-11 18:37:56 +0200] wapiti: load patterns
INFO [2022-08-11 18:37:57 +0200] wapiti: initialize model
INFO [2022-08-11 18:37:57 +0200] wapiti: nb train: 1865
INFO [2022-08-11 18:37:57 +0200] wapiti: nb labels: 13
INFO [2022-08-11 18:37:57 +0200] wapiti: nb blocks: 97424
INFO [2022-08-11 18:37:57 +0200] wapiti: nb features: 1274624
INFO [2022-08-11 18:37:57 +0200] wapiti: training model with l-bfgs
ruby: vmath.c:281: xvm_expma: Assertion `r != NULL && ((uintptr_t)r % 16) == 0' failed.
Any idea how I could debug this? I was trying to get an extended stack trace but to no avail. It would be so nice if I could get these two new xml training docs (1, 2) working with anystyle.
Looking only at the first of the linked datasets above, there are a few issues that cause wapiti to bail out. If you want to debug the native module you need to attach gdb; however, if a NULL assertion fails it's almost always because there is an empty tag somewhere. In your dataset there are two empty <sequence/> tags, and the file also includes two <dataset> elements, which is not supported.
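If it helps, here's a quick way to catch both problems before training. This is just a sketch using Nokogiri; the file name is a placeholder, and the strict parse will raise on the second <dataset> root because that makes the file ill-formed XML:

require 'nokogiri'

# Strict parsing raises Nokogiri::XML::SyntaxError on ill-formed input,
# e.g. a second root element or a stray XML declaration in the middle.
doc = Nokogiri::XML(File.read('zfrsoz-footnotes.xml')) { |config| config.strict }

# Report empty <sequence/> elements, which trip up wapiti.
doc.xpath('//sequence').each_with_index do |seq, i|
  warn "empty <sequence/> at position #{i + 1}" if seq.element_children.empty?
end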
Here's a diff to fix the first dataset:
*** /home/dupin/Downloads/zfrsoz-footnotes.xml 2022-08-17 11:05:56.104535376 +0200
--- zfrsoz-footnotes.xml 2022-08-17 11:36:27.720096975 +0200
***************
*** 6290,6296 ****
<note>Mainz</note>
<date>1982</date>
</sequence>
- <sequence/>
<sequence>
<author>Ministerium für Arbeit, Gesundheit und Sozialordnung:</author>
<title>Die Situation der Frau in Baden-Württemberg,</title>
--- 6290,6295 ----
***************
*** 12850,12857 ****
<volume>23/März</volume>
<date>1990</date>
</sequence>
- </dataset><?xml version='1.0' encoding='UTF-8'?>
- <dataset>
<sequence>
<editor>Armer/Grimshaw (Hrsg.), </editor>
<title>Comparative Social Research Methodological Problems and Strategies (New York, London, Sydney, Tokio </title>
--- 12849,12854 ----
***************
*** 19142,19148 ****
<note>Mainz </note>
<date>1982</date>
</sequence>
- <sequence/>
<sequence>
<author>Ministerium für Arbeit, Gesundheit und Sozialordnung: </author>
<title>Die Situation der Frau in Baden-Württemberg, </title>
--- 19139,19144 ----
***************
*** 25702,25705 ****
<volume>23/März </volume>
<date>1990</date>
</sequence>
! </dataset>
\ No newline at end of file
--- 25698,25701 ----
<volume>23/März </volume>
<date>1990</date>
</sequence>
! </dataset>
As a general observation, those datasets are very large. It's my feeling that it's better to have a smaller set with fewer inconsistencies than a larger set with more errors, though I don't have hard evidence to back this up. Smaller datasets also make for quicker training, so that's definitely a point in favor of a smaller model. What I'd suggest doing if you have such large sets is to train on only a small subset first, then use that model to check the rest of the data. If there's a high error rate, I'd make the training set larger. Once the error rate is low, I'd pick out only those sequences that produce errors and add them to the training set (or review them first, because errors often point to inconsistencies in the marked-up data).
Finally, as a general tip, you can usually spot errors in large datasets quickly by using a binary search approach: keep training with one half of the dataset until there's no error. This way you can usually narrow the faulty section down to a small set that's easy to review.
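A rough sketch of that splitting step, assuming a well-formed <dataset> file and using Nokogiri (the file names are placeholders; train and check each half with whatever workflow you normally use):

require 'nokogiri'

seqs = Nokogiri::XML(File.read('large-dataset.xml')).xpath('//sequence').to_a
half = seqs.length / 2

{ 'first-half.xml' => seqs[0...half], 'second-half.xml' => seqs[half..] }.each do |name, part|
  # Wrap each half in its own <dataset> root so it can be trained on separately.
  File.write(name, "<dataset>\n#{part.map(&:to_xml).join("\n")}\n</dataset>\n")
end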
Thanks so much for looking into it, and I'm embarrassed that the XML contained junk - I did check for empty tags (but not on the <sequence> node), and I did try to validate, but I must have used the wrong tool for it! Maybe in some future version a validation step could be added that immediately raises an error on invalid XML.
I'll break up the large xml into smaller parts based on the discipline (there's computer science, natural sciences, and social sciences in it), which might allow some interesting tests of the performance of a domain-specific vs. general-purpose dataset.
The multiple root problem was actually a copy/paste error when uploading the data as a gist, sorry. But removing the empty <sequence/> node and splitting up the big XML into three smaller ones did the trick! Thank you very much. All models are now trained!
I've put the individual parser training files in here:
I've put a lot of work into cleaning up and fixing the annotations, throwing out a large number of sequences which were poorly annotated. So at least in theory, the annotations should be of fairly high quality.
Ok, the performance of this material, at least measured against gold.xml, isn't that great:
Model file test/models/parser-excite-computer-science.mod:
Checking gold.xml.................1252 seq 75.01% 5524 tok 15.26% 4s
Checking excite-computer-science.x 54 seq 1.48% 127 tok 0.13% 11s
Model file test/models/parser-excite-natural-science.mod:
Checking gold.xml.................1275 seq 76.39% 5958 tok 16.46% 4s
Checking excite-natural-science.xm 6 seq 0.79% 20 tok 0.09% 2s
Model file test/models/parser-excite-social-science.mod:
Checking gold.xml................. 945 seq 56.62% 3437 tok 9.49% 4s
Checking excite-social-science.xml 139 seq 2.82% 271 tok 0.26% 12s
Model file test/models/parser-zfrsoz-footnotes.mod:
Checking gold.xml.................1620 seq 97.06% 9073 tok 25.06% 4s
Checking zfrsoz-footnotes.xml..... 113 seq 5.97% 232 tok 0.84% 3s
The consistency of the annotations seems to be quite good, as seen when the model is checked against its own training material.
Well, those datasets differ considerably from the data in gold.xml, so I wouldn't expect them to match it very well. I'd definitely check out the inconsistencies (by creating a delta dataset), because you have a few hundred inconsistently labeled references there. For comparison, between gold.xml and core.xml we usually have only a handful (and those are often difficult cases like container-title vs journal or director vs editor etc.).
That said, if you're looking for a combined model that gives good results for both datasets, I'd add somewhere between 50 and 250 footnote references (aiming for a representative sample, of course) to the core set and use that to train the model, adding more footnote references as necessary.
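For instance, something along these lines could draw such a sample. This is only a sketch: the file names and the sample size are placeholders, and a random sample is no substitute for hand-picking a genuinely representative subset:

require 'nokogiri'

SAMPLE_SIZE = 150  # somewhere in the 50-250 range suggested above

core      = Nokogiri::XML(File.read('core.xml'))
footnotes = Nokogiri::XML(File.read('zfrsoz-footnotes.xml')).xpath('//sequence').to_a

# Append a random sample of footnote sequences to the core dataset.
footnotes.sample(SAMPLE_SIZE, random: Random.new(42)).each do |seq|
  core.root.add_child(seq.to_xml)
end

File.write('core-plus-footnotes.xml', core.to_xml)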
Here is some more Parser gold which needs some more love, because the source references are VERY messy and therefore the manual annotations were not always correct. I did quite a bit of manual correction after converting it from the EXparser format:
zfrsoz-footnotes-corrected.xml.txt
If you spot any obvious mislabelings that could confuse the parser, please let me know. I am happy to repost the material after some more cleaning & correcting.
But here's my question: in German footnote references (and also sometimes in bibliographies), it is common to use backreferences to the previous footnote in the form of "ders." (the same author, male) or "dies." (the same author, female). In bibliographies, this sometimes appears in the form of "__". Or the previously cited work is referred to with "op. cit.", "a.a.O.", etc.
Do you have any opinion on whether/how AnyStyle could handle these cases - or should it be left to the postprocessing of the CSL data?