kermitt2 opened this issue 6 years ago
Thanks Patrice. What we did was have multiple coders earlier and I found better agreement than this (I don't have the numbers at hand at present). But I wouldn't be too surprised if things have drifted out of sync. Another issue is that, coming from qualitative research, I'm used to inter-coder reliability of .7 and .8 kappa (although I also used straight percent agreement). I truly wonder whether any manual coders can be trained to achieve 95-99% agreement (or rather I wonder what needs to be done to accomplish those published figures; I suspect there is a lot of "talked to agreement" going on). Do you think that those NER figures represent "2 trained coders working independently on items"? Or do they represent "2 trained coders agreeing after discussing items on which they disagreed"?
It sounds like the most important thing now is to have a second coder annotate all of the single coded articles, then recalculate, update scheme, and have a third coder annotate those. In your opinion how important is it to have the second coder look at the whole article?
Is it possible that you could start to put your code into the repo so that I can look over it as we move this forward?
I am going to add kappa and other measures (handling more than 2 annotators, correcting for chance agreement, etc.).
I am not a specialist of corpus annotation, but I think the methodology for reaching > .90 agreement does normally involve agreement talks ("reconciliation"), at least in the first stage of the annotation process. Either "reconciliation" takes place until the annotators reach a certain annotation quality and the annotation guidelines are improved iteratively (and then we go on to annotate a larger corpus), or reconciliation is maintained all the time. There are quite a lot of papers discussing the development of annotated corpora, and some of them recommend "reconciliation" only at the beginning, to avoid annotator "overtraining".
In NLP, OntoNotes is famous for its 90% IAA policy on a very large annotated corpus (2M words for the part usable for NER). Smaller corpora often report something higher. This article discusses the methodology for biomedical entities (and relations) in quite some detail. The BioCreative competitions are also well known in NLP for producing high-quality corpora for text mining applications.
I would say that before going to large-scale human annotation, it's good to look at the current mismatches between annotators and strengthen the guidelines. Just looking superficially, I saw some recurrent inconsistencies (programming languages, databases) and a lot of small inconsistencies in word boundaries (creator including the address or not, the affiliation or not, "version" words in or out of the annotation, etc.). I am not able to prioritize them at this stage, but I am working on it and I think all of this accounts for a large part of the mismatches.
All my code is in the impactstory repo https://github.com/Impactstory/software-mentions Everything is work in progress :) Current work on cross-agreement is here. Given that I need to consider the representation and features of the whole PDF to really make use of the dataset, I need to stick with a GROBID and Java pipeline.
Thanks for all of those, Patrice. We'll get it improved ASAP. We did do agreement and improvement, but clearly not sufficient!
One thing I didn't say is that we did re-training and clarified the coding scheme after those double coded articles, but we ended up not doing more double coding to check that things had sufficiently improved. That was in part because I was getting advice that full agreement was less important than quantity of coded examples. Seems that that was wrong (or at least applicable to different ML techniques). So hopefully we would have had better agreement across the later coded articles.
But we should be checking that so I have added new tasks to the queue to get multiple coders for a few hundred articles, starting with those most recently coded. Hopefully that will enable coders to discuss their coding. I think that's going to take a few weeks but then we'll know better where we are.
@kermitt2 Do you think it is valid to do agreement coding by having the second coder look at a set of pages, rather than the whole article?
Set A: pages on which coder 1 found mentions.
Set B: randomly selected pages on which coder 1 did not find mentions.
Perhaps Set A expanded to the page before and after each mention? Perhaps Set B doubled in size, or tripled on articles where coder 1 didn't find any mentions?
Definitely present the pages intermingled and without any indication of whether a page is in Set A or Set B.
I'm trying to reduce the effort of reading all the pages. The trade-off seems highest on missing coder 1's false negatives: is coding a single random set of allegedly non-mention pages valuable?
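A rough sketch of how such a page sample could be drawn per article, assuming we know which pages coder 1 marked mentions on and the total page count (the function name, parameters, and 1-based page numbering are illustrative, not part of the existing tooling):

```python
import random

def sample_pages_for_second_coder(mention_pages, n_pages, neg_ratio=1.0, seed=42):
    """Set A: pages where coder 1 found mentions; Set B: random pages without mentions."""
    rng = random.Random(seed)
    set_a = sorted(set(mention_pages))
    candidates = [p for p in range(1, n_pages + 1) if p not in set_a]
    # draw roughly neg_ratio negative pages per positive page (at least one)
    k = min(len(candidates), max(1, round(neg_ratio * max(len(set_a), 1))))
    set_b = rng.sample(candidates, k)
    pages = set_a + set_b
    rng.shuffle(pages)  # present intermingled, with no indication of A vs B
    return pages
```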
Sorry James for answering late, I overlooked the message!
I looked at some examples and my impression is that it's valid: in general the disagreements appear on the same page (and, as you said at some point, we also see that software mentions come in "clusters").
I would not even expand Set A to the page before or after each mention; from what I saw, the pages where coder 1 found at least one mention are enough. Same for Set B: just a few random pages would be enough for inter-agreement evaluation.
It has no impact on calculating the inter-annotator agreement, because pages without annotations from either annotator would basically not be taken into account.
Last argument for doing it this way: when training an ML model in a setting where there are very few annotations per document (like software!), we usually do not use the whole document but a window around each mention (a kind of over-sampling, making mentions more frequent than they really are, to correct the "natural" bias of an ML model toward ignoring very rare events). A minimal sketch of this windowing is below.
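Here is a minimal sketch of that windowing idea, assuming mentions are available as token-offset spans (the function and its parameters are hypothetical, not the actual training pipeline):

```python
def mention_windows(tokens, mention_spans, window=100):
    """Return one training sample per mention: the mention plus `window`
    tokens of context on each side, instead of the whole document."""
    samples = []
    for start, end in mention_spans:  # token offsets of each annotated mention
        lo = max(0, start - window)
        hi = min(len(tokens), end + window)
        samples.append(tokens[lo:hi])
    return samples
```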
Hi @kermitt2 I added the article_set to the csv dataset. So now we can distinguish articles coded before or after the re-training. Is it easy to re-run agreement on just the econ articles? Only 140 have been double coded there.
I am using the csv dataset generated on Oct. 15 - do you have something more recent not committed?
I have:
econ_article: 184 documents
number of documents annotated by multiple annotators: 13 (so inter-annotator agreement is estimated on 13 documents)
field           agreement   std. error   95% CI             agreements / samples
software        0.5875      0.0004       [0.5868, 0.5882]   789 / 1343
version-number  0.3589      0.0007       [0.3575, 0.3603]   234 / 652
version-date    0.4484      0.0014       [0.4457, 0.4510]   165 / 368
creator         0.3837      0.0004       [0.3829, 0.3844]   493 / 1285
url             0.6462      0.0074       [0.6317, 0.6606]   42 / 65
quote           0.4055      0.0005       [0.4046, 0.4064]   442 / 1090
all fields      0.4508      0.0001       [0.4506, 0.4510]   2165 / 4803
I attach here some reporting on the mismatches for the econ articles with multiple annotators (so the 13 documents above). 13 documents is not a lot, but it illustrates the kind of issues we see in general across the rest of the documents.
For reporting the mismatches, I produce one file per field (software, url, version_date, version_number, creator, quote). In each file I give the name of the document and then the list of mismatches.
A mismatch is indicated first by the name of the annotator and the value that does not match, then by the name of the other annotator with the list of all the values for this field in the document, to give an idea of the reason for the mismatch.
So for instance for software name:
10.1111%2Fj.1467-6419.2007.00527.x
-------------------
tonyli0409: pasmatch2
mrcyndns: PSMATCH2 / Stata / nnmatch / Stata / psmatch2
mrcyndns: nnmatch
tonyli0409: PSMATCH2 / Stata / Stata / pasmatch2
This means pasmatch2 from tonyli0409 does not match any of the values annotated by mrcyndns (indeed there is a typo), and nnmatch from mrcyndns is not in the list of values annotated by tonyli0409. You will see a lot of obvious consistency issues which I think are easy to fix.
Here are the reports for the 13 econ documents. They are not always very readable (not readable at all for quote, because the strings are too long), but they give an idea of the kind of problems we have.
mismatch-creator.txt mismatch-quote.txt mismatch-software.txt mismatch-url.txt mismatch-version-date.txt mismatch-version-number.txt
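As an illustration of how such per-field mismatch lists can be derived from the CSV dataset, here is a simplified sketch of the comparison (the column names and data layout are assumptions for illustration; this is not the actual reporting code):

```python
from collections import defaultdict

def field_mismatches(annotations):
    """annotations: iterable of dicts with keys 'document', 'annotator', 'field', 'value'.
    Returns {field: {document: [(annotator, unmatched_value, other_annotator, other_values)]}}."""
    by_doc_field = defaultdict(lambda: defaultdict(list))
    for a in annotations:
        by_doc_field[(a["document"], a["field"])][a["annotator"]].append(a["value"])

    report = defaultdict(lambda: defaultdict(list))
    for (doc, field), per_annotator in by_doc_field.items():
        if len(per_annotator) < 2:
            continue  # document annotated by a single coder, nothing to compare
        for one, values in per_annotator.items():
            for other, other_values in per_annotator.items():
                if one == other:
                    continue
                for value in values:
                    if value not in other_values:  # no exact match on the other side
                        report[field][doc].append((one, value, other, other_values))
    return report
```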
Thanks @kermitt2, I just now uploaded more articles to the csv dataset.
Part of the issue here is that there are lots of econ articles without any mentions at all: I see 12 econ articles with mentions and more than one coder, but 112 articles with more than one coder where no mentions were found.
Clearly we need to move on from the econ dataset, because we're not getting sufficient new mentions from that coding.
That said, does it make any sense to you to produce different agreement statistics for positive and negative examples of the codes? In the past I've used Byrt's kappa to adjust for the differing prevalence of the positive and negative codes: https://www.ncbi.nlm.nih.gov/pubmed/8501467
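For reference, in the binary two-coder case (mention present vs. absent on an item), Byrt's prevalence- and bias-adjusted kappa reduces to a simple function of observed agreement. A small sketch from a 2x2 contingency table, not tied to the current pipeline:

```python
def cohen_kappa(a, b, c, d):
    """2x2 table: a = both coders positive, b = coder1 pos / coder2 neg,
    c = coder1 neg / coder2 pos, d = both negative."""
    n = a + b + c + d
    p_o = (a + d) / n
    p_e = ((a + b) * (a + c) + (c + d) * (b + d)) / (n * n)
    return (p_o - p_e) / (1 - p_e)

def pabak(a, b, c, d):
    """Prevalence- and bias-adjusted kappa (Byrt, Bishop & Carlin 1993): 2*p_o - 1."""
    p_o = (a + d) / (a + b + c + d)
    return 2 * p_o - 1
```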
@caifand Can you start to work through the issues that Patrice is identifying here. I think we'll have to do a combination of some retraining and some post-processing. We should definitely be able to resolve the version number vs software name issues via training.
OK. Glad to do.
I will produce kappa, yes; I wanted to do it from the beginning to get a better overview of agreement, I am just a bit slow on that! The dkpro statistics module makes it possible to calculate various agreement measures for sequence labelling, beyond percentage agreement and kappa: https://dkpro.github.io/dkpro-statistics/
But I think that, for the existing mentions, rare or not, it's important to homogenize and constrain the annotations produced by different annotators; otherwise the machine learning algorithm has no way to learn which variant of the correct answer is expected.
Completely agree Patrice!
Hi @kermitt2,
@caifand and I have been looking through these. Some will be relatively simple to fix, even just by looking at the selections post hoc; some will require a bit of retraining.
We'd like to start working with the certainty scores (to see if lower certainty explains disagreement) and to be clear about when coders have selected the same pieces of text (before checking whether they disagree about which bits to pick out for each field within a full_quote), both to look for fatigue issues and to know how well coders do when working with the same full_quote.
I was able to work with the XML from the PMC articles to look for overlaps, but I can't do that for the econ ones (and won't be able to for astro going forward either). I'd really like to work with the XML that comes from the PDF conversions you are doing. What's the best way to get the XML that you create for each article? Do you have it for each article in your repo somewhere? Or is there a service I can programmatically push a PDF to and get converted XML back?
You have several ways of converting PDF to XML with GROBID; they should all be very easy:
download and run the docker image to start the GROBID server: https://grobid.readthedocs.io/en/latest/Grobid-docker/
or install the service locally; it only requires JDK 1.8 and 2 command lines, see:
https://grobid.readthedocs.io/en/latest/Install-Grobid/
https://grobid.readthedocs.io/en/latest/Grobid-service/
once a server is running, you can use one of the clients to easily process a large number of PDFs, for instance: https://github.com/kermitt2/grobid-client-python (there are also a node.js and a Java client); a minimal request sketch follows after this list
there are also command lines in GROBID for batch PDF processing, without needing a server, but they work single-threaded, so they will be significantly slower than the server: https://grobid.readthedocs.io/en/latest/Grobid-batch/
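As an illustration of the service option, a minimal Python request against a running server (assuming it is reachable on localhost:8070, GROBID's default port; adjust to whatever port you mapped, and note the file paths are placeholders):

```python
import requests

GROBID_URL = "http://localhost:8070/api/processFulltextDocument"

with open("testpdfs/example.pdf", "rb") as pdf:  # placeholder input PDF
    response = requests.post(GROBID_URL, files={"input": pdf}, timeout=120)
response.raise_for_status()

with open("testoutput/example.tei.xml", "w", encoding="utf-8") as out:
    out.write(response.text)  # TEI XML produced by GROBID
```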
As it is machine learning, it's not perfect, and maybe some of the econ articles won't be transformed very well. In that case, you could consider adding a few annotated econ examples to the training data and updating the model. If you see some bad conversions, just tell me.
Hi @kermitt2
I finally got around to trying this. I installed grobid 0.5.3 using docker (following the directions above). I cloned the python client and put a PDF in a testpdf directory. I changed the port in config.json to 8080. The service receives the request, but the command line then returns (after 30s or so) without indicating any issue. Checking the server logs, it seems to die silently?
This is the commandline I used:
python3 grobid-client.py --input testpdfs/ --output testoutput/ processFulltextDocument
See server log (or at least docker commandline output) in attachment.
Hello @jameshowison !
I've just tried with this version of docker and the current python client, and it works fine for me. Note that after the server is started, the first request takes 20-25s while it loads all the models into memory.
Output of the client should be something like this:
> python3 grobid-client.py --input ~/tmp/in5 --output ~/tmp/out processFulltextDocument
1 PDF files to process
/home/lopez/tmp/in5/malaria.pdf
runtime: 23.197 seconds
For further queries, the models are already loaded, so it's faster:
> python3 grobid-client.py --input ~/tmp/in5 --output ~/tmp/out processFulltextDocument
1 PDF files to process
/home/lopez/tmp/in5/malaria.pdf
runtime: 2.029 seconds
> ls -al ~/tmp/out/
-rw-rw-r-- 1 lopez lopez 130K Jan 8 21:24 /home/lopez/tmp/out/malaria.tei.xml
Normally GROBID will log this after processing the query:
172.17.0.1 - - [08/Jan/2019:20:24:38 +0000] "POST /api/processFulltextDocument HTTP/1.1" 200 132301 "-" "python-requests/2.18.4" 23137
So if it is not behaving like that, could you give me more info about your environment so that I can try to reproduce the problem? In this case: OS, python version, docker version...
The server seems to shut down silently after loading the models; I'm not sure if you were able to see the docker command line output in the attachment on my last message. Happy to find other logs if you can guide me a little more :)
Mac OS X:
SOI-A14570-Howison:~ howison$ uname -a Darwin SOI-A14570-Howison.T-mobile.com 17.2.0 Darwin Kernel Version 17.2.0: Fri Sep 29 18:27:05 PDT 2017; root:xnu-4570.20.62~3/RELEASE_X86_64 x86_64
(env) SOI-A14570-Howison:grobid-client-python howison$ python3 --version Python 3.7.1
SOI-A14570-Howison:~ howison$ docker run -t --rm --init -p 8080:8070 -p 8081:8071 lfoppiano/grobid:0.5.3
SOI-A14570-Howison:~ howison$ docker --version Docker version 18.09.0, build 4d60db4
Docker Desktop Version 2.0.0.0-mac81 (29211) Compose 1.23.2 Machine 0.16.0
I saw the grobid server logs and any problem should be visible there; apparently it simply stops, which is something I've never seen.
To check the service differently, after launching the docker command, can you connect to the console at http://localhost:8080 ?
From there you can upload a PDF and test the service (TEI tab, select "process fulltext document", select a file, submit). If it fails there too, I will ask someone who knows docker better than me.
Otherwise you could simply install the project and build it, which is 2 command lines (you need JDK 1.8 or higher installed).
Alternatively, and maybe easiest if you don't mind the network latency, you could use my online GROBID server; just specify in the config.json file of the grobid-client-python client:
"grobid_server": "cloud.science-miner.com/grobid",
"grobid_port": "",
This requires updating the grobid-client-python project to support the empty port number:
cd grobid-client-python/
git pull
Thanks. Same behavior via the TEI tab (server dies, webpage shows red "Error encountered while requesting the server."). Very likely some docker issue?
I updated the python client and had success using your remote server, that's fine for me, all good.
I'm moving the conversation on what we're hoping to do with the XML files to here: https://github.com/howisonlab/softcite-dataset/issues/580
Yes very likely docker, so I invoke the GROBID docker master for help @lfoppiano :)
@kermitt2 @jameshowison I'm going to have a look at this in the next days
@jameshowison I have the same version of Docker as you, and yes, it seems that the container just gets shut down.
Plus I see some strange messages from the language detector:
INFO [2019-01-11 19:56:14,497] org.eclipse.jetty.server.handler.ContextHandler: Started i.d.j.MutableServletContextHandler@40a1b6d4{/,null,AVAILABLE}
INFO [2019-01-11 19:56:14,511] io.dropwizard.setup.AdminEnvironment: tasks =
POST /tasks/log-level (io.dropwizard.servlets.tasks.LogConfigurationTask)
POST /tasks/gc (io.dropwizard.servlets.tasks.GarbageCollectionTask)
INFO [2019-01-11 19:56:14,526] org.eclipse.jetty.server.handler.ContextHandler: Started i.d.j.MutableServletContextHandler@79ab97fd{/,null,AVAILABLE}
INFO [2019-01-11 19:56:14,555] org.eclipse.jetty.server.AbstractConnector: Started application@429aeac1{HTTP/1.1,[http/1.1]}{0.0.0.0:8070}
INFO [2019-01-11 19:56:14,557] org.eclipse.jetty.server.AbstractConnector: Started admin@79eeff87{HTTP/1.1,[http/1.1]}{0.0.0.0:8071}
INFO [2019-01-11 19:56:14,557] org.eclipse.jetty.server.Server: Started @9114ms
INFO [2019-01-11 19:57:34,254] org.grobid.core.factory.GrobidPoolingFactory: Number of Engines in pool active/max: 1/10
INFO [2019-01-11 19:57:34,256] org.grobid.core.factory.GrobidPoolingFactory: Number of Engines in pool active/max: 2/10
INFO [2019-01-11 19:57:34,257] org.grobid.core.factory.GrobidPoolingFactory: Number of Engines in pool active/max: 3/10
INFO [2019-01-11 19:57:34,258] org.grobid.core.factory.GrobidPoolingFactory: Number of Engines in pool active/max: 4/10
INFO [2019-01-11 19:57:34,259] org.grobid.core.factory.GrobidPoolingFactory: Number of Engines in pool active/max: 5/10
INFO [2019-01-11 19:57:34,260] org.grobid.core.factory.GrobidPoolingFactory: Number of Engines in pool active/max: 6/10
INFO [2019-01-11 19:57:34,261] org.grobid.core.factory.GrobidPoolingFactory: Number of Engines in pool active/max: 7/10
INFO [2019-01-11 19:57:34,262] org.grobid.core.factory.GrobidPoolingFactory: Number of Engines in pool active/max: 8/10
INFO [2019-01-11 19:57:34,262] org.grobid.core.factory.GrobidPoolingFactory: Number of Engines in pool active/max: 9/10
INFO [2019-01-11 19:57:34,262] org.grobid.core.factory.GrobidPoolingFactory: Number of Engines in pool active/max: 10/10
INFO [2019-01-11 19:57:34,330] org.grobid.core.jni.WapitiModel: Loading model: /opt/grobid/grobid-home/models/fulltext/model.wapiti (size: 20462507)
[Wapiti] Loading model: "/opt/grobid/grobid-home/models/fulltext/model.wapiti"
Model path: /opt/grobid/grobid-home/models/fulltext/model.wapiti
INFO [2019-01-11 19:57:37,718] org.grobid.core.jni.WapitiModel: Loading model: /opt/grobid/grobid-home/models/segmentation/model.wapiti (size: 15807193)
[Wapiti] Loading model: "/opt/grobid/grobid-home/models/segmentation/model.wapiti"
Model path: /opt/grobid/grobid-home/models/segmentation/model.wapiti
[Wapiti] Loading model: "/opt/grobid/grobid-home/models/header/model.wapiti"
INFO [2019-01-11 19:57:40,659] org.grobid.core.jni.WapitiModel: Loading model: /opt/grobid/grobid-home/models/header/model.wapiti (size: 36094028)
Model path: /opt/grobid/grobid-home/models/header/model.wapiti
INFO [2019-01-11 19:57:49,343] org.grobid.core.jni.WapitiModel: Loading model: /opt/grobid/grobid-home/models/figure/model.wapiti (size: 679648)
[Wapiti] Loading model: "/opt/grobid/grobid-home/models/figure/model.wapiti"
Model path: /opt/grobid/grobid-home/models/figure/model.wapiti
INFO [2019-01-11 19:57:49,677] org.grobid.core.jni.WapitiModel: Loading model: /opt/grobid/grobid-home/models/table/model.wapiti (size: 1337339)
[Wapiti] Loading model: "/opt/grobid/grobid-home/models/table/model.wapiti"
Model path: /opt/grobid/grobid-home/models/table/model.wapiti
WARN [2019-01-11 19:57:50,821] org.grobid.core.utilities.LanguageUtilities: Cannot detect language because of: java.lang.IllegalStateException: Cannot read profiles for cybozu language detection from: /opt/grobid/grobid-home/language-detection/cybozu/profiles
INFO [2019-01-11 19:57:50,825] org.grobid.core.jni.WapitiModel: Loading model: /opt/grobid/grobid-home/models/name/header/model.wapiti (size: 2225578)
[Wapiti] Loading model: "/opt/grobid/grobid-home/models/name/header/model.wapiti"
WARN [2019-01-11 19:57:50,827] org.grobid.core.utilities.LanguageUtilities: Cannot detect language because of: java.lang.IllegalStateException: Cannot read profiles for cybozu language detection from: /opt/grobid/grobid-home/language-detection/cybozu/profiles
[the same LanguageUtilities warning is repeated seven more times with different timestamps]
WARN [2019-01-11 19:57:50,870] org.grobid.core.lang.impl.CybozuLanguageDetector: Cannot detect language because of: com.cybozu.labs.langdetect.LangDetectException: no features in text
Model path: /opt/grobid/grobid-home/models/name/header/model.wapiti
INFO [2019-01-11 19:57:51,050] org.grobid.core.jni.WapitiModel: Loading model: /opt/grobid/grobid-home/models/name/citation/model.wapiti (size: 393118)
[Wapiti] Loading model: "/opt/grobid/grobid-home/models/name/citation/model.wapiti"
Model path: /opt/grobid/grobid-home/models/name/citation/model.wapiti
INFO [2019-01-11 19:57:51,092] org.grobid.core.jni.WapitiModel: Loading model: /opt/grobid/grobid-home/models/affiliation-address/model.wapiti (size: 2700194)
[Wapiti] Loading model: "/opt/grobid/grobid-home/models/affiliation-address/model.wapiti"
Model path: /opt/grobid/grobid-home/models/affiliation-address/model.wapiti
INFO [2019-01-11 19:57:51,444] org.grobid.core.jni.WapitiModel: Loading model: /opt/grobid/grobid-home/models/citation/model.wapiti (size: 16235248)
[Wapiti] Loading model: "/opt/grobid/grobid-home/models/citation/model.wapiti"
Model path: /opt/grobid/grobid-home/models/citation/model.wapiti
[Wapiti] Loading model: "/opt/grobid/grobid-home/models/reference-segmenter/model.wapiti"
INFO [2019-01-11 19:57:58,119] org.grobid.core.jni.WapitiModel: Loading model: /opt/grobid/grobid-home/models/reference-segmenter/model.wapiti (size: 4829569)
Model path: /opt/grobid/grobid-home/models/reference-segmenter/model.wapiti
172.17.0.1 - - [11/Jan/2019:19:58:00 +0000] "POST /api/processFulltextDocument HTTP/1.1" 200 1194 "-" "python-requests/2.19.1" 26402
INFO [2019-01-11 19:58:00,234] org.grobid.core.factory.GrobidPoolingFactory: Number of Engines in pool active/max: 10/10
INFO [2019-01-11 19:58:01,621] org.grobid.core.jni.WapitiModel: Loading model: /opt/grobid/grobid-home/models/date/model.wapiti (size: 102435)
[Wapiti] Loading model: "/opt/grobid/grobid-home/models/date/model.wapiti"
Model path: /opt/grobid/grobid-home/models/date/model.wapiti
However the docker container does not log anything whatsoever... that's very weird...
I'm continuing to dig :-)
@kermitt2 and @caifand Can we circle back to this? Patrice, we're trying to get some of these disagreements reviewed. In your workflow, I think you locate each full_quote in your post-GROBID TEI XML file? And then you drop examples with disagreement? Any chance you could produce some output that points us to selections with disagreements so that we can review them? Just the ones that could otherwise be training input (i.e. in-text mentions coded as software).
Something like:
article, coder, selection, tei_xml_filename, tei_xml_offset, full_quote
then we could look for overlap (or the lack of it) and work through the two types of disagreement.
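In case it helps, a sketch of the overlap check on such a table, assuming each selection also carries a character length alongside its tei_xml_offset (the length column and helper names are assumptions for illustration):

```python
def spans_overlap(start1, len1, start2, len2):
    """True if two character spans in the same TEI file overlap."""
    return start1 < start2 + len2 and start2 < start1 + len1

def selection_overlaps(rows):
    """rows: list of dicts with keys 'article', 'coder', 'tei_xml_offset', 'length'.
    Yields pairs of selections by different coders that overlap in the same article."""
    for i, r1 in enumerate(rows):
        for r2 in rows[i + 1:]:
            if r1["article"] != r2["article"] or r1["coder"] == r2["coder"]:
                continue
            if spans_overlap(r1["tei_xml_offset"], r1["length"],
                             r2["tei_xml_offset"], r2["length"]):
                yield r1, r2
```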
Just an update (quite late - sorry) on https://github.com/howisonlab/softcite-dataset/issues/538#issuecomment-453768745: it seems that giving more memory to the docker host prevents it from killing the container (you can follow this specific issue at https://github.com/kermitt2/grobid/issues/416).
I produced some statistics on inter-annotator agreement (IAA). For the moment I am using a simple percentage agreement measure, which I think is OK at this stage (agreement by chance is negligible here because there are only a couple of annotations per document of 2000-3000 tokens, and the classes are not too unbalanced). If I didn't make an implementation error (which is possible!), this should be the current IAA estimate with standard error (confidence interval at 95% confidence):
So we see that inter-annotator agreement is very low, and this has to be put in perspective with the current accuracy estimate of the simple CRF annotator (in the range of a 55-60 f-score). This also includes the 4% of annotated mentions which do not match the provided quote, issue #507. It means that at this stage the data is not reliable as training data for a supervised ML annotator. Usual IAA for NER corpora is in the range of 95-99.
I am going to try to compile the mismatches in a way that is usable for a "consensus" stage, and to identify the types of mismatches to help prioritize them. This will also help to make sure that I didn't make an implementation error. I will also try to produce more robust IAA measures in the π, κ, and α families (using https://dkpro.github.io/dkpro-statistics/).
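A rough sketch of the kind of estimate reported here, assuming the annotations are reduced to aligned pairs of values per field (one pair per sample), with the standard error obtained by bootstrap resampling; this is an illustration of the measure, not the actual implementation:

```python
import random

def percentage_agreement(pairs):
    """pairs: list of (value_from_annotator_1, value_from_annotator_2) for one field."""
    return sum(v1 == v2 for v1, v2 in pairs) / len(pairs)

def agreement_with_ci(pairs, n_boot=2000, alpha=0.05, seed=0):
    """Percentage agreement with a bootstrap 95% confidence interval."""
    rng = random.Random(seed)
    stats = sorted(
        percentage_agreement([rng.choice(pairs) for _ in pairs])
        for _ in range(n_boot)
    )
    lo = stats[int(alpha / 2 * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return percentage_agreement(pairs), (lo, hi)
```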
I think we could consider improving the annotations by enforcing the following methodology: