kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0
3.5k stars 449 forks

Gold training data web editor #139

Open kaplun opened 8 years ago

kaplun commented 8 years ago

Currently, prospective training data producers face a barrier to entry: they have to manually edit the TEI XML produced by Grobid to introduce corrections that can later be fed back into the training.

It would be great to have a web interface similar to Google Structured Data Markup Helper that would greatly reduce the effort of producing training data.

Was such an interface already considered or planned?

lfoppiano commented 8 years ago

Hi @kaplun, thanks for your email. Having an interface to help correct training data is, without doubt, a very nice-to-have feature. That said, I think that with or without an interface, correcting training data is a difficult, boring and frustrating job anyway :) On the other hand, I have to say that by generating pre-labelled training data, GROBID already saves a lot of work. We have seen that almost anybody with some knowledge of the training guidelines can easily and successfully correct training data.

To answer your question: we have thought about it, but given the time available (nobody is full-time on this project) and the amount of stuff we have on our plate, we have no plans at the moment. However, GROBID is an open-source project and we would be happy to include additional features from external contributors. ;-)

Cheers, Luca

kaplun commented 8 years ago

Hi @lfoppiano. Thanks for the confirmation, I fully understand. I was indeed just checking: since you confirm that nobody is actively working on this yet, we might plan to contribute such an interface in the future.

lfoppiano commented 8 years ago

@kaplun cool stuff. :-) How advanced are you on this? I'm asking because we have already given some thought to the subject. We could exchange ideas and solutions.

kaplun commented 8 years ago

Completely blank slate :) We are just considering such functionality as a possible future project. We haven't even allocated resources to it yet :)

What I can tell you is that if we were to implement it, we would build it as a generic, independent component, using Angular 2 + Bootstrap for the front end and Python-Flask for the backend.

cc: @jmartinm

lfoppiano commented 7 years ago

@kaplun I forgot to mention, you might want to have a look at https://github.com/Vi-dot/grobid-smecta ;-) At the moment it is tailored to handle astronomical training data (https://github.com/kermitt2/grobid-astro), but it looks promising. This could be a common point of discussion in Berlin :-)