openminted / uc-tdm-socialsciences

Social sciences literature Text Mining software collection
https://builds.openminted.eu/job/uc-socialsciences/ws/ss-doc/target/generated-docs/user-guide.html

Train and use own model for stanford NER #17

Closed. maxxkia closed this issue 7 years ago.

maxxkia commented 7 years ago

Tasks:

maxxkia commented 7 years ago

@neumannm I created mapping files and put them under the following directory

resources/de/tudarmstadt/ukp/dkpro/core/stanford/lib/

I think if we put the trained models inside this directory and fix the properties files so that they point to the correct model, the pipeline should work. StanfordNER will then automatically pick the correct model, and there is no need to hard-code modelLocation.
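
For illustration, the directory could then end up containing something like this (a sketch; all names except the *.ser.gz model files appear elsewhere in this thread, and the model file names are placeholders that depend on how the models are packaged):

    resources/de/tudarmstadt/ukp/dkpro/core/stanford/lib/
        ner-variants.map                  (maps language -> model variant)
        ner-de-ss_model.crf.map           (tag -> type mapping, German)
        ner-en-ss_model.crf.map           (tag -> type mapping, English)
        ner-de-ss_model.crf.properties    (model metadata, German)
        ner-en-ss_model.crf.properties    (model metadata, English)
        ner-de-ss_model.crf.ser.gz        (trained model, German; placeholder name)
        ner-en-ss_model.crf.ser.gz        (trained model, English; placeholder name)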

neumannm commented 7 years ago

Sounds good. What exactly are the properties files about, and how are they used?

maxxkia commented 7 years ago

@neumannm the variants mapping file

resources/de/tudarmstadt/ukp/dkpro/core/stanford/lib/ner-variants.map

maps languages to NER models. I added this file as a parameter to StanfordNamedEntityRecognizer in

eu.openminted.uc.socialsciences.ner.main.Pipeline.main()

so the component should be able to find the right model for English and German automatically.

There are also type mapping files, which are used internally by annotation components in DKPro Core to map the types detected by the underlying component (e.g. PERind, ORGsci, etc. in our case) to our defined type system (for instance webanno.custom.NamedEntity.PERind); a sample excerpt follows the file list below.

resources/de/tudarmstadt/ukp/dkpro/core/stanford/lib/ner-de-ss_model.crf.map
resources/de/tudarmstadt/ukp/dkpro/core/stanford/lib/ner-en-ss_model.crf.map
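
Such a mapping file might contain entries roughly like these (a hypothetical excerpt; the left-hand tags and the exact target types depend on the model and type system):

    # fallback for tags without an explicit mapping
    *=webanno.custom.NamedEntity
    # one entry per tag emitted by the model
    PERind=webanno.custom.NamedEntity.PERind
    ORGsci=webanno.custom.NamedEntity.ORGsci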

I also added the path to these files as a parameter for the StanfordNamedEntityRecognizer in our pipeline:

"classpath:/de/tudarmstadt/ukp/dkpro/core/stanfordnlp/lib/ner-${language}-${variant}.map"

So, for instance, for German this should resolve to ner-de-ss_model.crf.map (based on the ner-variants.map file), which is the correct mapping file for German.
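
Put together, the relevant part of Pipeline.main() might look roughly like this (a sketch using uimaFIT; it assumes DKPro Core's PARAM_NAMED_ENTITY_MAPPING_LOCATION parameter, and the class name PipelineSketch is illustrative):

    import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;

    import org.apache.uima.analysis_engine.AnalysisEngineDescription;

    import de.tudarmstadt.ukp.dkpro.core.stanfordnlp.StanfordNamedEntityRecognizer;

    public class PipelineSketch {
        public static void main(String[] args) throws Exception {
            // Point the NER component at our custom tag-to-type mapping files;
            // ${language} and ${variant} are resolved at runtime by the
            // mapping provider.
            AnalysisEngineDescription ner = createEngineDescription(
                    StanfordNamedEntityRecognizer.class,
                    StanfordNamedEntityRecognizer.PARAM_NAMED_ENTITY_MAPPING_LOCATION,
                    "classpath:/de/tudarmstadt/ukp/dkpro/core/stanfordnlp/lib/"
                            + "ner-${language}-${variant}.map");
            // ... build the rest of the pipeline (reader, segmenter, etc.) and run it ...
        }
    }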

@reckart can you verify this?

maxxkia commented 7 years ago

I also manually created properties files for our models:

ner-de-ss_model.crf.properties
ner-en-ss_model.crf.properties
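
A model metadata file of this kind might look roughly like this (the keys are the ones discussed later in this thread; the values below are placeholders):

    # placeholder values; the real ones describe the actual model file
    downloaded=2017-01-01
    tool=ner
    language=de
    variant=ss_model.crf
    md5=<md5 checksum of the model file>
    sha1=<sha1 checksum of the model file>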

@reckart is this the correct way to do it?

reckart commented 7 years ago

If the map files follow the naming conventions of DKPro Core, then setting the parameter in the pipeline should not be necessary. Cf. this line in StanfordNamedEntityRecognizer:

    mappingProvider.setDefault(MappingProvider.LOCATION, "classpath:/de/tudarmstadt/ukp/dkpro/"
            + "core/stanfordnlp/lib/ner-${language}-${variant}.map");

As for the properties files that control the training: well, yes, you either get some default files from the Stanford CoreNLP GitHub or you build your own.
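
For reference, a minimal training properties file in the style of the Stanford NER examples might look like this (values are illustrative; trainFile and serializeTo would point to our training data and output model):

    trainFile = training-data.tsv
    serializeTo = ner-de-ss_model.crf.ser.gz
    # column layout of the training file: token, then gold label
    map = word=0,answer=1
    useClassFeature = true
    useWord = true
    useNGrams = true
    noMidNGrams = true
    maxNGramLeng = 6
    usePrev = true
    useNext = true
    useSequences = true
    usePrevSequences = true
    maxLeft = 1
    wordShape = chris2useLC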

Instead of changing the default variants file, however, you should just specify PARAM_VARIANT in your pipeline.
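
Using the same imports as in the sketch above, that would be roughly (the variant name ss_model.crf is inferred from the properties file names in this thread):

    // Select the custom model via its variant name instead of
    // overriding the default variants file.
    AnalysisEngineDescription ner = createEngineDescription(
            StanfordNamedEntityRecognizer.class,
            StanfordNamedEntityRecognizer.PARAM_VARIANT, "ss_model.crf");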

neumannm commented 7 years ago

Ok, I don't quite get what the properties files you created are for. They don't seem to be the properties that control the training, because they contain 'downloaded', 'tool', 'language', 'variant', 'md5' and 'sha1' entries, which are not part of the training properties.

The current training properties file is trainingProperties.txt. Should I move this somewhere or rename it? Also, should we turn the training component into a DKPro component as well? Currently it's just a regular Java class with a main method.

maxxkia commented 7 years ago

@neumannm these properties files are only used by the StanfordNamedEntityRecognizer component in DKPro Core; they play no role in training. So we still need the trainingProperties.txt file for training, and it can be placed in any directory we wish.

It's a good idea to convert it to a UIMA component so it can be reused by others as well.

reckart commented 7 years ago

So these properties files that contain downloaded, tool, etc. belong to the model-resolving mechanism. Neither these nor the model files should be in the repository/source folders. Instead, there should be a src/scripts/build.xml file, an Ant script which downloads the models from some website or digital repository (e.g. CLARIN LINDAT) and packages them up into Maven artifacts, which are then deployed on zoidberg.
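
A minimal sketch of what such a src/scripts/build.xml could look like, using plain Ant tasks rather than the actual DKPro Core build macros (the download URL and artifact names are placeholders):

    <project name="ss-ner-models" default="package">
        <target name="package">
            <mkdir dir="target/download"/>
            <!-- fetch the trained model from its hosting site (placeholder URL) -->
            <get src="https://example.org/models/ner-de-ss_model.crf.ser.gz"
                 dest="target/download/ner-de-ss_model.crf.ser.gz"
                 skipexisting="true"/>
            <!-- package the model for deployment as a Maven artifact -->
            <jar destfile="target/stanfordnlp-model-ner-de-ss_model.crf.jar">
                <fileset dir="target/download"/>
            </jar>
        </target>
    </project>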

reckart commented 7 years ago

@neumannm a UIMA component that would train a Stanford NER model would be great to have. It would actually be a fine contribution to DKPro Core.

neumannm commented 7 years ago

Ok, thanks for the clarifications, both of you.

@reckart I thought that, too. I'm not sure whether I will be able to do this, because I still don't quite get all the mechanisms involved in uimaFIT/DKPro Core. But I will certainly try.

reckart commented 7 years ago

We have a NER trainer for OpenNLP in DKPro Core that might serve as a template for you: https://github.com/dkpro/dkpro-core/blob/master/dkpro-core-opennlp-asl/src/main/java/de/tudarmstadt/ukp/dkpro/core/opennlp/OpenNlpNamedEntityRecognizerTrainer.java
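
For a rough idea, a skeletal trainer along those lines might look like this (a sketch only, loosely following the OpenNLP trainer above; the class and parameter names are illustrative, and the conversion of CAS annotations into Stanford's training format is omitted):

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.Properties;

    import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
    import org.apache.uima.fit.component.JCasConsumer_ImplBase;
    import org.apache.uima.fit.descriptor.ConfigurationParameter;
    import org.apache.uima.jcas.JCas;

    import edu.stanford.nlp.ie.crf.CRFClassifier;
    import edu.stanford.nlp.ling.CoreLabel;

    public class StanfordNerTrainer extends JCasConsumer_ImplBase {
        public static final String PARAM_TARGET_LOCATION = "targetLocation";
        @ConfigurationParameter(name = PARAM_TARGET_LOCATION, mandatory = true)
        private File targetLocation;

        public static final String PARAM_PROPERTIES_LOCATION = "propertiesLocation";
        @ConfigurationParameter(name = PARAM_PROPERTIES_LOCATION, mandatory = true)
        private File propertiesLocation; // e.g. trainingProperties.txt

        private File trainFile = new File("target/train.tsv");

        @Override
        public void process(JCas aJCas) throws AnalysisEngineProcessException {
            // Convert the tokens and NamedEntity annotations of each CAS into
            // the tab-separated token/label format Stanford NER expects and
            // append them to trainFile (omitted here for brevity).
        }

        @Override
        public void collectionProcessComplete() throws AnalysisEngineProcessException {
            try (FileInputStream in = new FileInputStream(propertiesLocation)) {
                Properties props = new Properties();
                props.load(in);
                props.setProperty("trainFile", trainFile.getAbsolutePath());
                props.setProperty("serializeTo", targetLocation.getAbsolutePath());
                // Train a CRF model with the given training properties and save it
                CRFClassifier<CoreLabel> crf = new CRFClassifier<>(props);
                crf.train();
                crf.serializeClassifier(targetLocation.getAbsolutePath());
            }
            catch (IOException e) {
                throw new AnalysisEngineProcessException(e);
            }
        }
    }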

For a contribution to DKPro Core, following the contribution guidelines, we'd also need a contributor license agreement (see: http://dkpro.github.io/contributing/).

neumannm commented 7 years ago

@reckart Thanks. Apart from the license issue, do you think you could quickly walk me through the process of contributing a single component? I had a look at the project website, but I'm unsure whether the "Release Guide" is what I have to follow. It looks really complex for "just" adding a new component. I have never contributed to any project, so I'm not familiar with these processes.

maxxkia commented 7 years ago

It's fairly simple. Before anything else, you need to send a signed copy of the license agreement. Then: first, apply the provided Eclipse style to your IDE if you are using Eclipse. Second, put your class in the appropriate module (stanfordnlp-gpl) with the appropriate license header (like the other source files in that module). Finally, commit and push your changes.