Relation extraction model

stanfordnlp / CoreNLP

CoreNLP: A Java suite of core NLP tools for tokenization, sentence segmentation, NER, parsing, coreference, sentiment analysis, etc.

http://stanfordnlp.github.io/CoreNLP/

GNU General Public License v3.0

9.71k stars 2.7k forks source link

Relation extraction model #1177

Closed iresiragusa closed 3 years ago

iresiragusa commented 3 years ago

Hi! I've developed my corpus to train a custom relation extraction model as explained in https://nlp.stanford.edu/software/relationExtractor.html

There is a demo of my corpus (DatasetDemoTrain2.corp):

1   Peop    0   O   NNP/NNP Harry/Potter    O   O   O
1   O   1   O   VBD was O   O   O
1   O   2   O   VBN born    O   O   O
1   O   3   O   IN  in  O   O   O
1   Loc 4   O   NNP London  O   O   O
1   O   5   O   .   .   O   O   O

0   4   birth_place

I've used the roth.properties file from here https://nlp.stanford.edu/software/roth.properties modifying only trainpath, serializedTrainingSentencesPath, serializedEntityExtractorPath, serializedRelationExtractorPath, by putting directly the file names.

Then in the folder with (and only) roth.properties and DatasetDemoTrain2.corp I executed this command line: java -cp /Users/irene/Documents/GitHub/Tirocinio3-git/lib/stanford-corenlp-4.2.2/stanford-corenlp-4.2.2.jar edu.stanford.nlp.ie.machinereading.MachineReading && /Users/irene/Documents/GitHub/Tirocinio3-git/lib/stanford-corenlp-4.2.2/stanford-corenlp-4.2.2-models.jar edu.stanford.nlp.models.pos-tagger.english-left3words-distsim.tagger --arguments roth.properties

I had also to specify the tagger path (otherwise it gets me some errors since it couldn't find it).

Unfortunately it doesn't work and give me this error:

Missing required option: datasetreaderclass   <in class: class edu.stanford.nlp.ie.machinereading.MachineReadingProperties>
Missing required option: serializedtrainingsentencespath   <in class: class edu.stanford.nlp.ie.machinereading.MachineReadingProperties>
Exception in thread "main" java.lang.RuntimeException: Specified properties are not parsable or not valid!
    at edu.stanford.nlp.util.ArgumentParser.fillOptionsImpl(ArgumentParser.java:483)
    at edu.stanford.nlp.util.ArgumentParser.fillOptionsImpl(ArgumentParser.java:495)
    at edu.stanford.nlp.util.ArgumentParser.fillOptions(ArgumentParser.java:543)
    at edu.stanford.nlp.util.ArgumentParser.fillOptions(ArgumentParser.java:581)
    at edu.stanford.nlp.ie.machinereading.MachineReading.makeMachineReading(MachineReading.java:207)
    at edu.stanford.nlp.ie.machinereading.MachineReading.main(MachineReading.java:110)

Do you know how I can fix it or what I'm doing wrong?

Thank you in advance.

AngledLuffa commented 3 years ago

That's a pretty old piece of software right there.

My first guess is that you changed the properties in some way so that it's no longer usable. It looks like the example properties file has "datasetReaderClass" set, so there's no reason that should be missing unless you edited that to be unreadable somehow.

iresiragusa commented 3 years ago

I modified the roth.properties file only in the path fields while the other parts were left unchanged. The dataset is a .corp file generated by a java program by processing another corpus.

I know it's a bit old, but I didn't find in openIE a way to get the relation name from the relations extracted. Maybe it is possible with the kbp module, but I wasn't able to make it work (it gives me errors I cannot how to fix).

In essence I'm using openIE as a tool to know if in a phrase there is a relation, then I need to know the name of the extracted relation (in the prevoiuos example, I need the triple <Harry Potter, London, place_of_birth> and not <Harry Potter, London, was born in>). I find that the relation module can be trained on a particular corpus to find the desidered relation to be classified. Since this match with my desidered task, I decided to use it.

AngledLuffa commented 3 years ago

I don't know what you're doing wrong, unfortunately. If I download the sample roth.properties and run it on the conll04.corp file, it works fine for me. Can I suggest starting over with roth.properties to see if that fixes it? You could always try to paste it here and we'll take a look

iresiragusa commented 3 years ago

I will try to run those properties with the original corpurs (unfortunately the download from here http://cogcomp.seas.upenn.edu/Data/ER/ doesn't start, I'll try from other sources).

Here is my properties file:

#Below are some basic options. See edu.stanford.nlp.ie.machinereading.MachineReadingProperties class for more options.

# Pipeline options
annotators = pos, lemma, parse
parse.maxlen = 100

# MachineReading properties. You need one class to read the dataset into correct format. See edu.stanford.nlp.ie.machinereading.domains.ace.AceReader for another example.
datasetReaderClass = edu.stanford.nlp.ie.machinereading.domains.roth.RothCONLL04Reader

#Data directory for training. The datasetReaderClass reads data from this path and makes corresponding sentences and annotations.
trainPath = DatasetDemoTrain2.corp

#Whether to crossValidate, that is evaluate, or just train.
crossValidate = false
kfold = 10

#Change this to true if you want to use CoreNLP pipeline generated NER tags. The default model generated with the relation extractor release uses the CoreNLP pipeline provided tags (option set to true).
trainUsePipelineNER=false

# where to save training sentences. uses the file if it exists, otherwise creates it.
serializedTrainingSentencesPath = sentences.ser

serializedEntityExtractorPath = entity_model.ser

# where to store the output of the extractor (sentence objects with relations generated by the model). This is what you will use as the model when using 'relation' annotator in the CoreNLP pipeline.
serializedRelationExtractorPath = relation_model_pipeline.ser

# uncomment to load a serialized model instead of retraining
# loadModel = true

#relationResultsPrinters = edu.stanford.nlp.ie.machinereading.RelationExtractorResultsPrinter,edu.stanford.nlp.ie.machinereading.domains.roth.RothResultsByRelation. For printing output of the model.
relationResultsPrinters = edu.stanford.nlp.ie.machinereading.RelationExtractorResultsPrinter

#In this domain, this is trivial since all the entities are given (or set using CoreNLP NER tagger).
entityClassifier = edu.stanford.nlp.ie.machinereading.domains.roth.RothEntityExtractor

extractRelations = true
extractEvents = false

#We are setting the entities beforehand so the model does not learn how to extract entities etc.
extractEntities = false

#Opposite of crossValidate. 
trainOnly=true

# The set chosen by feature selection using RothCONLL04:
relationFeatures = arg_words,arg_type,dependency_path_lowlevel,dependency_path_words,surface_path_POS,entities_between_args,full_tree_path

# The above features plus the features used in Bjorne BioNLP09:
# relationFeatures = arg_words,arg_type,dependency_path_lowlevel,dependency_path_words,surface_path_POS,entities_between_args,full_tree_path,dependency_path_POS_unigrams,dependency_path_word_n_grams,dependency_path_POS_n_grams,dependency_path_edge_lowlevel_n_grams,dependency_path_edge-node-edge-grams_lowlevel,dependency_path_node-edge-node-grams_lowlevel,dependency_path_directed_bigrams,dependency_path_edge_unigrams,same_head,entity_counts

It is located in the Train directory along with DatasetDemoTrain2.corp file.

Still thank you for you time.

AngledLuffa commented 3 years ago

As for the original corpus, this link worked fine for me:

http://cogcomp.cs.illinois.edu/Data/ER/conll04.corp

iresiragusa commented 3 years ago

I have tried to run this command: java -cp /Users/irene/Documents/GitHub/Tirocinio3-git/lib/stanford-corenlp-4.2.2/stanford-corenlp-4.2.2.jar edu.stanford.nlp.ie.machinereading.MachineReading --arguments roth.properties

and it gaves me this:

PERCENTAGE OF TRAIN: 1.0
The reader log level is set to SEVERE
Adding annotator pos
Exception in thread "main" edu.stanford.nlp.io.RuntimeIOException: Error while loading a tagger model (probably missing model file)
    at edu.stanford.nlp.tagger.maxent.MaxentTagger.readModelAndInit(MaxentTagger.java:801)
    at edu.stanford.nlp.tagger.maxent.MaxentTagger.<init>(MaxentTagger.java:322)
    at edu.stanford.nlp.tagger.maxent.MaxentTagger.<init>(MaxentTagger.java:275)
    at edu.stanford.nlp.pipeline.POSTaggerAnnotator.loadModel(POSTaggerAnnotator.java:85)
    at edu.stanford.nlp.pipeline.POSTaggerAnnotator.<init>(POSTaggerAnnotator.java:73)
    at edu.stanford.nlp.pipeline.AnnotatorImplementations.posTagger(AnnotatorImplementations.java:68)
    at edu.stanford.nlp.pipeline.StanfordCoreNLP.lambda$getNamedAnnotators$5(StanfordCoreNLP.java:539)
    at edu.stanford.nlp.pipeline.StanfordCoreNLP.lambda$null$32(StanfordCoreNLP.java:620)
    at edu.stanford.nlp.util.Lazy$3.compute(Lazy.java:126)
    at edu.stanford.nlp.util.Lazy.get(Lazy.java:31)
    at edu.stanford.nlp.pipeline.AnnotatorPool.get(AnnotatorPool.java:149)
    at edu.stanford.nlp.pipeline.StanfordCoreNLP.<init>(StanfordCoreNLP.java:253)
    at edu.stanford.nlp.pipeline.StanfordCoreNLP.<init>(StanfordCoreNLP.java:194)
    at edu.stanford.nlp.ie.machinereading.MachineReading.makeMachineReading(MachineReading.java:233)
    at edu.stanford.nlp.ie.machinereading.MachineReading.main(MachineReading.java:110)
Caused by: java.io.IOException: Unable to open "edu/stanford/nlp/models/pos-tagger/english-left3words-distsim.tagger" as class path, filename or URL
    at edu.stanford.nlp.io.IOUtils.getInputStreamFromURLOrClasspathOrFileSystem(IOUtils.java:501)
    at edu.stanford.nlp.tagger.maxent.MaxentTagger.readModelAndInit(MaxentTagger.java:798)
    ... 14 more

so I run that other command specifying the missing model: java -cp /Users/irene/Documents/GitHub/Tirocinio3-git/lib/stanford-corenlp-4.2.2/stanford-corenlp-4.2.2.jar edu.stanford.nlp.ie.machinereading.MachineReading && /Users/irene/Documents/GitHub/Tirocinio3-git/lib/stanford-corenlp-4.2.2/stanford-corenlp-4.2.2-models.jar edu.stanford.nlp.models.pos-tagger.english-left3words-distsim.tagger --arguments roth.properties

and it gaves me back the same error as before:

Missing required option: datasetreaderclass   <in class: class edu.stanford.nlp.ie.machinereading.MachineReadingProperties>
Missing required option: serializedtrainingsentencespath   <in class: class edu.stanford.nlp.ie.machinereading.MachineReadingProperties>
Exception in thread "main" java.lang.RuntimeException: Specified properties are not parsable or not valid!
    at edu.stanford.nlp.util.ArgumentParser.fillOptionsImpl(ArgumentParser.java:483)
    at edu.stanford.nlp.util.ArgumentParser.fillOptionsImpl(ArgumentParser.java:495)
    at edu.stanford.nlp.util.ArgumentParser.fillOptions(ArgumentParser.java:543)
    at edu.stanford.nlp.util.ArgumentParser.fillOptions(ArgumentParser.java:581)
    at edu.stanford.nlp.ie.machinereading.MachineReading.makeMachineReading(MachineReading.java:207)
    at edu.stanford.nlp.ie.machinereading.MachineReading.main(MachineReading.java:110)

The folder at /Users/irene/Documents/GitHub/Tirocinio3-git/lib/stanford-corenlp-4.2.2/ is the CoreNLP downloaded from here https://stanfordnlp.github.io/CoreNLP/download.html and unzipped.

AngledLuffa commented 3 years ago

That's a really weird command line. The way to specify multiple jars in your classpath on Windows is with a ; so for example

-cp "/Users/irene/Documents/GitHub/Tirocinio3-git/lib/stanford-corenlp-4.2.2/stanford-corenlp-4.2.2.jar;/Users/irene/Documents/GitHub/Tirocinio3-git/lib/stanford-corenlp-4.2.2/stanford-corenlp-4.2.2-models.jar"

but you can and should use a wildcard to include everything in that directory:

-cp "/Users/irene/Documents/GitHub/Tirocinio3-git/lib/stanford-corenlp-4.2.2/*"

https://docs.oracle.com/javase/7/docs/technotes/tools/windows/classpath.html

iresiragusa commented 3 years ago

Hi,

The train with the original corpus went well (along with the test). I tried to do the same thing with my corpus and the command you suggest, unfortunately this is my terminal error:

MBP-di-Irene:Train irene$ java -cp "/Users/irene/Documents/GitHub/Tirocinio3-git/lib/stanford-corenlp-4.2.2/*" edu.stanford.nlp.ie.machinereading.MachineReading --arguments roth.properties
[main] INFO edu.stanford.nlp.ie.machinereading.MachineReading - PERCENTAGE OF TRAIN: 1.0
[main] INFO edu.stanford.nlp.ie.machinereading.MachineReading - The reader log level is set to SEVERE
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
[main] INFO edu.stanford.nlp.tagger.maxent.MaxentTagger - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words-distsim.tagger ... done [0.5 sec].
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator lemma
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator parse
[main] INFO edu.stanford.nlp.parser.common.ParserGrammar - Loading parser from serialized file edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz ... done [0.8 sec].
set 02, 2021 10:46:03 AM edu.stanford.nlp.ie.machinereading.MachineReading makeResultsPrinters
INFORMAZIONI: Making result printers from 
set 02, 2021 10:46:03 AM edu.stanford.nlp.ie.machinereading.MachineReading makeResultsPrinters
INFORMAZIONI: Making result printers from edu.stanford.nlp.ie.machinereading.RelationExtractorResultsPrinter
set 02, 2021 10:46:03 AM edu.stanford.nlp.ie.machinereading.MachineReading makeResultsPrinters
INFORMAZIONI: Making result printers from 
set 02, 2021 10:46:03 AM edu.stanford.nlp.ie.machinereading.MachineReading loadOrMakeSerializedSentences
INFORMAZIONI: Parsing corpus sentences...
set 02, 2021 10:46:03 AM edu.stanford.nlp.ie.machinereading.MachineReading loadOrMakeSerializedSentences
INFORMAZIONI: These sentences will be serialized to /Users/irene/Documents/GitHub/Tirocinio3-git/Dataset/Final merge 4/Train/roth_sentences.ser
Exception in thread "main" java.io.IOException: java.lang.NullPointerException
    at edu.stanford.nlp.ie.machinereading.GenericDataSetReader.parse(GenericDataSetReader.java:141)
    at edu.stanford.nlp.ie.machinereading.MachineReading.loadOrMakeSerializedSentences(MachineReading.java:915)
    at edu.stanford.nlp.ie.machinereading.MachineReading.run(MachineReading.java:270)
    at edu.stanford.nlp.ie.machinereading.MachineReading.main(MachineReading.java:111)
Caused by: java.lang.NullPointerException
    at edu.stanford.nlp.ie.machinereading.domains.roth.RothCONLL04Reader.readSentence(RothCONLL04Reader.java:124)
    at edu.stanford.nlp.ie.machinereading.domains.roth.RothCONLL04Reader.read(RothCONLL04Reader.java:55)
    at edu.stanford.nlp.ie.machinereading.GenericDataSetReader.parse(GenericDataSetReader.java:139)
    ... 3 more

roth.properties and DatasetDemoTrain2.corp files are the same as before. I also check if my corp file is equal to conll04

AngledLuffa commented 3 years ago

I would need the data to diagnose this. If you're not comfortable sharing the whole thing, you could always find a small segment of it which reproduces this error or send it to me privately.

iresiragusa commented 3 years ago

Hi, In the meanwhile I found the problem: it was related with the .corp file. On creating it, I annotaded relations between entites not labeled as Peop, Loc, Org or Other but as O. The trainer module cannot recognize them as worthy to be part in a relation and gave me this error. I fixed it and finally it works! Thanks for your help and your time, have a nice day!

AngledLuffa commented 3 years ago

Would you mind clarifying that a bit? You had one of the rows with three colors labeled O instead of Other? Eg, one of these rows?

7 0 Live_In

Or did you have all of them labeled O instead of Other? Or something else entirely?

I've never looked at this code before, but it does sound like the error could use some clarification if all you got was a NullPointerException because something had an unexpected label, so I'd like to change that.

iresiragusa commented 3 years ago

The sample in which I found the error was this one:

3   O   0   O   IN  In  O   O   O
3   Other   1   O   CD  1958    O   O   O
3   Other   2   O   NNP Bishop  O   O   O
3   Peop    3   O   NNP Oxnam   O   O   O
3   O   4   O   VBD was O   O   O
3   O   5   O   JJ  successful  O   O   O
3   O   6   O   IN  in  O   O   O
3   O   7   O   VBG helping O   O   O
3   O   8   O   IN  to  O   O   O
3   O   9   O   VBN found   O   O   O
3   O   10  O   DT  the O   O   O
3   Org 11  O   NNP/IN/NNP/NNP  School/of/International/Service O   O   O
3   O   12  O   -LRB-   (   O   O   O
3   O   13  O   NNP SIS O   O   O
3   O   14  O   -RRB-   )   O   O   O
3   O   15  O   IN  at  O   O   O
3   Org 16  O   NNP/NNP American/University O   O   O
3   O   17  O   DT  the O   O   O
3   O   18  O   JJ  national    O   O   O
3   Other   19  O   JJ  Methodist   O   O   O
3   O   20  O   NN  university  O   O   O
3   O   21  O   IN  in  O   O   O
3   Loc 22  O   NNP Washington  O   O   O
3   Loc 23  O   NNP D.C O   O   O
3   O   24  O   .   .   O   O   O

11  13  alternate_name

As you can see, the 11th word is a org-type entity that is in a relationship of alternate_name with the 13th one. I think that the program, since the 13th is a O-type entity, gets a null pointer exception because it allows only relations among Loc, Org, Peop, Other. By adding a check in the program that generates the corpus, this relation wasn't annotated and the train went good.

I hope I was clear.

AngledLuffa commented 3 years ago

Thanks! That description made it very easy to replicate. We actually don't track line numbers in that code for now, but it's pretty straightforward to at least print out what the offending line was with more description of the problem than just NullPointerException

On Fri, Sep 3, 2021 at 12:45 AM irenesiragusa @.***> wrote:

The sample in which I found the error was this one:

3 O 0 O IN In O O O 3 Other 1 O CD 1958 O O O 3 Other 2 O NNP Bishop O O O 3 Peop 3 O NNP Oxnam O O O 3 O 4 O VBD was O O O 3 O 5 O JJ successful O O O 3 O 6 O IN in O O O 3 O 7 O VBG helping O O O 3 O 8 O IN to O O O 3 O 9 O VBN found O O O 3 O 10 O DT the O O O 3 Org 11 O NNP/IN/NNP/NNP School/of/International/Service O O O 3 O 12 O -LRB- ( O O O 3 O 13 O NNP SIS O O O 3 O 14 O -RRB- ) O O O 3 O 15 O IN at O O O 3 Org 16 O NNP/NNP American/University O O O 3 O 17 O DT the O O O 3 O 18 O JJ national O O O 3 Other 19 O JJ Methodist O O O 3 O 20 O NN university O O O 3 O 21 O IN in O O O 3 Loc 22 O NNP Washington O O O 3 Loc 23 O NNP D.C O O O 3 O 24 O . . O O O

11 13 alternate_name

As you can see, the 11th word is a org-type entity that is in a relationship of alternate_name with the 13th one. I think that the program, since the 13th is a O-type entity, gets a null pointer exception because it allows only relations among Loc, Org, Peop, Other. By adding a check in the program that generates the corpus, this relation wasn't annotated and the train went good.

I hope I was clear.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/stanfordnlp/CoreNLP/issues/1177#issuecomment-912331134, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA2AYWMJM7LONRZYTTHPIE3UAB4INANCNFSM5DADY45Q . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.