gregdurrett / berkeley-entity

The Berkeley Entity Resolution System jointly solves the problems of named entity recognition, coreference resolution, and entity linking with a feature-rich discriminative model.
GNU General Public License v3.0

Can the training be done with gold_conll files alone? #4

Closed joecheriross closed 8 years ago

joecheriross commented 8 years ago

Hi,

I was trying to train on the OntoNotes training data (*gold_conll) that I have. When I point the train path at this data, the system asks for auto_conll files as well. Can the training be done with gold_conll files only? Correct me if I have misunderstood something.

Thanks, Joe

"Loading -1 docs from /home/joe/music_ontology/MusOntoLearning/ground_truth/ontonotes/train/ ending with auto_conll"

gregdurrett commented 8 years ago

Hi Joe,

Not sure exactly what mode of the system you're running. If you're running coref only, then the command-line argument you want is -corefDocSuffix "gold_conll"

The reason it defaults to auto_conll is that this is more standard for coref evaluation, but it'll work fine with gold as well.

Greg

joecheriross commented 8 years ago

Thanks Greg.

I think the 'corefDocSuffix' option is not there. I tried 'docSuffix', but that did not help. So I got all the auto_conll files corresponding to the gold_conll files and tried again. Training now runs, but it does not pick up the test file from the test path. It raises an iterator exception because it cannot find any test file. What filename format is expected for a test conll file? I tried filenames ending with both 'auto_conll' and 'gold_conll'.

Sorry to bother you.

gregdurrett commented 8 years ago

Hi Joe,

-corefDocSuffix was added in a newer commit; before that, -docSuffix did double duty for it, so that was the right thing to try. The suffix should be the same for train and test, so I'm not sure what the problem is. Can you send me the exact command you're running?

Greg

joecheriross commented 8 years ago

Hi Greg,

~/java-9-oracle/bin/java -Xmx8g -jar berkeley-entity-1.0.jar ++config/base.conf -execDir scratch -mode COREF_TRAIN_PREDICT -testPath /tmp/test/ -trainPath ./train/ -modelPath "models/joint-onto.ser.gz" -wikipediaPath "models/wiki-db-onto.ser.gz" -useGoldMentions -pruningStrategy build:models/cached/corefpruner-onto.ser.gz:-5:5 -nerPruningStrategy build:models/cached/nerpruner-onto.ser.gz:-9:5 -outputPath /tmp/test_output/

After adding auto_conll files to the trainPath alongside the gold_conll files, training runs. But the test files are not being picked up.

"Loading -1 docs from /tmp/test_output/ ending with ERROR: java.util.NoSuchElementException: next on empty iterator: scala.collection.Iterator$$anon$2.next(Iterator.scala:39) scala.collection.Iterator$$anon$2.next(Iterator.scala:37) scala.collection.IndexedSeqLike$Elements.next(IndexedSeq "

Thanks, Joe

joecheriross commented 8 years ago

Hi Greg,

I was able to solve this. Thanks for your directions. I thought the suffix did not need to be given for train-and-predict, but it does. A small suggestion: it would be good to raise an exception when no suffix is provided as a command-line argument. That does not happen in COREF_TRAIN_PREDICT mode.
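For illustration, a fail-fast check along the lines of this suggestion could look like the sketch below. It is not code from berkeley-entity: SuffixCheck and docsWithSuffix are hypothetical names, and the sketch only assumes that documents are selected by filename suffix, as the "Loading -1 docs from ... ending with ..." log lines suggest. It rejects an empty suffix and an empty match set up front instead of failing later with "next on empty iterator".

```scala
import java.io.File

object SuffixCheck {
  // Hypothetical helper (not part of berkeley-entity): returns the files in
  // dirPath whose names end with suffix, sorted by name. Throws
  // IllegalArgumentException on an empty suffix, a non-directory path,
  // or an empty result.
  def docsWithSuffix(dirPath: String, suffix: String): Seq[File] = {
    require(suffix.nonEmpty, s"No document suffix given for $dirPath")
    val dir = new File(dirPath)
    require(dir.isDirectory, s"Not a directory: $dirPath")
    // listFiles() can return null on an I/O error, so wrap it in Option
    val entries = Option(dir.listFiles()).getOrElse(Array.empty[File]).toSeq
    val matches = entries.filter(f => f.isFile && f.getName.endsWith(suffix)).sortBy(_.getName)
    require(matches.nonEmpty, s"No files ending with '$suffix' in $dirPath")
    matches
  }
}
```

Calling SuffixCheck.docsWithSuffix on both the train and test directories before training starts would surface a missing or mismatched suffix immediately.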

Also, I am trying to do some pruning of mention-pair formation in the testing phase. Can you please point me to the file and function that I will have to edit for this?

Thanks, Joe

gregdurrett commented 8 years ago

Hi Joe,

Glad you were able to solve it!

CorefPruner controls pruning over mention pairs. In the runTrain method in CorefSystem.scala, we call:

CorefPruner.buildPruner(Driver.pruningStrategy).pruneAll(trainDocGraphs);

and an analogous call for test time in prepareTestDocuments. I would suggest subclassing CorefPruner appropriately and then building it from the passed in string argument. Right now we run the same pruning for train and test time, but you could fix this by adding a boolean flag to pruneAll indicating which phase it is.
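A rough, self-contained sketch of the phase-flag idea follows. Everything in it is a hypothetical stand-in: MentionPair, DocGraphSketch, PrunerSketch, PhaseAwarePruner, PrunerFactory, and the "phase:<train>:<test>" strategy string are made up for illustration and do not reflect the real CorefPruner or document-graph interfaces, which should be checked in the source before adapting this.

```scala
// Hypothetical stand-ins; the real classes in berkeley-entity differ.
final case class MentionPair(current: Int, antecedent: Int, score: Double)

final class DocGraphSketch(val pairs: Seq[MentionPair]) {
  private var pruned: Set[(Int, Int)] = Set.empty
  def prune(current: Int, antecedent: Int): Unit = pruned += ((current, antecedent))
  def isPruned(current: Int, antecedent: Int): Boolean = pruned.contains((current, antecedent))
}

trait PrunerSketch {
  // isTest is the boolean phase flag: false during training, true at test time
  def pruneAll(docs: Seq[DocGraphSketch], isTest: Boolean): Unit
}

// Prunes every pair scoring below a cutoff, with a separate cutoff per phase
class PhaseAwarePruner(trainCutoff: Double, testCutoff: Double) extends PrunerSketch {
  override def pruneAll(docs: Seq[DocGraphSketch], isTest: Boolean): Unit = {
    val cutoff = if (isTest) testCutoff else trainCutoff
    for (doc <- docs; p <- doc.pairs if p.score < cutoff) doc.prune(p.current, p.antecedent)
  }
}

object PrunerFactory {
  // Builds a pruner from a made-up strategy string such as "phase:-5:-2"
  def build(strategy: String): PrunerSketch = strategy.split(":") match {
    case Array("phase", train, test) => new PhaseAwarePruner(train.toDouble, test.toDouble)
    case _ => throw new IllegalArgumentException(s"Unrecognized pruning strategy: $strategy")
  }
}
```

With this shape, the training call site would pass isTest = false and the test-time call site would pass isTest = true, so the two phases can diverge without duplicating the pruner.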

Greg

joecheriross commented 8 years ago

Thank you Greg. I will try that.

Thanks, Joe
