gregdurrett / berkeley-entity

The Berkeley Entity Resolution System jointly solves the problems of named entity recognition, coreference resolution, and entity linking with a feature-rich discriminative model.
GNU General Public License v3.0

Mention pair pruning #6

Open joecheriross opened 8 years ago

joecheriross commented 8 years ago

Hi Greg,

When prunedEdges(i)(j) is true, does that mean the mention pair (ith mention, jth mention) is ignored, i.e. excluded from further processing? I got confused when I printed the mention pairs after pruning.

code snippet

def printPrunedEdges(docGraphs: Seq[DocumentGraph]): Unit = {
  for (docGraph <- docGraphs) {
    println("PRUNED EDGES")
    // Walk every (j1, j2) cell and print the pairs flagged as pruned.
    for (j1 <- docGraph.prunedEdges.indices; j2 <- docGraph.prunedEdges(j1).indices) {
      if (docGraph.prunedEdges(j1)(j2)) {
        println(s"$j1 ${docGraph.getMention(j1).words}: $j2 ${docGraph.getMention(j2).words}")
      }
    }
  }
}

gregdurrett commented 8 years ago

Yes, that should be the case. Note that i is the index of the current mention and j is the index of the antecedent (so j <= i: j < i is an earlier mention, and j == i denotes the mention starting a new cluster).

Greg
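
To make the index convention concrete, here is a minimal sketch. It assumes that row i of the pruning matrix has i + 1 entries (one per candidate antecedent j in 0..i); the object and helper names are hypothetical and not part of berkeley-entity.

```scala
object PrunedEdgesSketch {
  // Hypothetical stand-in for DocumentGraph.prunedEdges.
  // Assumption: row i has one flag per candidate antecedent j in 0..i,
  // and j == i means "mention i starts a new cluster".
  def unprunedAntecedents(prunedEdges: Array[Array[Boolean]], i: Int): Seq[Int] =
    (0 to i).filter(j => !prunedEdges(i)(j))

  // Human-readable reading of a single (i, j) cell.
  def describe(i: Int, j: Int): String =
    if (j == i) s"mention $i starts a new cluster"
    else s"mention $i links back to antecedent $j"

  def main(args: Array[String]): Unit = {
    // Three mentions; only the edge from mention 2 back to mention 0 is pruned.
    val prunedEdges = Array(
      Array(false),
      Array(false, false),
      Array(true, false, false)
    )
    for (i <- prunedEdges.indices; j <- unprunedAntecedents(prunedEdges, i))
      println(describe(i, j))
  }
}
```

Under this reading, a true cell removes exactly one candidate antecedent for mention i; the mention itself stays in the document and can still pick any unpruned j.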

joecheriross commented 8 years ago

Thank you, Greg. One more question. The command line I am using has -pruningStrategy pointing to a corefprune model file. What does this mean? Is the pruning learned and stored as a model? For my purposes I am extending the distance pruning. Is this OK?

gregdurrett commented 8 years ago

Yes. That pruning strategy prunes according to the marginals of a pre-trained model. I mostly used it for pruning in more sophisticated settings like the full entity system. For coref-only work, I only use the basic distance pruning (which in practice doesn't prune at all), so that should be fine to extend.

Greg
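
The two strategies can be contrasted with a rough sketch. Everything below is an assumption made for illustration: the function names, the `maxDist` and `margin` parameters, and the rule of pruning edges that score more than `margin` below the best edge are hypothetical and are not taken from the berkeley-entity source.

```scala
object PruningStrategiesSketch {
  // Distance pruning (assumed form): prune antecedents more than maxDist
  // mentions back. The self edge (j == i) has distance 0, so it is kept for
  // any non-negative maxDist and a mention can always start a new cluster.
  def pruneByDistance(numMentions: Int, maxDist: Int): Array[Array[Boolean]] =
    Array.tabulate(numMentions)(i => Array.tabulate(i + 1)(j => i - j > maxDist))

  // Model-based pruning (assumed form): logMarginals(i)(j) is a pre-trained
  // model's log marginal for the edge i -> j. An edge is pruned when it falls
  // more than `margin` below the best edge for that mention.
  def pruneByMarginals(
      logMarginals: Array[Array[Double]],
      margin: Double): Array[Array[Boolean]] =
    logMarginals.map { row =>
      val best = row.max
      row.map(score => score < best - margin)
    }

  def main(args: Array[String]): Unit = {
    // A cutoff larger than the document prunes nothing.
    val loose = pruneByDistance(numMentions = 4, maxDist = 10000)
    println(loose.map(_.count(identity)).sum)
    // A tight cutoff prunes everything further than one mention back.
    println(pruneByDistance(4, 1).map(_.mkString(",")).mkString(" | "))
  }
}
```

With a cutoff larger than any document, the distance variant leaves every cell false, which is consistent with the remark that distance pruning does not prune in practice. Extending it means changing only the predicate inside the inner `tabulate`.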

joecheriross commented 8 years ago

Thanks Greg. I will do that.

Sharing one observation: while experimenting with the pretrained model on my test data, I found that many of the required mention pairs are getting pruned. I have not verified this thoroughly, but I am almost sure that this is happening.

Thanks, Joe

gregdurrett commented 8 years ago

There should be a line printed (starting with the word "Pruning", I think) that reports this. Many gold arcs are pruned, but the model is pretty good about not deleting every gold arc from a mention (that is, at least one gold arc should be preserved more than 90% of the time). And the preserved gold arcs are the ones that get picked anyway (e.g. close links for pronouns), so from the standpoint of the downstream model this is okay.

Greg
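
The distinction between "many gold arcs pruned" and "some gold arc preserved per mention" can be checked with a small sketch like the one below. It is not the system's own reporting code: `goldAntecedents` and both function names are hypothetical, and the pruning matrix is assumed to use the (i, j) layout described earlier in the thread.

```scala
object GoldArcRecallSketch {
  // goldAntecedents(i): indices j that are correct antecedents of mention i
  // (assumed to contain i itself when mention i starts a new cluster).

  // Fraction of mentions for which at least one gold arc survives pruning.
  def fractionWithSurvivingGold(
      prunedEdges: Array[Array[Boolean]],
      goldAntecedents: Array[Set[Int]]): Double =
    if (prunedEdges.isEmpty) 1.0
    else {
      val survived = prunedEdges.indices.count { i =>
        goldAntecedents(i).exists(j => !prunedEdges(i)(j))
      }
      survived.toDouble / prunedEdges.length
    }

  // Total number of individual gold arcs that were pruned.
  def prunedGoldArcs(
      prunedEdges: Array[Array[Boolean]],
      goldAntecedents: Array[Set[Int]]): Int =
    prunedEdges.indices.map { i =>
      goldAntecedents(i).count(j => prunedEdges(i)(j))
    }.sum

  def main(args: Array[String]): Unit = {
    val prunedEdges = Array(
      Array(false),
      Array(true, false),
      Array(true, false, false),
      Array(true, true, false, false)
    )
    val gold = Array(Set(0), Set(0), Set(0, 1), Set(3))
    println(prunedGoldArcs(prunedEdges, gold))
    println(fractionWithSurvivingGold(prunedEdges, gold))
  }
}
```

In the toy data, two of the five gold arcs are pruned, yet three of the four mentions still keep a gold arc. The second number is the one that matters for the downstream model.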

joecheriross commented 8 years ago

OK, got it. The point is that although many gold arcs get deleted, final accuracy is not much affected, since the essential ones are preserved.

Thanks, Joe
