irit-melodi / attelo

discourse parser
GNU General Public License v3.0

possible variable mutability bug #12

Open kowey opened 9 years ago

kowey commented 9 years ago

I'm getting the impression that variables in Orange are global. For example, if our attachment table has CLASS = {True, False}, and our relations table has CLASS = {Elaboration, Narration...}, querying Orange about the CLASS variable for either table will yield the union of the two (CLASS = {True, False, Elaboration, Narration}).

Moreover, our code assumes that it can look up values by numerical index (True = 1), and seems to be doing this to pick out probabilities. That is bad if, for example, you happen to fall on a dataset with only True or only False, or one of those plus relations, in which case looking up index 1 might get you the probability associated with some relation label like "Narration".
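
To make the failure mode concrete, here is a minimal sketch of the indexing problem in plain Python (the names class_values, prob_dist and attach_prob_* are illustrative stand-ins, not attelo or Orange identifiers):

```python
# Illustrative only: plain-Python stand-ins for a class variable's value
# list and a classifier's probability distribution over those values.

# Suppose the dataset happens to contain only "True" attachments and the
# CLASS variable has been conflated with the relation labels:
class_values = ["True", "Elaboration", "Narration"]

# The classifier's scores line up element-wise with class_values.
prob_dist = [0.7, 0.2, 0.1]

# Fragile: assumes "True" always sits at index 1. Here it silently
# returns the score for the relation label "Elaboration".
attach_prob_fragile = prob_dist[1]

# Safer: find the position of the value we actually want.
attach_prob_safe = prob_dist[class_values.index("True")]

print(attach_prob_fragile, attach_prob_safe)  # 0.2 vs 0.7
```

As the comments below suggest, Orange's own distribution objects can apparently also be indexed directly by value name (attach_prob['True']), which sidesteps the positional assumption altogether.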

kowey commented 9 years ago

Oh dear! A small demonstrator (gist linked below) confirms it. Indeed:

Orange considers two variables (e.g. in two different data tables) the same if they have the same descriptor. It is allowed - but not recommended - to have different descriptors with the same name.

(from the Orange manual)

gist

kowey commented 9 years ago

So as far as I can tell, it's wrong, but the consequences aren't too serious:

  • in the corner case where your attachment classifier tells you everything or nothing is attached, you'll be picking bogus probabilities (the probability assigned to one of the relation labels, due to conflation of the CLASS variables); but this is really on account of the numerical indexing of the prob dist (i.e. picking out attach_prob[1] instead of attach_prob['True'])
  • classifiers will slosh some probability mass over to the unrelated values: relation labels for attachment, and attachment T/F decisions for labelling
  • sometimes we may think that the relation label for a pair of instances is 'True' or 'False', and not e.g. 'Elaboration'

It's still worth fixing; my current thought is that we have to introduce a new config item, 'RelLabel' (defaulting to the value of Label if not present), to pull the two variables apart.
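
A sketch of what that fallback might look like, using Python's standard configparser; the section name "labels", the helper name and the overall shape are hypothetical, not attelo's actual configuration API:

```python
# Hypothetical sketch of the proposed RelLabel/Label fallback.
from configparser import ConfigParser

def label_config(config, section="labels"):
    """Return (attachment label variable, relation label variable),
    letting RelLabel default to Label when it is absent."""
    label = config.get(section, "Label")
    rel_label = config.get(section, "RelLabel", fallback=label)
    return label, rel_label

if __name__ == "__main__":
    cfg = ConfigParser()
    cfg.read_string("[labels]\nLabel = CLASS\nRelLabel = REL_CLASS\n")
    print(label_config(cfg))  # ('CLASS', 'REL_CLASS')
```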

moreymat commented 9 years ago

"Orange considered harmful" seems to be our major finding insofar. On 30 Dec 2014 13:31, "Eric Kow" notifications@github.com wrote:

kowey commented 9 years ago

And actually, I discovered this entirely by accident while looking through our code trying to work out what the de-Orangeing refactor might look like.

moreymat commented 9 years ago

More seriously, this is a critical trap laid by Orange. I had stumbled upon this indirect way to retrieve prediction scores and was worried it could trick us in unexpected ways. Here we are.

"Orange considered harmful" seems to be our major finding insofar. On 30 Dec 2014 13:31, "Eric Kow" notifications@github.com wrote:

So as far as I can tell, it's wrong, but the consequences of this aren't too too serious:

  • in the corner case where your attachment classifier tells you everything or nothing is attached, you'll be picking bogus probabilities (that assigned to one of the relation labels due to conflation of the CLASS variables); but this is really on account of the numerical indexing of the prob dist (ie. picking out attach_prob[1] instead of attach_prob['True'])
  • classifiers will slosh some probability mass over to the unrelated values, relation labels for attach, and attachment T/F decisions for labelling
  • sometimes, we may think that the relation label for a pair of instances is 'True' or 'False', and not eg. 'Elaboration'

It's worth fixing still, my current thought being that we have to introduce a new config item for 'RelLabel' (which defaults to that of Label if not present) to pull the two variables apart.

— Reply to this email directly or view it on GitHub https://github.com/kowey/attelo/issues/12#issuecomment-68352756.

moreymat commented 9 years ago

And more accurately, this way of retrieving prediction scores is standard among machine learning libraries, but it feels weirdly low-level in the Orange context. It's hard to pinpoint and highly subjective, but this sort of impression undermines my confidence in whether I (we) use the library correctly.

kowey commented 9 years ago

Not entirely sure what you mean when you say “this way of retrieving prediction scores”…

moreymat commented 9 years ago

The prediction of a classifier for a sample can be output as a list of scores that correspond element-wise to the list of class labels.
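
For comparison, this is the same convention scikit-learn follows: predict_proba returns one column of scores per entry of the classifier's classes_ attribute, so the robust way to get the score for a given label is to look up its column rather than hard-code an index. A minimal sketch on toy data (not attelo code):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy attachment data: two features, string labels "True"/"False".
X = np.array([[0.0, 1.0], [1.0, 0.0], [0.9, 0.1], [0.1, 0.8]])
y = np.array(["False", "True", "True", "False"])

clf = LogisticRegression().fit(X, y)

# Columns of predict_proba correspond element-wise to clf.classes_.
probs = clf.predict_proba(X)
true_col = list(clf.classes_).index("True")  # don't assume it's column 1
attach_prob = probs[:, true_col]
print(clf.classes_, attach_prob)
```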
