kowey opened this issue 9 years ago
Oh dear! Small demonstrator. Indeed:

Orange considers two variables (e.g. in two different data tables) the same if they have the same descriptor. It is allowed - but not recommended - to have different descriptors with the same name.

(from the Orange manual)

So as far as I can tell, it's wrong, but the consequences of this aren't too too serious:

- in the corner case where your attachment classifier tells you everything or nothing is attached, you'll be picking bogus probabilities (those assigned to one of the relation labels due to conflation of the CLASS variables); but this is really on account of the numerical indexing of the prob dist (ie. picking out attach_prob[1] instead of attach_prob['True'], ie. our fault; sketched below)
- classifiers will slosh some probability mass over to the unrelated values: relation labels for attachment, and attachment T/F decisions for labelling
- sometimes, we may think that the relation label for a pair of instances is 'True' or 'False', and not e.g. 'Elaboration'

It's worth fixing still; my current thought is that we have to introduce a new config item for 'RelLabel' (which defaults to that of Label if not present) to pull the two variables apart.
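To make the fragile-vs-safe lookup concrete, here is a minimal sketch in scikit-learn terms rather than Orange's API (the data and variable names are invented for illustration); the point carries over either way: index the probability distribution by class label, never by position.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy attachment data: two features, a binary attach decision.
# Feature values and labels are invented for illustration.
X = np.array([[0.1, 1.0], [0.9, 0.2], [0.8, 0.1], [0.2, 0.9]])
y = np.array(['False', 'True', 'True', 'False'])

clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(X[:1])[0]  # one row of class probabilities

# Fragile: assumes 'True' sits at index 1, which only holds when the
# label set is exactly {'False', 'True'} in that sorted order.
attach_prob_fragile = proba[1]

# Safe: find the column by label; clf.classes_ gives the class order
# that the columns of predict_proba follow.
attach_prob_safe = proba[list(clf.classes_).index('True')]
```

With the conflated CLASS variable described in this issue, the label set would also contain relation labels, so index 1 need not be 'True' at all; the lookup by label sidesteps that entirely.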
"Orange considered harmful" seems to be our major finding insofar. On 30 Dec 2014 13:31, "Eric Kow" notifications@github.com wrote:
And actually, I discovered this entirely by accident while looking through our code trying to work out what the de-Orangeing refactor might look like.
More seriously, this is a critical trap laid by Orange. I had stumbled upon this indirect way of retrieving prediction scores and was worried it could trick us in unexpected ways. Here we are.
"Orange considered harmful" seems to be our major finding insofar. On 30 Dec 2014 13:31, "Eric Kow" notifications@github.com wrote:
So as far as I can tell, it's wrong, but the consequences of this aren't too too serious:
- in the corner case where your attachment classifier tells you everything or nothing is attached, you'll be picking bogus probabilities (that assigned to one of the relation labels due to conflation of the CLASS variables); but this is really on account of the numerical indexing of the prob dist (ie. picking out attach_prob[1] instead of attach_prob['True'])
- classifiers will slosh some probability mass over to the unrelated values, relation labels for attach, and attachment T/F decisions for labelling
- sometimes, we may think that the relation label for a pair of instances is 'True' or 'False', and not eg. 'Elaboration'
It's worth fixing still, my current thought being that we have to introduce a new config item for 'RelLabel' (which defaults to that of Label if not present) to pull the two variables apart.
— Reply to this email directly or view it on GitHub https://github.com/kowey/attelo/issues/12#issuecomment-68352756.
And more accurately: this way of retrieving prediction scores is standard among machine learning libraries, but it feels weirdly low-level in the Orange context. Hard to pinpoint and highly subjective, but this sort of impression undermines my confidence that I (we) are using the library correctly.
Not entirely sure what you mean when you say “this way of retrieving prediction scores”…
The prediction of a classifier for a sample can be output as a list of scores that correspond element-wise to the list of class labels.
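Concretely, and again sketched with scikit-learn rather than Orange (invented data, but the element-wise score/label correspondence is the real API contract there):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Invented relation-labelling data: two features, two relation labels.
X = np.array([[0.0, 1.0], [1.0, 0.0], [0.2, 0.8], [0.9, 0.3]])
y = np.array(['Elaboration', 'Narration', 'Elaboration', 'Narration'])

clf = GaussianNB().fit(X, y)
scores = clf.predict_proba([[0.4, 0.6]])[0]

# The i-th score corresponds to the i-th entry of clf.classes_:
for label, score in zip(clf.classes_, scores):
    print(label, score)
```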
I'm getting the impression that variables in Orange are global. For example, if our attachment table has CLASS = {True, False} and our relations table has CLASS = {Elaboration, Narration, ...}, querying Orange about the CLASS variable for either table will yield the union of the two (CLASS = {True, False, Elaboration, Narration}). Moreover, our code assumes that it can look up values by numerical index (True = 1), and seems to be doing this to pick out probabilities… which is bad if, for example, you happen to fall on a dataset with only True or only False… or that plus relations, in which case looking up index 1 might get you the probability associated with some relation label like "Narration".
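To spell the failure mode out without leaning on Orange internals, here is a plain-Python illustration (the value lists and probabilities are invented; none of this is Orange API):

```python
# Plain-Python illustration of the conflation described above.
attach_values = ['False', 'True']
relation_values = ['Elaboration', 'Narration']

# If Orange treats the two CLASS variables as one, the effective value
# list becomes their union:
merged_values = attach_values + relation_values
# -> ['False', 'True', 'Elaboration', 'Narration']

# A probability distribution from the attachment classifier is now a
# vector over all four values, with some mass "sloshed" around:
prob_dist = [0.05, 0.85, 0.07, 0.03]

# Positional lookup still happens to work here...
print(prob_dist[1])  # 0.85, the 'True' column -- by luck

# ...but if the dataset contained only 'False' plus relation labels,
# the merged list could be ['False', 'Elaboration', 'Narration'] and
# index 1 would be the probability of 'Elaboration'. Looking the value
# up by name is robust either way:
print(prob_dist[merged_values.index('True')])
```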