irit-melodi / attelo

discourse parser
GNU General Public License v3.0

targetless datapacks #18

Open kowey opened 9 years ago

kowey commented 9 years ago

Observation that @moreymat made: for hygienic reasons, we really should not expose the datapack targets during decoding time. This is actually a bit trickier to implement than I'd anticipated.

Sure we could slap in something like

    def targetless(self):
        '''
        Return a variant of the datapack in which the target has
        been set to None.

        This is information that really should not be visible
        during decoding time; it is removed here for hygienic purposes.

        :rtype: DataPack
        '''
        return DataPack(edus=self.edus,
                        pairings=self.pairings,
                        data=self.data,
                        target=None,
                        labels=self.labels)

The problem at the moment is the oracles. Our implementation of oracles is actually pretty crude. If the model is literally the string 'oracle', we return the datapack target… so the oracles for now need to see the targets.
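To make the conflict concrete, here is a minimal, self-contained sketch. The `DataPack` below is only a stand-in namedtuple with the same five fields as the snippet above, not attelo's actual class, and `decode` is a hypothetical function that mimics the "model is literally the string 'oracle'" shortcut; the real decoding code may look quite different.

```python
from collections import namedtuple

# Stand-in for attelo's DataPack: same field names as the snippet above,
# but none of the real class's validation or helpers (assumption).
_Fields = namedtuple('DataPack',
                     ['edus', 'pairings', 'data', 'target', 'labels'])


class DataPack(_Fields):
    __slots__ = ()

    def targetless(self):
        """Return a copy of this datapack with the target set to None."""
        return self._replace(target=None)


def decode(model, dpack):
    """Hypothetical decoder that mimics the crude oracle shortcut."""
    if model == 'oracle':
        # The oracle has nothing to go on except the targets themselves.
        if dpack.target is None:
            raise ValueError('oracle needs targets, but the datapack has none')
        return list(dpack.target)
    return model.predict(dpack.data)


dpack = DataPack(edus=['e1', 'e2'],
                 pairings=[('e1', 'e2'), ('e2', 'e1')],
                 data=[[0.1], [0.9]],
                 target=[1, 0],
                 labels=['UNRELATED', 'elaboration'])

# With targets exposed, the oracle simply echoes them back.
print(decode('oracle', dpack))

# With a hygienic (targetless) datapack, the same oracle cannot work.
try:
    decode('oracle', dpack.targetless())
except ValueError as err:
    print('targetless datapack breaks the oracle:', err)
```

So stripping the targets before decoding is easy for ordinary models, but the string-based oracle would need some other way to get at them.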

OK so in principle you could say that Oracles should really be learners like any other, implementing some sort of fit function that trivially memorises the target. That's fine except that you also need to introduce another layer of bureaucracy to your test harnesses that “learns” the oracle on the test data instead of the training data… yuck
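A rough sketch of what that could look like, assuming a scikit-learn-style `fit`/`predict` pair. `OracleLearner` and `evaluate` are hypothetical names made up for illustration, not part of attelo, and the `(pairings, targets)` tuples stand in for real datapacks.

```python
class OracleLearner:
    """Hypothetical oracle: 'fitting' just memorises pairing -> target."""

    def __init__(self):
        self._memo = {}

    def fit(self, pairings, targets):
        self._memo = dict(zip(pairings, targets))
        return self

    def predict(self, pairings):
        # Raises KeyError for any pairing that was not seen during fit.
        return [self._memo[p] for p in pairings]


def evaluate(learner, train, test):
    """Hypothetical harness step; train and test are (pairings, targets)."""
    train_pairings, train_targets = train
    test_pairings, test_targets = test
    if isinstance(learner, OracleLearner):
        # The extra bureaucracy: the oracle has to "learn" from the
        # test fold, unlike every other learner.
        learner.fit(test_pairings, test_targets)
    else:
        learner.fit(train_pairings, train_targets)
    # Decoding itself only ever sees the pairings, never the targets.
    predicted = learner.predict(test_pairings)
    correct = sum(p == t for p, t in zip(predicted, test_targets))
    return correct / len(test_targets)


train = ([('a', 'b'), ('b', 'a')], [1, 0])
test = ([('c', 'd'), ('d', 'c')], [0, 1])
print(evaluate(OracleLearner(), train, test))
```

The decoder stays clean here, but the harness now needs a special case for the oracle, which is the "yuck" part.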

So I'm not sure which is worse.

moreymat commented 9 years ago

I think we can live with the possibility of contamination for now. We can enforce the separation between data and targets later.

(In reply to Eric Kow's comment of 2015-03-20 9:23 GMT+01:00.)

kowey commented 9 years ago

Maybe one way to deal with this is to make target private by convention (._target). The oracles can grab them, but it's clear from the API that we Don't Approve otherwise.
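Something along these lines, as a sketch only. This `DataPack` is a simplified stand-in rather than the real attelo class, and `oracle_decode` is a hypothetical name.

```python
class DataPack:
    """Simplified stand-in for the real DataPack (assumption)."""

    def __init__(self, edus, pairings, data, target, labels):
        self.edus = edus
        self.pairings = pairings
        self.data = data
        self.labels = labels
        # Leading underscore: private by convention only. Python does
        # not enforce this; it just signals "hands off" to callers.
        self._target = target


def oracle_decode(dpack):
    """Hypothetical oracle: the one sanctioned reader of _target."""
    return list(dpack._target)


dpack = DataPack(edus=['e1', 'e2'],
                 pairings=[('e1', 'e2'), ('e2', 'e1')],
                 data=[[0.1], [0.9]],
                 target=[1, 0],
                 labels=['UNRELATED', 'elaboration'])

print(oracle_decode(dpack))
print(hasattr(dpack, 'target'))  # no public attribute to reach for
```

This doesn't prevent contamination, since any decoder can still read `_target`, but doing so is at least visibly deliberate.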