Feature extractors returning more than one feature with same name cause IndexOutOfBoundsException

google-code-export / dkpro-tc

Automatically exported from code.google.com/p/dkpro-tc

Other

1 stars 0 forks source link

Feature extractors returning more than one feature with same name cause IndexOutOfBoundsException #95

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago

This might not be the expected behaviour. We should check for this early and 
throw a better exception. One location to do so, is the Instance class. We 
might also re-implement the Feature class (is a TODO anyways) to specify that 
features are equal when they have the same name. That would make it easier, to 
check for Feature Extractors returning faulty feature lists.
Opinions?

Original issue reported on code.google.com by daxenber...@gmail.com on 21 Feb 2014 at 8:57

GoogleCodeExporter commented 9 years ago

We definitely need to check that early.
As users can write new FEs, and will usually have no overview about what 
feature names are used so far, we need to inform the user about that.

Other solutions, e.g. prefixing each feature name with the name of the FE etc. 
make the feature space much bigger. So I would not like to go in this direction.

Original comment by torsten....@gmail.com on 21 Feb 2014 at 9:07

GoogleCodeExporter commented 9 years ago

As Torsten pointed out, this problem may also arise across feature extractors. 
Prefixing the names of the resp. FEs wouldn't fully solve the problem, as one 
FE can return a list of Features (possibly named equally).

Original comment by daxenber...@gmail.com on 21 Feb 2014 at 9:13

GoogleCodeExporter commented 9 years ago

Why would prefixing the FE name make the feature space bigger? Wouldn't it only 
make the feature names longer?

Original comment by richard.eckart on 21 Feb 2014 at 9:17

GoogleCodeExporter commented 9 years ago

Prefixes don't increase the number of features but only the space occupied for 
their names (which should be insignificant with the latest version of the 
feature store). Still, prefixes don't solve the problem (see comment #2).

Original comment by daxenber...@gmail.com on 21 Feb 2014 at 9:33

GoogleCodeExporter commented 9 years ago

But they reduce the risk. I would suppose that the risk to have a single FE 
producing multiple features with the same name is much lower than having 
different FEs producing conflicting features. Different FE implementers may not 
know about each other, but within a single FE, it should be possible to handle 
this by convention. It would also be possible to do an outside sanity check 
when fetching the features from the FE. I think overriding equals() and 
hashcode() may be a bit heavy. Although, if you did that, it would be easy 
return a set instead of a list.

Original comment by richard.eckart on 21 Feb 2014 at 9:52

GoogleCodeExporter commented 9 years ago

There might also be some problems when using ARFFs with very long feature names 
(not to speak about readability).

Regarding bigger feature space: I meant it makes the ARFF bigger, but that 
might be neglectable as the header is only stored once anyway.

Original comment by torsten....@gmail.com on 21 Feb 2014 at 10:11

GoogleCodeExporter commented 9 years ago

Why are we returning a list of features instead a set of features anyway?

We need to preserve feature ordering with the new feature space, but we could 
also ensure that by ordering the names alphabetically.

Original comment by torsten....@gmail.com on 21 Feb 2014 at 10:13

GoogleCodeExporter commented 9 years ago

Maintaining order in the store sounds like a good idea.

Original comment by richard.eckart on 21 Feb 2014 at 10:18

GoogleCodeExporter commented 9 years ago

Original comment by torsten....@gmail.com on 25 Apr 2014 at 8:25

Changed state: Started

GoogleCodeExporter commented 9 years ago

Using treeset in feature store instead of relying on implicit ordering in list.
Throw an exception when the same feature name is used twice.

This should solve the issue, but actually throws the exception pretty late.
Catching this earlier would involve changing the signature of feature 
extractors, which I do not want to do at the moment.

Leaving the issue open for now, as this is not fully resolved.

Original comment by torsten....@gmail.com on 25 Apr 2014 at 8:34

Changed state: Accepted

GoogleCodeExporter commented 9 years ago

Original comment by daxenber...@gmail.com on 4 Jun 2014 at 3:46

Added labels: Milestone-Release0.7.0

GoogleCodeExporter commented 9 years ago

Original comment by daxenber...@gmail.com on 29 Aug 2014 at 10:11

Added labels: Priority-Low, Milestone-Release0.8.0
Removed labels: Priority-Medium, Milestone-Release0.7.0