
Automatically exported from code.google.com/p/airhead-research

token filter and accepted words conflict #100

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Use a token filter that contains some set of words
2. Use an accepted word list that contains a disjoint set of words
3. Run any semantic space main class and see that the disjoint accepted words do 
not get represented.

What is the expected output? What do you see instead?
Since we apply token filtering to all token streams before any token is 
selected for representation in a word space, it's impossible to represent words 
that are not also counted as features.  For instance, if you wanted to 
represent the top 1M words in Wikipedia using the top 100k words as features, 
you would only be able to represent the top 100k words and would be forced to 
ignore the remaining 900k.  

In order to fix this, we'd likely have to heavily restructure the token 
filtering process.  One option is to have the token filter handle only compound 
words, and then, if an included word list is given, pass a precomputed basis 
mapping to the SemanticSpace algorithm.  The SemanticSpace algorithm could then 
query the basis mapping for a valid word feature and use only words that have 
non-negative mappings.  It could also query the accepted word set.  With this, 
every SemanticSpace would see every token in the document, even tokens that 
have been filtered out, but would have enough information to discard them.
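
As a rough sketch of that flow (every class, field, and method name below is an 
illustrative stand-in, not the existing S-Space API), the space would check each 
token against both structures:

    import java.util.*;

    /**
     * Illustrative sketch only: a space that sees every token, uses a
     * precomputed basis mapping to decide which tokens count as features,
     * and a separate accepted-word set to decide which tokens get vectors.
     */
    public class FilteredSpaceSketch {

        // Hypothetical stand-ins: word -> feature dimension (indices are
        // assumed to be 0..featureBasis.size()-1), and the set of words
        // that should receive a representation.
        private final Map<String, Integer> featureBasis;
        private final Set<String> acceptedWords;
        private final Map<String, double[]> vectors = new HashMap<>();

        public FilteredSpaceSketch(Map<String, Integer> featureBasis,
                                   Set<String> acceptedWords) {
            this.featureBasis = featureBasis;
            this.acceptedWords = acceptedWords;
        }

        /** Every token in the document is seen; filtering happens locally. */
        public void processDocument(List<String> tokens) {
            for (int i = 0; i < tokens.size(); ++i) {
                String focus = tokens.get(i);
                // Only accepted words get (and keep) a representation.
                if (!acceptedWords.contains(focus))
                    continue;
                double[] vec = vectors.computeIfAbsent(
                        focus, w -> new double[featureBasis.size()]);
                // Neighbors contribute only if they map to a non-negative
                // feature dimension (window of +/-2 chosen arbitrarily).
                for (int j = Math.max(0, i - 2);
                         j <= Math.min(tokens.size() - 1, i + 2); ++j) {
                    if (j == i)
                        continue;
                    int dim = featureBasis.getOrDefault(tokens.get(j), -1);
                    if (dim >= 0)
                        vec[dim] += 1.0;
                }
            }
        }
    }

Here the basis mapping decides which neighbors count as feature dimensions, 
while the accepted set decides which focus words ever get a vector at all.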

The biggest complication with this method would be compound features, such as 
word-{order,pos,relation} style features, which append some added information 
to the base word form.  This would require one basis mapping for validating the 
base word form and another basis mapping that actually represents the full 
feature form, possibly consuming quite a bit of RAM.
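
For instance, a word-order feature such as "cat+1" could be resolved in two 
steps: first validating the base form, then assigning a dimension to the full 
feature string.  Again, only a hypothetical sketch, not an existing class:

    import java.util.*;

    /** Sketch of a two-level lookup for compound (e.g. word-order) features. */
    class CompoundFeatureBasis {
        // Base word forms that are allowed to act as features.
        private final Set<String> validBaseForms;
        // Full feature strings (e.g. "cat+1") -> assigned dimension.
        private final Map<String, Integer> featureDims = new HashMap<>();

        CompoundFeatureBasis(Set<String> validBaseForms) {
            this.validBaseForms = validBaseForms;
        }

        /**
         * Returns the dimension for a compound feature, or -1 if its base
         * form is not an allowed feature word.
         */
        int getDimension(String baseForm, String fullFeature) {
            if (!validBaseForms.contains(baseForm))
                return -1;
            return featureDims.computeIfAbsent(fullFeature, f -> featureDims.size());
        }
    }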

Original issue reported on code.google.com by FozzietheBeat@gmail.com on 25 Aug 2011 at 8:15

GoogleCodeExporter commented 9 years ago
So the core issue is that any SemanticSpace needs information on:
  1. what tokens it should count as features
  2. what tokens it should keep representations for

The BasisMapping provides the first one, and the Filterable interface supports 
the second.  However, we don't have any way of building an input representation 
to the SemanticSpace that supports both.  Is that the case?
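
One hypothetical shape for such an input representation, simply bundling both 
answers alongside the token stream (names are made up for illustration, 
mirroring what BasisMapping and Filterable already provide):

    /**
     * Hypothetical bundled view of a document: the space asks one object both
     * questions above instead of consulting two unrelated sources.
     */
    interface FilteredTokenStream extends Iterable<String> {
        /** Dimension to use when this token occurs as a feature, or -1. */
        int featureDimension(String token);

        /** Whether this token should keep a representation in the space. */
        boolean isRepresented(String token);
    }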

Could we roll this into the SemanticSpace Document API change, where a Document 
would now handle the tokenization?  Also, how would this work for the 
dependency-tree based SemanticSpace implementations, or is this a co-occurrence 
based issue only?

Original comment by David.Ju...@gmail.com on 25 Aug 2011 at 9:21

GoogleCodeExporter commented 9 years ago
Correct.  We do have the tools to make this work, and giving both to the 
SemanticSpaces seems like a reasonable solution to me.  I don't think we need a 
unified way of handling both of these issues, though having one would always be 
awesome, as long as they both get handled in a reasonable manner.

I do think that this would be a pretty large overhaul, and it fits in nicely 
with the API change.  For the dependency tree models, we'd still need the 
Filterable feature to handle which words need a representation, and I think we 
could pass the feature list as a DependencyPathAcceptor, i.e. one that accepts 
only things in the feature list, and then let the dependency tree models do 
their mapping however they wish with those accepted paths.  
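
A minimal sketch of that kind of acceptor, simplified to a single accepts 
method over the word forms on a path rather than the actual 
DependencyPathAcceptor signature:

    import java.util.List;
    import java.util.Set;

    /**
     * Simplified path acceptor: a dependency path is accepted only if every
     * word along it is in the feature list.
     */
    class FeatureListPathAcceptor {
        private final Set<String> featureWords;

        FeatureListPathAcceptor(Set<String> featureWords) {
            this.featureWords = featureWords;
        }

        /** The path is represented here simply as the word forms along it. */
        boolean accepts(List<String> pathWords) {
            for (String w : pathWords)
                if (!featureWords.contains(w))
                    return false;
            return true;
        }
    }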

Original comment by FozzietheBeat@gmail.com on 25 Aug 2011 at 9:39