ufal / nametag

NameTag: Named Entity Tagger
Mozilla Public License 2.0

Why can't two words have the same Brown cluster representation? #5

Closed curusarn closed 7 years ago

curusarn commented 7 years ago

When I run train_ner with the BrownClusters feature enabled, I get the following output:

Loading train data: done, 8158 sentences
Loading heldout data: done, 899 sentences
Parsing feature templates: Form '0000000' is present twice in Brown cluster file 'clusters/cs_brown_1000'!
Cannot initialize feature template sentence processor 'BrownClusters' from line 'BrownClusters/2 clusters/cs_brown_1000' of feature templates file!

Why exactly can't form '0000000' be present twice?
It seems like an unnecessary limitation. As far as I know, all words with the same prefix belong to one cluster, so any bits after the chosen prefix are irrelevant. (E.g., with a prefix of length 20, any bits after the 20th are irrelevant.) Am I missing something?

Best regards. Simon Let
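
The prefix argument in the question can be illustrated with a small sketch. It only shows prefix truncation of bit strings and is not NameTag's code; the words, bit strings, and the helper name `cluster_prefix` are made up for the example.

```python
def cluster_prefix(bits, length):
    """Keep at most `length` leading bits of a Brown cluster bit string."""
    return bits[:length]


# Hypothetical words and bit strings, for illustration only.
clusters = {"dog": "0110100", "cat": "0110111", "run": "1010"}

# Group words by the first 4 bits of their cluster.
by_prefix = {}
for word, bits in clusters.items():
    by_prefix.setdefault(cluster_prefix(bits, 4), []).append(word)
```

Here "dog" and "cat" land in the same group because their bit strings share the first four bits. This covers many words sharing one cluster, which is a different situation from a single form being listed on several lines.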

foxik commented 7 years ago

The error means that NameTag thinks the word 0000000 is in multiple clusters (and fails because it does not know which cluster to use).

The input file for the BrownClusters feature should contain lines of the form cluster<tab>lemma. Do you perhaps have it reversed (i.e., lemma<tab>cluster)? The "form 0000000" looks more like a cluster.

If you really have one form present multiple times in the file, you have to decide which one to use yourself.
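
A minimal sketch of how one might check a cluster file for this problem and swap reversed columns. It assumes tab-separated lines and the cluster<tab>lemma order described above; the function names and sample data are hypothetical and not part of NameTag.

```python
from collections import defaultdict


def find_duplicate_forms(lines, form_column=1):
    """Return {form: [clusters]} for forms that occur on more than one line.

    `form_column` is the index of the column holding the form
    (1 for the cluster<tab>lemma layout); the other of the first
    two columns is treated as the cluster.
    """
    clusters_by_form = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        clusters_by_form[fields[form_column]].append(fields[1 - form_column])
    return {form: cl for form, cl in clusters_by_form.items() if len(cl) > 1}


def swap_columns(lines):
    """Swap the first two tab-separated columns of every non-empty line."""
    swapped = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        first, second = line.split("\t")[:2]
        swapped.append(f"{second}\t{first}")
    return swapped


# Made-up sample in the reversed (lemma<tab>cluster) order.
reversed_lines = ["the\t0000000", "a\t0000000", "dog\t0110"]
duplicates = find_duplicate_forms(reversed_lines)
fixed_lines = swap_columns(reversed_lines)
```

Read as cluster<tab>lemma, the reversed sample makes "0000000" look like a form that occurs twice, which matches the reported error. After swapping the columns, no form is repeated.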

curusarn commented 7 years ago

The BrownClusters feature file was indeed reversed (lemma<tab>cluster). It works as expected now.

Thanks for your time.