senarvi / theanolm

TheanoLM is a recurrent neural network language modeling tool implemented using Theano
Apache License 2.0

supporting multi word classes #15

Closed pallavi0335 closed 5 years ago

pallavi0335 commented 8 years ago

Hello

I have a word class file in SRILM format. A class can contain multiple words on one line, since SRILM supports multi-word classes. For example: PER (prob) Barack Obama. When I use TheanoLM, the Vocabulary.py file throws the error "4 fields on one line of vocabulary file". Could TheanoLM support multiple words per line in the class file, like SRILM does?

senarvi commented 8 years ago

I have never encountered that in a class file. What should the program do when there are multiple words on the same line? Add both words to the same class with the same probability, or something else?

pallavi0335 commented 8 years ago

Yes, I have a specific case: I wanted to create classes of DBpedia entities, so I assigned Barack Obama, which is an entity in DBpedia, to the class PER. After classifying, I used the uniform classes function of SRILM to assign uniform probabilities to each class and entity.

senarvi commented 8 years ago

If you want to have Barack Obama as an entity, then you would also need Barack Obama in the vocabulary. You could achieve that by replacing all the occurrences of Barack Obama with Barack_Obama etc.

pallavi0335 commented 8 years ago

In my text data file, I have sentences that refer to Barack Obama, for example "Barack Obama and George Bush to speak Tuesday at Dallas memorial service for fallen police officers". There are three entities here, two persons and one location, so I tried using the classes of these entities as the vocabulary for the language model.

senarvi commented 8 years ago

You could have the classes in a file like this:

PER 0.1 Barack_Obama
PER 0.1 George_Bush
LOC 0.2 Dallas
LOC 0.2 Seattle
....

Then replace spaces in entities with an underscore, so that your text file would look like this:

Barack_Obama and George_Bush to speak Tuesday at Dallas memorial service for fallen police officers

pallavi0335 commented 8 years ago

No, this won't work, because the test file won't be in this format, and the purpose is to identify multi-word entities in the test data as well.

senarvi commented 8 years ago

You would replace spaces with underscores in all the occurrences of the multi-word entities in the test data as well, as a preprocessing step. That would be the easiest way to achieve essentially the same thing as TheanoLM recognizing those entities and regarding them as a single vocabulary item.
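
The preprocessing step described above could be sketched roughly as follows. This is a minimal illustration, not part of TheanoLM or SRILM: the function names `join_class_line`, `build_entity_pattern` and `join_entities` are hypothetical, and the sketch assumes that every class file line has the form "class probability word word ..." and that entities appear in the text as whitespace-separated tokens with identical spelling and casing.

```python
import re


def join_class_line(line):
    """Turn 'PER 0.1 Barack Obama' into 'PER 0.1 Barack_Obama'.

    Assumption: the line holds a class name, a probability, and then the
    words of exactly one entity.
    """
    fields = line.split()
    if len(fields) < 3:
        raise ValueError("expected class, probability and a word: %r" % line)
    return " ".join(fields[:2] + ["_".join(fields[2:])])


def build_entity_pattern(entities):
    """Compile a regex that matches any of the multi-word entities."""
    # Try longer entities first so they win over shorter ones that are
    # a prefix of them.
    ordered = sorted(entities, key=len, reverse=True)
    alternatives = [
        r"\s+".join(re.escape(word) for word in entity.split())
        for entity in ordered
    ]
    # The lookarounds require whitespace (or the string boundary) on both
    # sides, so only whole tokens are matched.
    return re.compile(r"(?<!\S)(?:" + "|".join(alternatives) + r")(?!\S)")


def join_entities(text, pattern):
    """Replace the spaces inside every matched entity with underscores."""
    return pattern.sub(lambda m: "_".join(m.group(0).split()), text)


pattern = build_entity_pattern(["Barack Obama", "George Bush"])
sentence = ("Barack Obama and George Bush to speak Tuesday at Dallas "
            "memorial service for fallen police officers")
print(join_class_line("PER 0.1 Barack Obama"))
print(join_entities(sentence, pattern))
```

The same `join_entities` call would be applied to the training and test text alike, so that both use the underscore-joined tokens. An entity attached to punctuation (for example "Barack Obama's") is not matched by this sketch and would need tokenization first.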