louismullie / treat

Natural language processing framework for Ruby.

Information extraction #7

Closed cryptomatictrader closed 12 years ago

cryptomatictrader commented 12 years ago

Thanks for the "Treat". I think the gem is the most comprehensive gem for NLP by far!

I'm trying to extract the keywords or topics given a sentence or paragraph. I read the "Topic Word Extraction" and "General Topic Extraction" in the link (https://github.com/louismullie/treat/wiki/Information-Extraction) and that's what I am looking for...

Questions:

Is there a way to perform "Topic Word Extraction" on a Paragraph? It looks like "topic_words" supports the Collection class only.

For "General Topic Extraction", is there a way to get the list of topic words after calling the "topics" method? The example shows how to export it to a DOT file, and I see some other examples using "print_tree". I am wondering if we can do something like to_a?

Thanks!

ezkl commented 12 years ago

With respect to the first question, @louismullie's topic word extraction process is using Latent Dirichlet allocation via @ealdent's lda-ruby gem. I believe LDA necessitates a collection of documents, though @louismullie and @ealdent are the domain experts here.

louismullie commented 12 years ago

I'm no expert on LDA, but it was also my understanding that it is only applicable to collections of documents, which is why I put this limitation in place. Maybe @ealdent can confirm I understood correctly.

Concerning general topic extraction, the "topics" method itself returns the array of topics! If you call it twice, the topics will only be calculated the first time. The same array will be returned the second time.
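The caching behaviour described above can be pictured with a minimal sketch in plain Ruby. The `FakeDocument` class and its `compute_topics` helper are hypothetical stand-ins, not Treat's actual implementation; the sketch only shows the memoization pattern, with a toy word filter in place of the real classifier.

```ruby
# Hypothetical stand-in for a Treat entity; not Treat's real code.
class FakeDocument
  attr_reader :computations

  def initialize(text)
    @text = text
    @computations = 0
  end

  # Compute the topics on the first call, then return the cached array.
  def topics
    @topics ||= compute_topics
  end

  private

  # Placeholder for the expensive topic computation (assumption):
  # here it simply keeps the unique words longer than seven letters.
  def compute_topics
    @computations += 1
    @text.downcase.scan(/[a-z]+/).select { |word| word.length > 7 }.uniq
  end
end

doc = FakeDocument.new('Michigan, Ohio (Reuters) - Unfortunately, the RadioShack is closing.')
first  = doc.topics
second = doc.topics
puts first.equal?(second) # true when both calls return the same Array object
puts doc.computations     # number of times the topics were actually computed
```

Since the result is an ordinary Array, no separate `to_a` call should be needed under this reading.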

cryptomatictrader commented 12 years ago

Thanks folks for the comments.

@louismullie, I don't quite follow. I tried the following but got an error. Am I missing something? Can you give me an example? Thanks

1.9.3-p194 :004 > s = Paragraph 'Michigan, Ohio (Reuters) - Unfortunately, the RadioShack is closing.'
 => Paragraph (13674560) --- "Michigan, Ohio (Reuters) [...] is closing." --- {} --- []
1.9.3-p194 :005 > s.topics
Treat::Exception: Method topics cannot be called on a paragraph.
    from /home/dev/.rvm/gems/ruby-1.9.3-p194/gems/treat-1.0.4/lib/treat/entities/entity.rb:107:in `rescue in method_missing'
    from /home/dev/.rvm/gems/ruby-1.9.3-p194/gems/treat-1.0.4/lib/treat/entities/entity.rb:104:in `method_missing'
    from (irb):5
    from /home/dev/.rvm/rubies/ruby-1.9.3-p194/bin/irb:16:in `<main>'

louismullie commented 12 years ago

Thanks for pointing this out - this is an outdated example I need to fix. Right now, topics() will only be callable on a Document.
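For readers wondering where the "cannot be called on a paragraph" error comes from, here is a rough sketch of a `method_missing` dispatcher that checks an entity's type against a table of allowed targets. Every name in it (`SketchEntity`, `SketchError`, the `TARGETS` table) is hypothetical; Treat's real dispatch lives in `entity.rb` and is not reproduced here.

```ruby
# Hypothetical sketch; class and constant names are illustrative only.
class SketchError < StandardError; end

class SketchEntity
  # Assumed table of which entity types each task may be called on.
  TARGETS = { topics: [:document] }.freeze

  # Derive a type symbol from the class name, e.g. SketchParagraph => :paragraph.
  def type
    self.class.name.sub('Sketch', '').downcase.to_sym
  end

  # Route unknown method calls through the target table.
  def method_missing(name, *args)
    targets = TARGETS[name]
    return super if targets.nil?
    unless targets.include?(type)
      raise SketchError, "Method #{name} cannot be called on a #{type}."
    end
    ['placeholder topic'] # stand-in for the real annotation
  end

  def respond_to_missing?(name, include_private = false)
    TARGETS.key?(name) || super
  end
end

class SketchDocument < SketchEntity; end
class SketchParagraph < SketchEntity; end

puts SketchDocument.new.topics.inspect
begin
  SketchParagraph.new.topics
rescue SketchError => e
  puts e.message
end
```

Under this design, widening the method to sections or zones would amount to adding more entries to the target list.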

Now, we can discuss this though, because I remember being unsure of my decision when I changed the target of topics() from Zone (e.g. Paragraph) to Document. At what level should that method be available? Only documents? Sections? Zones (Title, Paragraph, List, etc.)? All three?

louismullie commented 12 years ago

Also, I'm afraid that the current installer doesn't take care of downloading the required models for the topics() method. I'll open an issue for that. @calvinchso For now you could just download the models at http://louismullie.com/treat/reuters/ and place them in treat-1.0.4/models/reuters/. I just uploaded the files.

cryptomatictrader commented 12 years ago

It would be nice if topics() could be accessed at the level of documents, sections and paragraphs (and even phrases and sentences). My use case is to find out what topic(s) a Facebook update is about after a user posts it.

Also, just FYI: after I ran "gem install treat" and "Treat.install", I ran into the issue that the "engtagger" gem was not found, and I had to install the gem manually. It may be just my setup, or it may be something to be fixed at the installation level.

I will give it a try with the model files you put on the web. Does that mean topics() can only be called on the Document class, even with those model files?

Thanks

louismullie commented 12 years ago

1 - Concerning the engtagger gem, I am not getting this issue here - can somebody else confirm?

2 - Currently, yes, the topics() method will only be available on the Document class. Giving that method multiple targets raises another issue. When using #do(:topics) on an entity, that entity will be recursively searched for any of the targets, and each will be annotated with its topics. That means that if do(:topics) is called on a document, the document, each section and each paragraph would be annotated. I am not sure if this is acceptable behaviour. Thoughts on what should be the expected behaviour in that case? Anyways, you could do something like:

`document.each_paragraph { |p| p.topics }`

to get the annotation just on the paragraphs.

3 - Concerning your use case, I am not sure if the models I supplied (trained on Reuters news tickers) will be useful for what you want to accomplish. Facebook posts pose the special problem of internet spelling, and that might not work with the Reuters model. If I were you, I would try to create my own model. A gem that could be very useful for that is https://github.com/alexandru/stuff-classifier. It would be quite nice to have a wrapper for that under the AI::Classifiers module! Let me know if you are interested in contributing this.
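To make the multi-target concern from point 2 concrete, here is a small self-contained sketch of a recursive annotation pass over an entity tree. `SketchNode`, `annotate`, `annotated_types` and `build_tree` are hypothetical names chosen for illustration, not Treat's API; the block passed to `annotate` stands in for the real topic computation.

```ruby
# Hypothetical entity tree; not Treat's actual classes.
class SketchNode
  attr_reader :type, :children, :features

  def initialize(type, children = [])
    @type = type
    @children = children
    @features = {}
  end

  # Annotate this node and every descendant whose type is among the targets.
  def annotate(task, targets, &block)
    @features[task] = block.call(self) if targets.include?(type)
    children.each { |child| child.annotate(task, targets, &block) }
    self
  end

  # List the types of all nodes that carry the given annotation.
  def annotated_types(task)
    own = features.key?(task) ? [type] : []
    own + children.flat_map { |child| child.annotated_types(task) }
  end
end

# One document containing one section with two paragraphs.
def build_tree
  paragraphs = [SketchNode.new(:paragraph), SketchNode.new(:paragraph)]
  SketchNode.new(:document, [SketchNode.new(:section, paragraphs)])
end

# With several targets, every level of the tree ends up annotated.
all_levels = build_tree.annotate(:topics, [:document, :section, :paragraph]) do |node|
  ["topic for #{node.type}"]
end

# With a single target, only the paragraphs are annotated.
only_paragraphs = build_tree.annotate(:topics, [:paragraph]) do |node|
  ["topic for #{node.type}"]
end

p all_levels.annotated_types(:topics)
p only_paragraphs.annotated_types(:topics)
```

The second call plays the role of the `each_paragraph` workaround: the caller picks the level explicitly instead of having every matching level annotated.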

louismullie commented 12 years ago

@calvinchso Just committed a patch to download the Reuters models automatically on install: 977cb2203deb07566f0090669233ea084ce30309.

louismullie commented 12 years ago

@calvinchso In the latest version, you can now get keywords on a document, section or zone: 36e698eda38587aa7c9b666908d5bcf86818eb8f, 36b4afe53bd3971b4e27151b2d41e157e95094ec, c5eb65692da86f837f8c857096b625a58592aa80.

ravengit commented 12 years ago

Cool. Will give it a try!

louismullie commented 12 years ago

I take back what I said: I forgot to include one of the commits in my latest gem push. It should be fixed by the end of the day.

louismullie commented 12 years ago

Fixed in latest version 1.0.6. Sorry again for the confusion.

Cheers guys, Louis