karensg / crowd-summary

Crowd Summary Tool

Summarizers #13

Closed MBrouns closed 10 years ago

MBrouns commented 10 years ago

I made a small start on implementing the tf-idf for getting a list of most important words in the document. @fabcouwer do you think you can further expand this implementation?

I've taken a quick look at the classifiers that classifier4j offers, and it seems that the VectorClassifier should be usable for ranking the quality of sentences that are to be put into the database. The problem with this method is that you cannot incrementally add training data, which means that the entire training has to be redone every single time. However, its performance is better than that of their normal Bayes classifier.

Implementing the classifier seems like quite a bit of work, so maybe it's best if we take a look at it together next week. In the meantime, though, it's quite important to get some training data so we can actually start using it once the implementation is ready. I'm not sure if @fabcouwer has time for this, but maybe @timsweep can join him?

fabcouwer commented 10 years ago

There are some undefined methods in Utilities like getMostFrequentWords and getWordFrequency. Are these the ones yet to be implemented or am I missing something?

MBrouns commented 10 years ago

Is the classifier4j jar path correctly set in your classpath?


fabcouwer commented 10 years ago

It is. The missing methods are being looked up in javax.swing.text.Utilities, though.

fabcouwer commented 10 years ago

Ah, I needed to import net.sf.classifier4J.Utilities. Never mind!

MBrouns commented 10 years ago

Oh, something I remembered, by the way: we don't really need our own implementation for the tf. I think we can use Utilities.getWordFrequency for that. We can then iterate over that list and calculate the idf for those terms. The advantage is that it has a default stop word ignore list, so the list will be a lot smaller than all the words in the document. That could shave some time off the calculation.
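A rough sketch of that idea in plain Java, for illustration only. It does not call classifier4j: `wordFrequency` is a hypothetical stand-in for Utilities.getWordFrequency (assumed here to yield a word-to-count map), `STOP_WORDS` is a made-up mini list rather than the library's default one, and the idf variant used is the plain ln(N / df), which is an assumption and not necessarily what the project ended up with.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class TfIdfSketch {

    // Hypothetical mini stop word list; classifier4j ships its own default list.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "is", "of", "the", "to"));

    // Stand-in for Utilities.getWordFrequency: lower-cases the text,
    // splits on non-alphanumerics and counts every non-stop word.
    public static Map<String, Integer> wordFrequency(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue;
            }
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    // Scores each non-stop word of the document as tf * ln(N / df),
    // where N is the corpus size and df the number of corpus documents
    // containing the word.
    public static Map<String, Double> tfIdf(String document, List<String> corpus) {
        // Collect the term set of every corpus document once.
        List<Set<String>> corpusTerms = new ArrayList<>();
        for (String doc : corpus) {
            corpusTerms.add(wordFrequency(doc).keySet());
        }

        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Integer> entry : wordFrequency(document).entrySet()) {
            int df = 0;
            for (Set<String> terms : corpusTerms) {
                if (terms.contains(entry.getKey())) {
                    df++;
                }
            }
            if (df == 0) {
                continue; // word never seen in the corpus: skip to avoid dividing by zero
            }
            double idf = Math.log((double) corpus.size() / df);
            scores.put(entry.getKey(), entry.getValue() * idf);
        }
        return scores;
    }

    // Returns up to n words, highest score first; ties are broken alphabetically.
    public static List<String> topWords(Map<String, Double> scores, int n) {
        List<String> words = new ArrayList<>(scores.keySet());
        words.sort((a, b) -> {
            int byScore = Double.compare(scores.get(b), scores.get(a));
            return byScore != 0 ? byScore : a.compareTo(b);
        });
        return new ArrayList<>(words.subList(0, Math.min(n, words.size())));
    }

    public static void main(String[] args) {
        List<String> corpus = Arrays.asList("cat cat dog", "the cat and a bird", "cat fish");
        Map<String, Double> scores = tfIdf(corpus.get(0), corpus);
        System.out.println(topWords(scores, 2));
    }
}
```

Because only the words that survive the stop word filter are iterated, the idf loop runs over a much shorter list than the full token stream, which is the time saving mentioned above.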


fabcouwer commented 10 years ago

:+1: I'll check it out

fabcouwer commented 10 years ago

I improved the TF/IDF calculation, but a lot of it is still clunky. I expect I'll have time to look at it on Tuesday, otherwise we'll just continue working on it on Wednesday.