Summarizers - Githubissues

MBrouns commented 10 years ago

This is a new version of #16.

I made a small test case with the classifier4j vector classifier. However, it looks like that classifier cannot be trained on multiple sentences, which makes it completely useless for us. I had already tried their Bayesian classifier earlier, but it returns essentially only 0.99 or 0.01 as the matching rating, so that is not very useful either.

I think we should drop classifier4j and use either the Stanford NLP classifier or MALLET instead.

By the way, if we are going to use machine learning to rate sentences, we have to take into account that we need to know which language the document is in. I think that for the prototype we can say that we only support English, but we should mention that in the presentation or somewhere similar.

MBrouns commented 10 years ago

@fabcouwer By the way, given your result in #15, can we just use Stanford NLP and remove the other methods?

fabcouwer commented 10 years ago

That's fine.

MBrouns commented 10 years ago

Based on this article (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.183.862&rep=rep1&type=pdf), we can distinguish a number of sentence features that make it more or less likely that a sentence is relevant for the summary. These features are:

Sentence length
Position of the sentence in the paragraph
Position of the sentence in the whole text
Similarity with the title
Similarity with keywords
Presence of proper nouns
Presence of anaphora

We should discuss tomorrow which of these we want to include and how we actually want to implement the classifier.
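
As a starting point for that discussion, a few of the features listed above could be computed roughly as follows. This is only a sketch under assumptions: SentenceFeatures and its method names are hypothetical and not part of the project, sentences are assumed to be tokenized already, and "similarity" is approximated by Jaccard overlap of lower-cased tokens, which is not necessarily the measure the article uses. Proper nouns and anaphora are left out because they need an NLP tagger.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of the project: computes a few of the
// sentence features from the list on pre-tokenized input.
class SentenceFeatures {

    // Sentence length, counted in tokens.
    static int length(String[] sentence) {
        return sentence.length;
    }

    // Relative position in [0, 1]: 0 for the first sentence, 1 for the last.
    // Usable both within a paragraph and within the whole text.
    static double relativePosition(int index, int total) {
        if (total <= 1) {
            return 0.0;
        }
        return (double) index / (total - 1);
    }

    // Jaccard overlap of the lower-cased token sets. This is an assumed
    // stand-in for "similarity with the title / keywords".
    static double similarity(String[] a, String[] b) {
        Set<String> setA = toLowerSet(a);
        Set<String> setB = toLowerSet(b);
        if (setA.isEmpty() && setB.isEmpty()) {
            return 0.0;
        }
        Set<String> intersection = new HashSet<>(setA);
        intersection.retainAll(setB);
        Set<String> union = new HashSet<>(setA);
        union.addAll(setB);
        return (double) intersection.size() / union.size();
    }

    private static Set<String> toLowerSet(String[] tokens) {
        Set<String> result = new HashSet<>();
        for (String token : tokens) {
            result.add(token.toLowerCase(Locale.ROOT));
        }
        return result;
    }

    public static void main(String[] args) {
        String[] title = {"Crowd", "summary", "tool"};
        String[] sentence = {"The", "tool", "builds", "a", "summary"};
        System.out.println("length = " + length(sentence));
        System.out.println("position = " + relativePosition(2, 5));
        System.out.println("title similarity = " + similarity(title, sentence));
    }
}
```

Each feature is a plain number, so the values could be fed to whichever classifier we end up choosing.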

fabcouwer commented 10 years ago

Rebased to master

MBrouns commented 10 years ago

@fabcouwer I tried to create a small test case using your tf-idf calculator, but I can't figure out how it's supposed to work. Could you help me out with it?

fabcouwer commented 10 years ago

Make a new TfIdf instance from a List of documents, each represented as a String[] in which every element is one term. The constructor also calculates the TF values and puts them in termFrequencies, which is a list with one entry per document, where each entry is a String->Double map for term->TF.
Call getTfIdfList to get a list of documents as String->Double maps for term->tf-idf.
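
To make the shape of that API concrete, here is a minimal self-contained sketch. It is not the project's TfIdf class: TfIdfSketch is a hypothetical stand-in, and the formulas (TF as count divided by document length, IDF as the natural log of document count over document frequency) are assumptions, since the thread does not say which variants are used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the TfIdf class described above; the real
// class may use different formulas and method signatures.
class TfIdfSketch {
    private final int documentCount;
    // One term->TF map per document, in the same order as the input list.
    private final List<Map<String, Double>> termFrequencies = new ArrayList<>();

    TfIdfSketch(List<String[]> documents) {
        this.documentCount = documents.size();
        for (String[] document : documents) {
            Map<String, Double> tf = new TreeMap<>();
            for (String term : document) {
                // Assumed TF: occurrences divided by document length.
                tf.merge(term, 1.0 / document.length, Double::sum);
            }
            termFrequencies.add(tf);
        }
    }

    // Number of documents that contain the term at least once.
    private int documentFrequency(String term) {
        int count = 0;
        for (Map<String, Double> tf : termFrequencies) {
            if (tf.containsKey(term)) {
                count++;
            }
        }
        return count;
    }

    // Returns one term->tf-idf map per document.
    List<Map<String, Double>> getTfIdfList() {
        List<Map<String, Double>> result = new ArrayList<>();
        for (Map<String, Double> tf : termFrequencies) {
            Map<String, Double> tfIdf = new TreeMap<>();
            for (Map.Entry<String, Double> entry : tf.entrySet()) {
                // Assumed IDF: ln(N / df). A term in every document scores 0.
                double idf = Math.log((double) documentCount / documentFrequency(entry.getKey()));
                tfIdf.put(entry.getKey(), entry.getValue() * idf);
            }
            result.add(tfIdf);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String[]> documents = new ArrayList<>();
        documents.add(new String[] {"hoi", "hoi1", "hoi2", "hoi2", "hoi3", "hoi4"});
        documents.add(new String[] {"hoi", "hoi2", "hoi5", "hoi5", "hoi6"});
        for (Map<String, Double> document : new TfIdfSketch(documents).getTfIdfList()) {
            System.out.println(document);
        }
    }
}
```

With these assumed formulas, terms that occur in every document get a score of 0, so only terms that distinguish a document end up as keyword candidates.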

MBrouns commented 10 years ago

In that case I'd expect this to work, right?

    List<String[]> allTerms = new ArrayList<String[]>();
    String[] anArray1 = {"hoi", "hoi1", "hoi2", "hoi2", "hoi3", "hoi4"};
    String[] anArray2 = {"hoi", "hoi2", "hoi5", "hoi5", "hoi6"};

    allTerms.add(anArray1);
    allTerms.add(anArray2);

    TfIdf tfidf = new TfIdf(allTerms);

    for (Map.Entry<String, Double> entry : tfidf.getTfIdfList(1).entrySet()) {
        String key = entry.getKey();
        double value = entry.getValue();

        System.out.println(key + " => " + value);
    }

I get this error:

Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
at java.util.ArrayList.rangeCheck(Unknown Source)
at java.util.ArrayList.get(Unknown Source)
at main.TfIdf.calculateTermFrequencies(TfIdf.java:77)
at main.TfIdf.<init>(TfIdf.java:23)
at tests.TfIdfTest.main(TfIdfTest.java:23)

fabcouwer commented 10 years ago

I think that's because the entries in termFrequencies aren't initialized.

Try adding this to the constructor (or to calculateTermFrequencies):

    for (int i = 0; i < allTermLists.size(); i++) {
        termFrequencies.add(i, new TreeMap<String, Double>());
    }
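
The reason this helps can be shown in isolation: calling get(i) on an ArrayList that has no elements yet throws IndexOutOfBoundsException, while a list pre-filled with one empty map per document can be indexed safely. The sketch below is hypothetical; TermFrequencyInit and emptyMaps are illustrative names, not code from the project.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical demo class, not part of the project.
class TermFrequencyInit {

    // Pre-fills the list with one empty map per document so that
    // get(i) is valid for every document index.
    static List<Map<String, Double>> emptyMaps(int documentCount) {
        List<Map<String, Double>> termFrequencies = new ArrayList<>();
        for (int i = 0; i < documentCount; i++) {
            termFrequencies.add(new TreeMap<String, Double>());
        }
        return termFrequencies;
    }

    public static void main(String[] args) {
        // Without initialization the list is empty, so get(0) fails.
        List<Map<String, Double>> uninitialized = new ArrayList<>();
        try {
            uninitialized.get(0);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("get(0) on an empty list fails: " + e.getMessage());
        }

        // With one map per document, each index can be read and written.
        List<Map<String, Double>> initialized = emptyMaps(2);
        initialized.get(0).put("hoi", 1.0);
        System.out.println(initialized);
    }
}
```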

MBrouns commented 10 years ago

Thanks! It still broke a bit in calculateIDF but I fixed it. I might get it in the summarizer for getting the keywords tonight, else it's gonna be tomorrow

fabcouwer commented 10 years ago

:+1:

MBrouns commented 10 years ago

rebased to master and generated jar

merging

karensg / crowd-summary

Summarizers #17