Problems with length of oracle output

tbrodbeck commented 4 years ago

The example command in the description is not working:

$ extoracle source.txt target.txt -method greedy -output oracle.txt

Traceback (most recent call last):
  File "/Users/tillmann/dev/textSum/.direnv/python-3.7.3/bin/extoracle", line 10, in <module>
    sys.exit(main())
  File "/Users/tillmann/dev/textSum/.direnv/python-3.7.3/lib/python3.7/site-packages/extoracle/bin/cmd.py", line 34, in main
    n_thread=args.n_thread)
  File "/Users/tillmann/dev/textSum/.direnv/python-3.7.3/lib/python3.7/site-packages/extoracle/extoracle.py", line 82, in from_files
    "Argument [summary_length, length_oracle] "
ValueError: Argument [summary_length, length_oracle] cannot be both None/False

Also, my outputs are very long, even when I specify -length 1 or -length_oracle. Furthermore, I am not sure whether this works with languages other than English, because in that case it even returned double the number of sentences in the source.

$ pip freeze | grep extoracle
extoracle==0.1

pltrdy commented 4 years ago

Thanks for pointing this out.

First, you're right: one must set either -length or -length_oracle; I will update the README accordingly. There's no consideration of language; it just compares words (i.e. virtually anything separated by whitespace). However, it does look for sentences, which should be terminated by a period «.».
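
A quick way to check whether a file has periods as separate tokens is to count, per line, the tokens that are exactly «.». This is only a sketch: `count_period_tokens` is a hypothetical helper, not part of extoracle, and it only mirrors the whitespace splitting described above rather than the actual implementation:

```shell
# Hypothetical helper (not part of extoracle): for each line of the given file,
# print how many whitespace-separated tokens are exactly ".".
count_period_tokens() {
    awk '{
        n = 0
        for (i = 1; i <= NF; i++) {
            if ($i == ".") n++
        }
        print n
    }' "$1"
}
```

Running `count_period_tokens sourceTest.txt` prints one number per line of the file. A line that prints 0 has no standalone period, which suggests the text still needs its punctuation separated from the words.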

If you set -length 1, each output should be at most 1 sentence long. Unless your text has no «.», it should be fine. Could you share an extract so I can reproduce it?

tbrodbeck commented 4 years ago

Sure, thanks for your help. I just created this minimal example with the following files: sourceTest.txt and targetTest.txt. The following commands

extoracle sourceTest.txt targetTest.txt -length 1 -output oracleTest.txt
extoracle sourceTest.txt targetTest.txt -length_oracle -output oracleTest.txt

both give the same output (basically just a copy of sourceTest.txt): oracleTest.txt

pltrdy commented 4 years ago

Alright, the thing is that your texts are not tokenized. It is quite common to tokenize text before NLP tasks so that it is easier to process. In particular, tokenization adds whitespace around punctuation etc., so that it is easy to distinguish from words. E.g. your source becomes:

this actually happened a couple of years ago . i grew up in germany where i went to a german secondary school that went from 5th to 13th grade -LRB- we still had 13 grades then , they have since changed that -RRB- . my school was named after anne frank and we had a club that i was very active in from 9th grade on , which was dedicated to teaching incoming 5th graders about anne franks life , discrimination , anti-semitism , hitler , the third reich and that whole spiel . basically a day where the students ' classes are cancelled and instead we give them an interactive history and social studies class with lots of activities and games . this was my last year at school and i already had a lot of experience doing these project days with the kids . i was running the thing with a friend , so it was just the two of us and 30-something 5th graders . we start off with a brief introduction and brainstorming : what do they know about anne frank and the third reich ? you 'd be surprised how much they know . anyway after the brainstorming we do a few activities , and then we take a short break . after the break we split the class into two groups to make it easier to handle . one group watches a short movie about anne frank while the other gets a tour through our poster presentation that our student group has been perfecting over the years . then the groups switch . i 'm in the classroom to show my group the movie and i take attendance to make sure no one decided to run away during break . i 'm going down the list when i come to the name sandra -LRB- name changed -RRB- . a kid with a boyish haircut and a somewhat deeper voice , wearing clothes from the boy 's section at a big clothing chain in germany , pipes up .
now keep in mind , these are all 11 year olds , they are all pre-pubescent , their bodies are not yet showing any sex specific features one would be able to see while they are fully clothed.
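
To get a rough feel for what this whitespace padding does, here is a crude sed sketch. It is only an approximation and not the PTB tokenizer: it would wrongly split decimals and abbreviations, and it does not produce tokens such as -LRB- or 'd. `pad_punctuation` is a made-up name for illustration:

```shell
# Crude illustration only (NOT the Stanford PTB tokenizer):
# read text on stdin, put spaces around common punctuation marks,
# then collapse repeated spaces and trim the ends of each line.
pad_punctuation() {
    sed -E 's/([.,:;?!])/ \1 /g; s/ +/ /g; s/^ //; s/ $//'
}

echo "we take a short break. then the groups switch." | pad_punctuation
```

For real data a proper tokenizer is the better choice, since it handles the cases this sketch gets wrong.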

I suggest using the Stanford CoreNLP tokenizer, e.g.:

# Get Stanford CoreNLP
mkdir ~/corenlp
cd ~/corenlp
wget http://nlp.stanford.edu/software/stanford-corenlp-latest.zip
unzip stanford-corenlp-latest.zip

# replace X.Y.Z with your version number
export CLASSPATH=$HOME/corenlp/stanford-corenlp-X.Y.Z.jar

tokenize(){
    # Tokenize function using Stanford PTB Tokenizer
    path="$1"
    java edu.stanford.nlp.process.PTBTokenizer -ioFileList -preserveLines \
        < "$path" \
        > "$path.tok"
}

tokenize "sourceTest.txt"
tokenize "targetTest.txt"

# Finally run extoracle on tokenized text
extoracle sourceTest.txt.tok targetTest.txt.tok -length_oracle -output oracleTest.txt.tok

tbrodbeck commented 4 years ago

Thank you! It was not clear to me that the text had to be tokenized in that manner. You could think about adding this to the README as well 👍

As a remark, my shell did not like that I changed the $path variable, so I had to change the function to this one:

tokenize(){
    # Tokenize function using Stanford PTB Tokenizer
    java edu.stanford.nlp.process.PTBTokenizer -ioFileList -preserveLines \
        < "$1" \
        > "$1.tok"
}
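
A likely cause, assuming the shell is zsh (the default on recent macOS): there, `path` is a special array tied to `PATH`, so assigning to it inside the function overwrites the command search path and `java` can no longer be found. If a named variable is preferred over `$1`, a sketch with a differently named local variable could look like this (`input_file` is an arbitrary name; the java flags are copied unchanged from the function above):

```shell
tokenize() {
    # Tokenize function using Stanford PTB Tokenizer.
    # "input_file" is an arbitrary name chosen to avoid zsh's special "path"
    # parameter; "local" keeps it from leaking outside the function.
    local input_file="$1"
    java edu.stanford.nlp.process.PTBTokenizer -ioFileList -preserveLines \
        < "$input_file" \
        > "$input_file.tok"
}
```

It is called the same way as before, e.g. `tokenize "sourceTest.txt"`, and writes `sourceTest.txt.tok` next to the input.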

pltrdy / extoracle_summarization

Problems with length of oracle output #1