juditacs / semeval

MathLing Budapest Team's repo
MIT License

Refactor #44

Closed juditacs closed 9 years ago

juditacs commented 9 years ago

General

  1. The old code is kept because not all functions are implemented yet. It will be removed later.
  2. Please do not break lines to conform with PEP8 until I finish the code, because it makes debugging a lot harder. I'll do it later.

Missing stuff

  3. penalties
  4. WordNet boost
  5. similarities other than Jaccard/Dice
  6. regression support
  7. acronyms, compounds, hunspell (do we need hunspell?)

What you can do

  8. Implement the machine similarity (or any other similarity). It should inherit from BaseSimilarity and override its word_sim method. The rest (caching etc.) is up to you.
  9. Write configs: see configs/twitter.cfg
  10. Test and see whether the results are better than random, i.e. whether the correlation is significantly higher than 0.
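
A minimal sketch of what item 8 could look like, under stated assumptions: the `BaseSimilarity` class below is a hypothetical stand-in that only mirrors the one fact given above (subclasses override `word_sim`), and `CharNgramJaccard`, its `n` parameter and the `#` padding are illustrative names and choices, not the repo's actual code. Caching is done with `functools.lru_cache`, which is just one of several possible options.

```python
from functools import lru_cache


class BaseSimilarity:
    """Hypothetical stand-in for the repo's BaseSimilarity; the real interface may differ."""

    def word_sim(self, word1, word2):
        raise NotImplementedError


class CharNgramJaccard(BaseSimilarity):
    """Illustrative similarity: Jaccard overlap of padded character n-grams."""

    def __init__(self, n=4):
        self.n = n

    def _ngrams(self, word):
        # Pad with '#' so that short words still yield n-grams (an assumption).
        pad = "#" * (self.n - 1)
        padded = pad + word.lower() + pad
        return frozenset(padded[i:i + self.n]
                         for i in range(len(padded) - self.n + 1))

    @lru_cache(maxsize=100000)
    def word_sim(self, word1, word2):
        a, b = self._ngrams(word1), self._ngrams(word2)
        if not a and not b:
            # Two empty n-gram sets are treated as identical.
            return 1.0
        return len(a & b) / len(a | b)


if __name__ == "__main__":
    sim = CharNgramJaccard(n=4)
    print(sim.word_sim("color", "colour"))
```

Any other similarity would follow the same shape: keep the constructor for configuration and put the actual comparison in `word_sim`.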

recski commented 9 years ago

Can you please also add a README with instructions on how to run it, including how our submitted outputs (at least the one for Task 1 that didn't use machines) can be recreated?

juditacs commented 9 years ago

The new code does not yet produce exactly the same results, because of the missing features. I added a new section to the README with a very brief description.

recski commented 9 years ago

(new_machine)recski@nessi6:~/sandbox/semeval$ cat semeval_data/sts_test/test_task2a/STS.input.headlines.txt | python semeval/paraphrases.py -c configs/twitter.cfg > out
Traceback (most recent call last):
  File "semeval/paraphrases.py", line 33, in <module>
    main()
  File "semeval/paraphrases.py", line 26, in main
    pairs = reader.read_sentences()
  File "/home/recski/sandbox/semeval/semeval/read_and_enrich.py", line 21, in read_sentences
    s1 = self.enricher.add_sentence(sen1, tags1)
  File "/home/recski/sandbox/semeval/semeval/read_and_enrich.py", line 69, in add_sentence
    tokens = self.tokenize_and_tag(sentence, tags)
  File "/home/recski/sandbox/semeval/semeval/read_and_enrich.py", line 100, in tokenize_and_tag
    self.tag_tokens(tokens)
  File "/home/recski/sandbox/semeval/semeval/read_and_enrich.py", line 138, in tag_tokens
    pos_tags = self.hunpos.tag(words)
AttributeError: 'Enricher' object has no attribute 'hunpos'

juditacs commented 9 years ago

Yes, I forgot to mention that there are two tagging modes: simple and sts. The simple mode parses the tags from the Twitter input, while the sts mode uses hunpos and the NLTK NE chunker. Please change the option to sts and the encoding to latin1. I also added a check so that, instead of failing, the code adds dummy tags when hunpos is not enabled (that is, when the simple tagging mode is used).
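
The fallback described above could look roughly like the following sketch. It is not the actual change: this `Enricher` is a simplified, hypothetical stand-in, the `DUMMY_TAG` value is an assumption, and the tagger is only assumed to expose `tag(words)` returning `(word, tag)` pairs, as NLTK taggers do.

```python
class Enricher:
    """Simplified, hypothetical stand-in for the Enricher in read_and_enrich.py."""

    DUMMY_TAG = "UNK"  # assumed placeholder; the real dummy tag may differ

    def __init__(self, hunpos=None):
        # hunpos stays None in the simple tagging mode; in sts mode it is
        # expected to be a tagger object with a tag(words) method.
        self.hunpos = hunpos

    def tag_tokens(self, words):
        if self.hunpos is None:
            # No tagger configured: return dummy tags instead of raising
            # AttributeError.
            return [(word, self.DUMMY_TAG) for word in words]
        return self.hunpos.tag(words)


if __name__ == "__main__":
    print(Enricher().tag_tokens(["Budapest", "wins"]))
```

Setting the attribute to `None` in the constructor means it always exists, so a missing tagger shows up as a branch in `tag_tokens` and not as an `AttributeError` deep in the call stack.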

recski commented 9 years ago

I made the two changes, and I also changed the value of ngrams to 4 so that I could reproduce the bare n-gram similarity version on some STS data, and yet:

cat semeval_data/sts_test/test_task2a/STS.input.headlines.txt | python src/align_and_penalize.py --sim-type none --batch > headlines_old.out
test_evaluation_task2a/correlation-noconfidence.pl test_evaluation_task2a/STS.gs.headlines.txt headlines_old.out
Pearson: 0.79843

cat semeval_data/sts_test/test_task2a/STS.input.headlines.txt | python semeval/paraphrases.py -c configs/sts.cfg > headlines_new.out
test_evaluation_task2a/correlation-noconfidence.pl test_evaluation_task2a/STS.gs.headlines.txt headlines_new.out
Pearson: 0.59869
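
For quick checks without the Perl script, the Pearson correlation between the gold standard and an output file can be approximated with a few lines of Python. This is only a sketch: `read_scores` and `pearson` are hypothetical helper names, and the parser assumes that both files contain exactly one numeric score per line, which may not hold for every STS file.

```python
import math


def read_scores(lines):
    """Parse one float per line (assumed layout of the gold and output files)."""
    return [float(line) for line in lines]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    import sys
    # Usage (paths are supplied by the caller): python pearson.py gold.txt system.out
    with open(sys.argv[1]) as gold_file, open(sys.argv[2]) as system_file:
        print("Pearson: %.5f" % pearson(read_scores(gold_file),
                                        read_scores(system_file)))
```

Running it on the old and new outputs against the same gold file gives a direct comparison of the two pipelines.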