martiansideofthemoon / style-transfer-paraphrase

Official code and data repository for our EMNLP 2020 long paper "Reformulating Unsupervised Style Transfer as Paraphrase Generation" (https://arxiv.org/abs/2010.05700).
http://style.cs.umass.edu
MIT License

Fluency classifier issue #44

Open martiansideofthemoon opened 2 years ago

martiansideofthemoon commented 2 years ago

Thanks to Anubhav Jangra for reporting this ---

Email #1

I am unable to replicate the fluency and style accuracy scores. Here are a few numbers I'm getting right now -

Fluency score for AAE Tweets - 10.77 (reported as 56.4 in paper)
Fluency score for Bible - 8.11 (reported as 87.5 in paper)
Fluency score for Poetry - 4.22 (reported as 87.5 in paper)
Fluency score for COHA-1810 - 12.33 (reported as 87.5 in paper)
Fluency score for COHA-1890 - 21.16 (reported as 87.5 in paper)
Fluency score for COHA-1990 - 24.04 (reported as 87.5 in paper)

Just to clarify, I'm using the following command to get the fluency score:

python style_paraphrase/evaluation/scripts/acceptability.py --input datasets/bible/test.input0.txt

(FYI - I also tried getting results for bible/train.input0.txt; it gave a score of 7%.)
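In case it helps with debugging, this is the kind of minimal check I've been running to probe the fluency (CoLA acceptability) classifier directly, outside the evaluation script. It assumes the shared cola-classifier directory is a standard fairseq RoBERTa checkpoint exposing a sentence_classification_head; the directory, checkpoint name and data path below are placeholders, not the exact names from the repo.

```python
# Minimal sketch (not the repo's evaluation code): score a few sentences
# with the CoLA acceptability classifier, assuming it is a fairseq RoBERTa
# checkpoint with a "sentence_classification_head".
from fairseq.models.roberta import RobertaModel

# Placeholder paths -- substitute the directory shared on Google Drive.
roberta = RobertaModel.from_pretrained(
    "cola_classifier",
    checkpoint_file="checkpoint_best.pt",
    data_name_or_path="cola-bin",
)
roberta.eval()

sentences = [
    "And God said, Let there be light: and there was light.",
    "and god said let there be light and there was light",
]

for sent in sentences:
    tokens = roberta.encode(sent)
    # predict() returns log-probabilities over the head's labels; argmax gives
    # the predicted label index (which index means "acceptable" depends on
    # how the classifier was trained).
    pred = roberta.predict("sentence_classification_head", tokens).argmax().item()
    print(pred, sent)
```

Comparing predictions on punctuated vs. punctuation-stripped versions of the same sentences should show whether the classifier itself is sensitive to formatting, or whether the gap comes from preprocessing inside the evaluation script.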

Also, I got some weird results for style accuracy -

Style Accuracy of "aae/test.input0.txt" against "bible" style - 86.22%
Style Accuracy of "aae/test.input0.txt" against "romantic-poetry" style - 5.15%
Style Accuracy of "aae/test.input0.txt" against "aae" style - 1.48% (reported as 87.6% in paper)

Also, I've checked the paths for the trained cds-classifier and cola-classifier directories, and they contain the same content as the ones you shared on Google Drive. (I currently suspect these models might be the issue, but I'm not sure.)

Can you tell me what the reason could be? I'd like to replicate the results in the paper before going ahead with other experiments.

Emails #2 & #3

I got datasets/bible/test.input0.txt by running datasets/bpe2text.py over the datasets/bible/test.input0.bpe file. No, I'm not passing the BPEs; I'm passing the text to the script directly.
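For completeness, this is roughly what we understand the BPE-to-text conversion to be doing. The snippet below is our own sketch, not the repo's datasets/bpe2text.py, and it assumes the .bpe files contain space-separated standard GPT-2 BPE token ids.

```python
# Rough sketch (not the repo's datasets/bpe2text.py): decode a file of
# space-separated GPT-2 BPE token ids back into plain text.
# Assumes the standard GPT-2 vocabulary was used for encoding.
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

with open("datasets/bible/test.input0.bpe") as fin, \
     open("datasets/bible/test.input0.txt", "w") as fout:
    for line in fin:
        ids = [int(tok) for tok in line.split()]
        fout.write(tokenizer.decode(ids).strip() + "\n")
```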

So after fiddling with the fluency script a bit, we found that removing punctuation from the input text increases the score. Hence we were wondering if we are missing some explicit preprocessing step on the text before it gets converted to BPE inside the evaluation script. (Currently we are feeding in the bpe2text version of the input0.bpe files.)
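To make the question concrete, the preprocessing we suspect is missing is detokenization, i.e. re-attaching punctuation to the preceding word before the text is BPE-encoded and scored. A rough sketch of what we have in mind (our own guess, not something the repo documents), using sacremoses:

```python
# Hypothetical preprocessing sketch: detokenize text that has spaces
# before punctuation (e.g. "light , and") before scoring fluency.
# This is our guess at the missing step, not a documented requirement.
from sacremoses import MosesDetokenizer

detok = MosesDetokenizer(lang="en")

line = "and god said , let there be light : and there was light ."
clean = detok.detokenize(line.split())
print(clean)  # e.g. "and god said, let there be light: and there was light."
```

If the classifier was trained on detokenized sentences, feeding it tokenized text with spaces around punctuation could plausibly explain the low scores we are seeing.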