martiansideofthemoon / style-transfer-paraphrase

Official code and data repository for our EMNLP 2020 long paper "Reformulating Unsupervised Style Transfer as Paraphrase Generation" (https://arxiv.org/abs/2010.05700).
http://style.cs.umass.edu
MIT License

Preprocess/filter custom data #46

Closed by l0rn0r 2 years ago

l0rn0r commented 2 years ago

Hi, thanks for your work! I'm trying to use your method to transfer contemporary German text to the style of the Swiss author Jeremias Gotthelf (19th century). I'm at the first step, training the paraphraser. At the moment I have 386k backtranslated TED talk sentences (English translated to German with T5).

Now I want to filter the backtranslated corpus, and reading #38 gave me a first idea. But there are some points I don't understand yet. I'll try to describe what I have understood so far:

In Appendix A.1 of your paper you describe the filtering steps. To get the data ready to run with parse_paranmt_postporcess.py, I have to write my own script:

Open questions are:

Sorry for the long issue 😄 Thanks for your help in advance!

martiansideofthemoon commented 2 years ago

Hi @l0rn0r, thanks for your interest in our work and for the detailed issue describing the points of confusion.

tmp1 and tmp2 are benepar constituency parses of the sentences, as you can see in this file. ed_scores are some kind of edit distance scores between parses.

Most importantly, none of tmp1, tmp2, or ed_scores was used for filtering the data. We only used f1_scores, kendall_tau_scores, langid, and sentence lengths. So please ignore these fields, and I'm sorry for the confusion they may have caused.
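As a rough illustration of how those four fields could drive a filter, here is a minimal sketch. It is not the repository's script: `PairStats`, `keep_pair`, and every threshold value are hypothetical placeholders, and the direction of each comparison (keep pairs with low overlap and low Kendall tau) is an assumption about what "more diverse" means here. Check Appendix A.1 for the actual criteria.

```python
from dataclasses import dataclass


@dataclass
class PairStats:
    """Precomputed statistics for one (source, paraphrase) pair (hypothetical layout)."""
    f1_score: float      # unigram overlap between the two sentences
    kendall_tau: float   # word-order correlation over shared words
    langid: str          # detected language code of the paraphrase
    src_len: int         # number of tokens in the source sentence
    para_len: int        # number of tokens in the paraphrase


def keep_pair(stats, max_f1=0.5, max_tau=0.5, lang="de", min_len=5, max_len=40):
    """Return True if the pair passes all filters.

    All default thresholds are placeholders, not values from the paper.
    """
    if stats.langid != lang:
        return False  # paraphrase is not in the expected language
    if not (min_len <= stats.src_len <= max_len):
        return False  # source too short or too long
    if not (min_len <= stats.para_len <= max_len):
        return False  # paraphrase too short or too long
    if stats.f1_score > max_f1:
        return False  # too much lexical overlap, i.e. too little diversity
    if stats.kendall_tau > max_tau:
        return False  # word order too similar to the source
    return True
```

Something like `kept = [p for p in pairs if keep_pair(p)]` would then select the surviving pairs, with the thresholds tuned on a sample of your own corpus.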

How to get the lexical diversity with the SQuAD evaluation scripts?

Use this function. We used the precision in the paper, but I think f1_score is more appropriate if your data doesn't have a length bias like paraNMT's (paraNMT is notorious for dropping content).
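For readers without the link, a SQuAD-style token overlap score can be sketched as below. This is a simplified stand-in, not the linked function: `normalize` and `unigram_overlap` are hypothetical names, the normalization only lowercases and strips ASCII punctuation (the SQuAD script also removes English articles), and treating the paraphrase as the "prediction" is an assumption.

```python
import string
from collections import Counter


def normalize(text):
    """Lowercase, strip ASCII punctuation, and split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return text.split()


def unigram_overlap(source, paraphrase):
    """Return (precision, recall, f1) of paraphrase tokens against source tokens."""
    src_tokens = normalize(source)
    para_tokens = normalize(paraphrase)
    # Multiset intersection: each shared token counts min(count_src, count_para) times.
    common = Counter(src_tokens) & Counter(para_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0, 0.0, 0.0
    precision = num_same / len(para_tokens)
    recall = num_same / len(src_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

A lower score means fewer shared words, so the pair is lexically more diverse. Precision is sensitive to a paraphrase that drops content (a short paraphrase that copies a few words scores high), which is why F1 may be the safer choice on data without that length bias.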

Kendall Tau

Use this function.
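One way to measure word reordering with Kendall's tau is sketched below. It is an assumption-laden illustration rather than the linked function: `shared_word_kendall_tau` is a hypothetical name, tokenization is plain whitespace splitting, and only words that occur exactly once in both sentences are compared so that every shared word has an unambiguous position.

```python
from collections import Counter
from itertools import combinations


def shared_word_kendall_tau(source, paraphrase):
    """Kendall's tau between the orderings of shared words in two sentences.

    Returns a value in [-1, 1], or None if fewer than two words are shared.
    """
    src = source.lower().split()
    para = paraphrase.lower().split()
    src_counts, para_counts = Counter(src), Counter(para)
    # Shared words in source order, restricted to words unique on both sides.
    shared = [w for w in src if src_counts[w] == 1 and para_counts[w] == 1]
    if len(shared) < 2:
        return None
    para_pos = {w: i for i, w in enumerate(para)}
    ranks = [para_pos[w] for w in shared]
    concordant = discordant = 0
    for a, b in combinations(ranks, 2):
        # The pair is in source order by construction; check the paraphrase order.
        if a < b:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)
```

A value near 1 means the shared words appear in the same order in both sentences, and a value near -1 means the order is reversed, so lower values indicate more reordering.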

Please feel free to reopen if you have more questions.