Preprocess/filter custom data

Hi, thanks for your work! I'm trying to use your method to transfer contemporary German text to the style of the Swiss author Jeremias Gotthelf (19th century). I'm at the first step, training the paraphraser. At the moment I have 386k backtranslated TED talk sentences (the English translations were translated to German with T5).

Now I want to filter the backtranslated corpus, and reading #38 gave me a first idea. But there are some points I do not yet understand. Let me describe what I have understood so far:

Put the backtranslation data in a TSV file and run https://github.com/martiansideofthemoon/style-transfer-paraphrase/blob/master/datasets/prepare_paraphrase_data.py. This gives me a train and a dev pickle with data lines like: None, None, None, Sentence, BacktranslatedSentence, None, None, None, None. Those positions stand for tmp1, tmp2, equality, sent1, sent2, f1_scores, kendall_tau_scores, ed_scores, langid_scores, as used in https://github.com/martiansideofthemoon/style-transfer-paraphrase/blob/master/datasets/parse_paranmt_postprocess.py.

In Appendix A.1 of your paper you describe the filtering steps. To get the data ready to run with parse_paranmt_postprocess.py, I have to write my own script:

Calculate get_kendall_tau() and f1_score() with preprocess_utils.py. Since I only have German sentences, I don't need langid for filtering. What are tmp1 and tmp2? Are those sent1 and sent2 normalized? What is ed_scores?
Filter by content: Calculate the similarity measure with test_sim.py and drop results with a score lower than 0.5, filter by length difference with parse_paranmt_postprocess.py (lendiffless), and filter by length (you propose 7 to 25 tokens). This gives me the content-filtered dataset.
Lexical diversity filtering: Which SQuAD evaluation scripts did you use? I guess some from here: https://worksheets.codalab.org/worksheets/0xd53d03a48ef64b329c16b9baf0f99b0c Of course I will have to adapt the hard-coded English articles and so on to German in those scripts. My own script will then filter on the results of those scripts.
Syntactic diversity filtering: I'll take the Kendall tau score and filter the dataset with parse_paranmt_postprocess.py (ktless).
LangID filtering: There's no need to do that here.

Open questions are:

What are tmp1 and tmp2 in the dataset?
What is ed_scores?
How to get the lexical diversity with the SQuAD evaluation scripts?

Sorry for the long issue 😄 Thanks for your help in advance!

Hi @l0rn0r, thanks for your interest in our work and for the detailed issue describing the points of confusion.

tmp1 and tmp2 are benepar constituency parses of the sentences, as you can see in this file. ed_scores are some kind of edit distance scores between parses.

Most importantly, none of tmp1, tmp2, or ed_scores was used for filtering the data; we only used f1_scores, kendall_tau_scores, langid, and sentence lengths. So please ignore these fields, and I'm sorry for the confusion they may have caused.
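
To make the point above concrete, here is a rough sketch of a filter that looks only at the fields that matter (sentence lengths, f1_scores, kendall_tau_scores, and optionally langid). It is not the code in parse_paranmt_postprocess.py. The names `PairRecord` and `keep_pair` are hypothetical, whitespace tokenization is an assumption, and the `max_f1` and `max_kt` defaults are placeholders to tune rather than values from the paper. Only the 7 to 25 token range comes from this thread.

```python
from typing import NamedTuple, Optional


class PairRecord(NamedTuple):
    """Hypothetical container for one (sentence, backtranslation) pair."""
    sent1: str
    sent2: str
    f1_score: float       # lexical overlap between sent1 and sent2
    kendall_tau: float    # word-order agreement between sent1 and sent2
    langid: Optional[str] = None


def keep_pair(rec: PairRecord,
              min_len: int = 7,
              max_len: int = 25,
              max_f1: float = 0.5,     # placeholder threshold, an assumption
              max_kt: float = 0.5,     # placeholder threshold, an assumption
              expected_lang: Optional[str] = None) -> bool:
    """Return True if the pair passes the length, diversity, and langid checks."""
    # Length filter on whitespace tokens (tokenization is an assumption).
    for sent in (rec.sent1, rec.sent2):
        if not (min_len <= len(sent.split()) <= max_len):
            return False
    # Lexical diversity: drop pairs whose wording overlaps too much.
    if rec.f1_score > max_f1:
        return False
    # Syntactic diversity: drop pairs whose word order is too similar.
    if rec.kendall_tau > max_kt:
        return False
    # Language ID check is skipped unless an expected language is given.
    if expected_lang is not None and rec.langid != expected_lang:
        return False
    return True
```

For a monolingual German corpus, leaving `expected_lang` as `None` skips the langid step entirely.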

How to get the lexical diversity with the SQuAD evaluation scripts?

Use this function. We used the precision in the paper, but I think f1_score is more appropriate if your data does not have a length bias like the one in paraNMT (paraNMT is notorious for dropping content).
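
For readers adapting this to German, below is a minimal sketch of SQuAD-style unigram overlap that returns precision, recall, and F1. It follows the normalization order of the SQuAD v1.1 evaluation script (lowercase, strip punctuation, strip articles, collapse whitespace), but it is not the repository's function. The `GERMAN_ARTICLES` list and the names `normalize` and `overlap_scores` are assumptions made for illustration.

```python
import re
import string
from collections import Counter

# Assumed replacement for the English "a|an|the" list in the SQuAD script.
GERMAN_ARTICLES = ("der", "die", "das", "den", "dem", "des",
                   "ein", "eine", "einen", "einem", "einer", "eines")


def normalize(text, articles=GERMAN_ARTICLES):
    """Lowercase, remove ASCII punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    pattern = r"\b(" + "|".join(articles) + r")\b"
    text = re.sub(pattern, " ", text)
    return " ".join(text.split())


def overlap_scores(candidate, reference):
    """Return (precision, recall, f1) of unigram overlap between two sentences."""
    cand_tokens = normalize(candidate).split()
    ref_tokens = normalize(reference).split()
    # Multiset intersection counts each shared token at most min(count) times.
    common = Counter(cand_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0, 0.0, 0.0
    precision = num_same / len(cand_tokens)
    recall = num_same / len(ref_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Precision is sensitive to which sentence is treated as the candidate, while F1 is symmetric, which is one reason to prefer F1 when neither side is systematically shorter.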

Kendall Tau

Use this function.
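
As an illustration of the idea, here is a small self-contained sketch of a Kendall tau score over word order. It is not the repository's get_kendall_tau(); the function name `kendall_tau_word_order`, the whitespace tokenization, and the choice to compare only tokens that occur exactly once in both sentences are all assumptions.

```python
from collections import Counter
from itertools import combinations


def kendall_tau_word_order(sent1, sent2):
    """Kendall tau over tokens that occur exactly once in both sentences.

    Returns a value in [-1, 1], or None if fewer than two such tokens exist.
    """
    toks1 = sent1.lower().split()
    toks2 = sent2.lower().split()
    counts1, counts2 = Counter(toks1), Counter(toks2)
    # Shared tokens, listed in the order they appear in sent1.
    shared = [t for t in toks1 if counts1[t] == 1 and counts2[t] == 1]
    if len(shared) < 2:
        return None
    pos2 = {tok: i for i, tok in enumerate(toks2)}
    ranks = [pos2[t] for t in shared]
    concordant = discordant = 0
    # A pair is concordant if it keeps the same relative order in sent2.
    for a, b in combinations(ranks, 2):
        if a < b:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)
```

A score near 1 means the backtranslation preserves the word order of the original, so lower scores indicate more syntactic diversity.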

Please feel free to reopen if you have more questions.

martiansideofthemoon / style-transfer-paraphrase

Preprocess/filter custom data #46