Closed — l0rn0r closed this issue 2 years ago
Hi @l0rn0r, thanks for your interest in our work and your detailed issue describing the points of confusion.

`tmp1` and `tmp2` are benepar constituency parses of the sentences, as you can see in this file. `ed_scores` are some kind of edit-distance scores between the parses.

Most importantly, none of `tmp1`, `tmp2`, `ed_scores` were used for filtering the data --- we only used `f1_scores`, `kendall_tau_scores`, `langid` and sentence lengths. So please ignore these fields, and I'm sorry for the confusion they may have caused.
> How to get the lexical diversity with the SQuAD evaluation scripts?

Use this function. We used the `precision` in the paper, but I think `f1_score` is more appropriate if you don't have a length bias in your data like ParaNMT does (ParaNMT is notorious for dropping content).
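As a rough illustration of what such a metric computes (this is a sketch of the standard SQuAD-style token-overlap measure, not the exact `f1_score` in `preprocess_utils.py`, which may normalize text differently):

```python
from collections import Counter

def token_f1(prediction, ground_truth):
    """SQuAD-style token-overlap F1, precision, recall between two strings.

    Illustrative sketch only; the repo's own function may apply extra
    normalization (punctuation stripping, article removal, etc.).
    """
    pred_tokens = prediction.lower().split()
    gt_tokens = ground_truth.lower().split()
    common = Counter(pred_tokens) & Counter(gt_tokens)  # multiset overlap
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0, 0.0, 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gt_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return f1, precision, recall
```

Lower overlap means higher lexical diversity, so a paraphrase filter can, for example, drop near-copies whose F1 is close to 1.0.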
> Kendall Tau

Use this function.
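A minimal sketch of a word-reordering score: map each token of the paraphrase to its position in the source sentence and compute Kendall's tau of that position sequence. This is an assumption about the shape of `get_kendall_tau()`, not its exact code:

```python
def kendall_tau(order):
    """Kendall's tau of a position sequence against its sorted order."""
    n = len(order)
    if n < 2:
        return 1.0  # degenerate case: treat as perfectly ordered
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            if order[i] < order[j]:
                concordant += 1
            elif order[i] > order[j]:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

def reordering_score(sent1, sent2):
    """Tau over the sent1-positions of shared tokens, in sent2's order.

    Sketch only: duplicate tokens keep their first position and
    tokenization is plain whitespace; the repo's code may differ.
    """
    positions = {}
    for i, tok in enumerate(sent1.lower().split()):
        positions.setdefault(tok, i)
    order = [positions[t] for t in sent2.lower().split() if t in positions]
    return kendall_tau(order)
```

A score near 1.0 means the paraphrase preserves word order; a low or negative score signals heavy reordering, which is what a `ktless`-style filter thresholds on.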
Please feel free to reopen if you have more questions.
Hi, thanks for your work! I'm trying to use your method to transfer contemporary German text into the style of the Swiss author Jeremias Gotthelf (19th century). I'm at the first step, training the paraphraser - at the moment I have 386k backtranslated TED-talk sentences (English translated to German with T5).
Now I want to filter the backtranslated corpus, and reading #38 gave me a first idea. But there are some points I do not yet understand. Here is what I have understood so far:
`None, None, None, Sentence, BacktranslatedSentence, None, None, None, None`

Those positions stand for `tmp1`, `tmp2`, `equality`, `sent1`, `sent2`, `f1_scores`, `kendall_tau_scores`, `ed_scores`, `langid_scores` as used in https://github.com/martiansideofthemoon/style-transfer-paraphrase/blob/master/datasets/parse_paranmt_postprocess.py. In your paper you describe in
Appendix A.1 the filtering steps. To get the data ready to run with `parse_paranmt_postprocess.py`, I have to write my own script computing `get_kendall_tau()` and `f1_score()` from `preprocess_utils.py`. Since I only have German sentences, I don't need `langid` for filtering. I then compute the similarity measure with `test_sim.py` and drop results with a score lower than 0.5, filter by length difference with `parse_paranmt_postprocess.py` (lendiffless), and filter by length (you propose 7 to 25 tokens). Now I have the content-filtered dataset, and finally I filter by Kendall Tau with `parse_paranmt_postprocess.py` (ktless).

Open questions are:

- What are `tmp1` and `tmp2` in the dataset? Are those `sent1` and `sent2` normalized?
- What is `ed_scores`?

Sorry for the long issue 😄 Thanks for your help in advance!
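For anyone following this thread, the steps described above can be sketched end to end. Everything below is an assumption-laden illustration: the tab separator, the `max_len_diff` value, and the helper names are placeholders, and the similarity scores from `test_sim.py` are assumed to be precomputed.

```python
def keep_pair(sent1, sent2, sim_score,
              min_len=7, max_len=25, max_len_diff=5, min_sim=0.5):
    """Content filters from the discussion: 7-25 tokens per sentence,
    similarity >= 0.5, and a length-difference cap (the cap value here
    is a placeholder, not a setting from the paper)."""
    len1, len2 = len(sent1.split()), len(sent2.split())
    if not (min_len <= len1 <= max_len and min_len <= len2 <= max_len):
        return False
    if abs(len1 - len2) > max_len_diff:
        return False
    return sim_score >= min_sim

def make_row(sent1, sent2):
    """One 9-column line in the layout discussed above (tmp1, tmp2,
    equality, sent1, sent2, f1_scores, kendall_tau_scores, ed_scores,
    langid_scores), with unused fields as the string "None".
    The tab separator is an assumption."""
    return "\t".join(["None", "None", "None", sent1, sent2,
                      "None", "None", "None", "None"])
```

In this sketch, `keep_pair` gates each backtranslated pair and `make_row` writes the survivors in the column layout that `parse_paranmt_postprocess.py` expects.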