Pipeline code to filter the PARANMT-50M corpus

martiansideofthemoon / style-transfer-paraphrase

Official code and data repository for our EMNLP 2020 long paper "Reformulating Unsupervised Style Transfer as Paraphrase Generation" (https://arxiv.org/abs/2010.05700).

http://style.cs.umass.edu

MIT License

Pipeline code to filter the PARANMT-50M corpus #38

Closed FatemehMashhadi closed 2 years ago

FatemehMashhadi commented 2 years ago

Is there any pipeline code to filter the PARANMT-50M corpus?

martiansideofthemoon commented 2 years ago

Hi @FatemehMashhadi, we've added the code for reference only.

I've added the main script here: https://github.com/martiansideofthemoon/style-transfer-paraphrase/blob/master/datasets/parse_paranmt_postprocess.py The command to run it is at the top of the script (don't rely on the default values).

This code operates on cached data, which was generated using the get_kendall_tau() and f1_score functions here: https://github.com/martiansideofthemoon/style-transfer-paraphrase/blob/master/datasets/preprocess_utils.py. You will need to write your own script to run these functions over the corpus (this step was fairly slow, so I parallelized it on the cluster). The scripts for paraphrase similarity can be found here: https://github.com/martiansideofthemoon/style-transfer-paraphrase/tree/master/style_paraphrase/evaluation#similarity
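For anyone writing that driver script, here is a rough sketch of what the caching step could look like. It is not the repository's code: `unigram_f1`, `word_order_tau`, `score_pair`, and `score_all` are hypothetical stand-ins written for illustration, and the real `get_kendall_tau()` and `f1_score` in preprocess_utils.py may tokenize and handle edge cases differently. The sketch assumes whitespace tokenization and computes Kendall's tau only over words that occur exactly once in both sentences.

```python
from collections import Counter
from itertools import combinations
from multiprocessing import Pool


def unigram_f1(source, paraphrase):
    """Token-overlap F1 between two whitespace-tokenized sentences (illustrative)."""
    src, par = source.lower().split(), paraphrase.lower().split()
    if not src or not par:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(src) & Counter(par)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(par)
    recall = overlap / len(src)
    return 2 * precision * recall / (precision + recall)


def word_order_tau(source, paraphrase):
    """Kendall's tau of word order, over words unique in both sentences (illustrative).

    Returns None when fewer than two such words exist, since tau is undefined.
    """
    src, par = source.lower().split(), paraphrase.lower().split()
    src_counts, par_counts = Counter(src), Counter(par)
    shared = [w for w in src if src_counts[w] == 1 and par_counts[w] == 1]
    if len(shared) < 2:
        return None
    par_pos = {w: i for i, w in enumerate(par)}
    # Shared words in source order, mapped to their positions in the paraphrase.
    ranks = [par_pos[w] for w in shared]
    concordant = sum(1 for a, b in combinations(ranks, 2) if a < b)
    total = len(ranks) * (len(ranks) - 1) // 2
    discordant = total - concordant
    return (concordant - discordant) / total


def score_pair(pair):
    """Score one (source, paraphrase) pair; top-level so it can be pickled."""
    source, paraphrase = pair
    return unigram_f1(source, paraphrase), word_order_tau(source, paraphrase)


def score_all(pairs, workers=1):
    """Score every pair, optionally spreading the work over several processes."""
    if workers <= 1:
        return [score_pair(p) for p in pairs]
    with Pool(workers) as pool:
        return pool.map(score_pair, pairs, chunksize=1000)
```

Calling `score_all(pairs, workers=8)` from under an `if __name__ == "__main__":` guard would spread the work over eight local processes. On a cluster, the same idea applies by splitting the corpus into shards and running one job per shard, then writing each shard's scores to disk as the cache.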

If you need the cached data as well, please drop me an email at kalpesh@cs.umass.edu