jpwahle / emnlp22-transforming

The official implementation of the EMNLP 2022 paper "How Large Language Models are Transforming Machine-Paraphrased Plagiarism".
https://aclanthology.org/2022.emnlp-main.62/

How could we get the relation between GPT-3/T5 paraphrases and the original text? #1

Closed: fengsxy closed this issue 1 year ago

fengsxy commented 1 year ago

I have looked at the dataset; it has the original, GPT, and T5 texts. But I couldn't find the mapping between an original text and its paraphrased sentences. Is it possible to find the mapping relationship?

jpwahle commented 1 year ago

Hi there, sure, we will add the IDs of the source document to each paraphrase pair. Since the final task is single-sentence classification, we did not include them in the 🤗 HF dataset. Soon there will be another release on Zenodo, similar to this one here, in which all IDs are present.
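For anyone getting started in the meantime, here is a minimal sketch of loading the single-sentence classification data from the 🤗 Hub; the dataset id and the column names below are assumptions, so check the Hub page for the actual identifiers.

```python
# Minimal sketch: load the single-sentence classification dataset from the HF Hub.
# The dataset id and the column names ("text", "label") are assumptions,
# not confirmed identifiers -- check the Hub page for the real ones.
from datasets import load_dataset

ds = load_dataset("jpwahle/autoregressive-paraphrase-dataset")  # hypothetical id
print(ds)              # available splits and columns
print(ds["train"][0])  # one example: a single sentence plus its label
```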

fengsxy commented 1 year ago

> Hi there, sure, we will add the IDs of the source document to each paraphrase pair. Since the final task is single-sentence classification, we did not include them in the 🤗 HF dataset. Soon there will be another release on Zenodo, similar to this one here, in which all IDs are present.

Thank you for your reply. I have reviewed the proposed URL. I am confused about why the number of originals is not equal to the number of paraphrased texts.

jpwahle commented 1 year ago

Hi there 👋🏻 In the dataset that I sent, we don't have aligned examples. We extracted all paragraphs from the articles (e.g., arXiv) and paraphrased ~50% of them with either method. Therefore, in this case, the number of paraphrases and originals is not exactly equal (but certainly similar). In other datasets, we did it differently, and the numbers are exactly equal. I hope this answers your question. Let me know if I can support you further in using the datasets.
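As a quick sanity check of that roughly 50/50 split, one could count the labels directly; the split name and the binary `label` column in this sketch are assumptions, as is the dataset id.

```python
# Rough check of the original-vs-paraphrase balance described above.
# Assumes a binary "label" column (e.g., 1 = machine paraphrase); the dataset
# id, split name, and column name are assumptions, not confirmed identifiers.
from collections import Counter
from datasets import load_dataset

ds = load_dataset("jpwahle/autoregressive-paraphrase-dataset", split="train")  # hypothetical id
print(Counter(ds["label"]))  # expect roughly, but not exactly, equal counts
```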

fengsxy commented 1 year ago

> Hi there 👋🏻 In the dataset that I sent, we don't have aligned examples. We extracted all paragraphs from the articles (e.g., arXiv) and paraphrased ~50% of them with either method. Therefore, in this case, the number of paraphrases and originals is not exactly equal (but certainly similar). In other datasets, we did it differently, and the numbers are exactly equal. I hope this answers your question. Let me know if I can support you further in using the datasets.

Thank you for your patient answer. I am really looking forward to this dataset, because I am inspired by your paper and want to try related work. Therefore, I want to ask when the GPT mapping paraphrase dataset will be made public. Also, will the dataset be divided into arXiv/wiki/..., or will they just be mixed together?

I have another question as well. I noticed that one prompt in your paper says "change the passage structure". I have done a similar experiment with GPT-3 (fine-tuning it to paraphrase), but it wasn't as strong as I expected. So did the prompt "change the passage structure" work well in your experiments?

jpwahle commented 1 year ago

Hi there, about the annotation of the dataset type (wiki, arxiv, thesis), we will also add that soon. For now, you can get started with these two datasets that have dataset annotations: 1 and 2.
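Once the source-type annotation is available, filtering could look roughly like this; the column name `source`, its string values, and the dataset id are assumptions.

```python
# Sketch: filter by source type (wiki, arxiv, thesis) once that annotation
# is available. The dataset id, the column name "source", and its values
# are assumptions, not confirmed identifiers.
from datasets import load_dataset

ds = load_dataset("jpwahle/autoregressive-paraphrase-dataset", split="train")  # hypothetical id
arxiv_only = ds.filter(lambda ex: ex.get("source") == "arxiv")
print(len(arxiv_only))
```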

About the prompts, some do work better than others; it really depends. Since we submitted this paper to the conference, there have been many papers discussing this issue. I would recommend you generate multiple prompts and choose the Pareto-optimal candidates. This way you can also trace back which of the prompts were actually strong.
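To illustrate the Pareto-optimal selection idea, here is a small sketch with made-up scores. Each prompt is scored on two objectives to maximize, e.g. semantic similarity to the source and detector evasion rate; none of the numbers or prompt choices below come from the paper.

```python
# Illustrative sketch of picking Pareto-optimal prompts, as suggested above.
# All scores are made-up placeholders, not results from the paper.
scores = {
    "change the passage structure": (0.82, 0.61),
    "rewrite in your own words":    (0.78, 0.70),
    "paraphrase formally":          (0.80, 0.58),  # dominated by the first prompt
}

def pareto_front(candidates):
    """Keep prompts that no other prompt matches or beats on both objectives."""
    front = []
    for name, s in candidates.items():
        dominated = any(
            all(o >= v for o, v in zip(other, s)) and other != s
            for other_name, other in candidates.items()
            if other_name != name
        )
        if not dominated:
            front.append(name)
    return front

print(pareto_front(scores))  # -> ['change the passage structure', 'rewrite in your own words']
```

A prompt stays on the front only if no other prompt is at least as good on both axes, which is one way to trace back which prompts were actually strong.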

fengsxy commented 1 year ago

Hello, has the new dataset been provided yet? The GPT-3-generated text has high quality, and generating it costs a lot. 🤗
