rbroc / echo

A Scalable and Explainable Approach to Discriminating Between Human and Artificially Generated Text
https://cc.au.dk/en/clai/current-projects/a-scalable-and-explainable-approach-to-discriminating-between-human-and-artificially-generated-text

Datasets #2

Open rbroc opened 1 year ago

rbroc commented 1 year ago

(All of these tasks will probably require model-specific prompt engineering. Consider evaluating the outputs, either through external metrics or human validation.)

Number of examples per dataset: cap at 5000 (expand if possible)

Paraphrasing: MRPC: paraphrases, 5,801 examples

Summarization: DailyMail / CNN: 300,000 examples (sample 5 for iteration)

Dialogue: DailyDialog: multi-turn dialogues (15k). Approach:

Socratic Questions (context - smart question (HG) - AI-generated question (AI)): https://aclanthology.org/2023.eacl-main.12.pdf

Story Generation:

GitHub (fairseq stories): https://github.com/facebookresearch/fairseq/blob/main/examples/stories/README.md

Kaggle (Writing Prompts): https://www.kaggle.com/datasets/ratthachat/writing-prompts

Additional datasets: GEM: https://aclanthology.org/2021.gem-1.10.pdf
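The 5,000-example cap above could be implemented as a small sampling helper. A minimal sketch, assuming the Hugging Face `datasets` library is used for loading (the `cap_examples` name is hypothetical):

```python
import random

def cap_examples(examples, cap=5000, seed=42):
    """Randomly sample at most `cap` examples (hypothetical helper).

    A fixed seed keeps the sampled subset reproducible across runs.
    """
    examples = list(examples)
    if len(examples) <= cap:
        return examples
    rng = random.Random(seed)
    return rng.sample(examples, cap)

# Assumed usage with Hugging Face `datasets`, e.g. for MRPC:
# from datasets import load_dataset
# mrpc = load_dataset("glue", "mrpc", split="train")  # 5,801 examples
# subset = cap_examples(mrpc, cap=5000)
```

Datasets already under the cap (e.g. MRPC at 5,801 would only shed ~800 examples; DailyMail/CNN would be heavily subsampled) pass through with a single random draw, so the same helper works for every task.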

rbroc commented 1 month ago

I've taken a look at additional datasets and, for paraphrase, dialogue generation, and story generation, I think we can run with the datasets we have. In theory, if we wanted to make bigger claims about differences between LLMs and humans for each task, we would need multiple datasets per task. For summarization this is a possibility: we could consider adding https://huggingface.co/datasets/EdinburghNLP/xsum or https://huggingface.co/datasets/Samsung/samsum, because they come with detailed instructions for humans that can also be provided to models.

That said, I think we should try to wrap up the project and write it up with what we have at the moment. I am leaving this open for now just in case, but the only action needed might be a better prompt for summarization, to simulate the highlight-like behavior. We could also consider adding a few examples from the dataset to illustrate things.

rbroc commented 1 month ago

Regarding prompts, a few suggestions for improvements in the final version. For summarization, I would go for something along the lines of: "Summarise the following news article. Provide your summary in the form of highlights." If models do not comply, we can provide a one-shot example, adding:

```
Here is an example:

Text: {text_1}
Summary: {summary_1}

Text: {target_text}
Summary:
```
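Putting the two pieces together, the prompt assembly could look like the sketch below. This is a minimal illustration, not the project's actual code; the `build_summary_prompt` name and the `(text, summary)` tuple for the one-shot example are assumptions:

```python
def build_summary_prompt(target_text, example=None):
    """Build the summarization prompt; the one-shot example is optional.

    `example`, if given, is a (text, summary) pair drawn from the dataset
    and is only added when zero-shot outputs do not comply.
    """
    prompt = ("Summarise the following news article. "
              "Provide your summary in the form of highlights.\n\n")
    if example is not None:
        text_1, summary_1 = example
        prompt += (f"Here is an example:\n\n"
                   f"Text: {text_1}\nSummary: {summary_1}\n\n")
    prompt += f"Text: {target_text}\nSummary:"
    return prompt
```

Ending the prompt with a bare `Summary:` nudges the model to continue with the summary directly, mirroring the template above.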

For stories, I would go for something like: "Write a short story based on this writing prompt."

For paraphrase and dialogue, I think we are good.