rbroc / echo

A Scalable and Explainable Approach to Discriminating Between Human and Artificially Generated Text
https://cc.au.dk/en/clai/current-projects/a-scalable-and-explainable-approach-to-discriminating-between-human-and-artificially-generated-text

dailymail_cnn: weird cleaning or weird formatting? #44

Open MinaAlmasi opened 7 months ago

MinaAlmasi commented 7 months ago

DailyMail

In yesterday's meeting (#42), @rdkm89 noticed that the cleaned dailymail_cnn version is weirdly formatted:

richard griffiths laid to rest at holy trinity church in stratford-upon-avon .
daniel radcliffe weeps as he leads tributes to beloved star of withnail and i .
nigel havers, lord fellowes and jack whitehall attend moving ceremony .
richard e. grant sends card: 'to my beloved .
uncle monty, chin chin'

However, our sampled raw data looks the same:

Richard Griffiths laid to rest at Holy Trinity church in Stratford-upon-Avon .
Daniel Radcliffe weeps as he leads tributes to beloved star of Withnail and I .
Nigel Havers, Lord Fellowes and Jack Whitehall attend moving ceremony .
Richard E. Grant sends card: 'To my beloved .
Uncle Monty, chin chin'

Looking at the HF dataset (https://huggingface.co/datasets/cnn_dailymail/viewer/3.0.0/train?q=Richard+Griffiths&row=212018), it seems that the data is formatted this way from the start:

[screenshot: HF dataset viewer showing the same formatting in the original highlights]

I'm thinking we can't really fix this - ideally there should not be a dot between the two sentences "To my beloved . Uncle Monty", but that is a semantic issue. However, I could clean the dataset so that there is no whitespace between words and punctuation (the source text is also like this - not just the human completion).
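For concreteness, a minimal sketch of what that whitespace cleaning could look like (the function name and regex are just illustrative, not part of the current pipeline):

```python
import re

def strip_space_before_punct(text: str) -> str:
    """Remove whitespace before punctuation, e.g. 'stratford-upon-avon .' -> 'stratford-upon-avon.'."""
    return re.sub(r"\s+([.,!?;:])", r"\1", text)

print(strip_space_before_punct("richard griffiths laid to rest at holy trinity church in stratford-upon-avon ."))
# richard griffiths laid to rest at holy trinity church in stratford-upon-avon.
```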

Thoughts @rbroc ? (not urgent)

rbroc commented 7 months ago

good catch. this seems like a pretty idiosyncratic issue to me, and there may not be much we can do in terms of cleaning. the good news is that this does not seem to be terribly frequent in the dataset, at least from browsing it. this specific example is good to keep in mind, though, when we choose the features for classification -- i.e., if we use TextDescriptives features that rely on punctuation-based sentence parsing, we should keep in mind that this could bias the results. or we might want to go for features that are not sensitive to formatting/punctuation differences.
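a toy illustration of the concern, using a naive period-based split rather than the actual TextDescriptives pipeline:

```python
# a stray mid-sentence " . " inflates punctuation-based sentence counts
highlight = "richard e. grant sends card: 'to my beloved . uncle monty, chin chin'"

# naive stand-in for a punctuation-based sentence splitter
sentences = [s.strip() for s in highlight.split(".") if s.strip()]
print(len(sentences))  # 3 -- where the highlight is really only one or two sentences
```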

in general, i think some extra "ad hoc" cleaning will probably be required later on, when we train the classifier, depending on the features we decide to use. another way to go about it could be the following. we might want to ensure (by checking and cleaning data manually) that there are no major formatting differences between human and model-generated data at least on the data we use for the human experiment. then we can also train the ML models on the subset of data we feed humans, to ensure that performance on the larger dataset is not driven by issues like these.

rbroc commented 7 months ago

one thing though related to prompting: if you have not settled on a prompt yet, is there a way we can instruct the model to produce a "highlight-like" summary? if you have already settled on a prompt please ignore this!

MinaAlmasi commented 7 months ago

the good news is that this does not seem to be terribly frequent in the dataset

Yes - I was just about to comment that not all rows in dailymail_cnn are like this! For example, the first row of the source column, which we feed to the model (abbreviated here):

nasa has warned of an impending asteroid pass - and says it will be the closest until 2027. the asteroid, designated 2004 bl86, will safely pass about three times the distance of earth to the moon on january 26. it will be the closest by any known space rock this large until asteroid 1999 an10 flies past earth in 2027. see the asteroid's route below . at the time of its closest approach on january 26, the asteroid will be approximately 745,000 miles (1.2 million kilometers) from earth. due to its orbit around the sun, the asteroid is currently only visible by astronomers with large telescopes who are located in the southern hemisphere. (...)

one thing though related to prompting: if you have not settled on a prompt yet, is there a way we can instruct the model to produce a "highlight-like" summary? if you have already settled on a prompt please ignore this!

I'll be prompting today, so I could try this. However, in our last meeting (#42) we settled on prompts that are somewhat general across domains, e.g., "summarize this" and "paraphrase this" rather than "paraphrase this news article", with DailyDialog as the exception. Would this not go against that principle?

MinaAlmasi commented 7 months ago

in general, i think some extra "ad hoc" cleaning will probably be required later on, when we train the classifier, depending on the features we decide to use. another way to go about it could be the following. we might want to ensure (by checking and cleaning data manually) that there are no major formatting differences between human and model-generated data at least on the data we use for the human experiment. then we can also train the ML models on the subset of data we feed humans, to ensure that performance on the larger dataset is not driven by issues like these.

Yes! Agree! I have removed <newlines> from the stories this morning, but will keep #32 open as the data might need further cleaning (which I won't do today, since I am mostly concerned with the source column that we feed to the models, so I can move on with generating).

I have also added the labels A: and B: to the human_completions; we will then need to add these to the AI completions if the model doesn't produce them itself (though it does have a tendency to).
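A rough sketch of that labelling step, assuming the completion is one turn per line and that the labels simply need to be prepended when the model leaves them out (the function name is illustrative, and which speaker goes first would depend on the source dialogue):

```python
def ensure_speaker_labels(completion: str) -> str:
    """Prepend alternating 'A:'/'B:' labels to each turn if the model
    did not produce them itself. Assumes one turn per line."""
    if completion.lstrip().startswith(("A:", "B:")):
        return completion  # the model already labelled the turns
    turns = [t.strip() for t in completion.splitlines() if t.strip()]
    return "\n".join(f"{'A' if i % 2 == 0 else 'B'}: {t}" for i, t in enumerate(turns))
```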

rbroc commented 7 months ago

I'll be prompting today, so I could try this. However, in our last meeting (#42) we settled on prompts that are somewhat general across domains, e.g., "summarize this" and "paraphrase this" rather than "paraphrase this news article", with DailyDialog as the exception. Would this not go against that principle?

I think what I have in mind would be more along the lines of providing a better specification of the task (e.g., summarize this in a few sentences/highlights) which would make it less trivial for the classifier. this should be compatible with what we were discussing. but I would only go for this if: a) it works "off-the-shelf", with no further prompt engineering to do; b) @rdkm89 does not think this is a bad idea :]
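something along these lines, purely to illustrate the degree of specification i mean (the wording is just an example, not a settled prompt):

```python
# illustrative only -- not a settled prompt
generic_prompt = "summarize this: "
specified_prompt = "summarize this in a few short, highlight-like sentences: "
```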

MinaAlmasi commented 7 months ago

I think what I have in mind would be more along the lines of providing a better specification of the task (e.g., summarize this in a few sentences/highlights) which would make it less trivial for the classifier. this should be compatible with what we were discussing. but I would only go for this if: a) it works "off-the-shelf", with no further prompt engineering to do; b) @rdkm89 does not think this is a bad idea :]

I'll give it a go!

MinaAlmasi commented 1 month ago

Another weird dailydialog thing: in the raw data, some completions have a space after the apostrophe in contractions, but not all of them do, so it was not caught earlier.

Problems here:

{"id": "dailydialog-1876", "source": "you don' t look like enjoying this workout. [EOT] i' m not crazy about it at all. ......."}

But not here:

{"id": "dailydialog-3450", "source": "martha. what's wrong? why are you crying? [EOT] jake just broke up with me. [EOT] i'm sorry. ......"}

(Note that the [EOT] tokens have been replaced with A and B in the cleaned data)
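If we decide to normalise this, a minimal sketch of the fix could be a targeted regex (purely illustrative, and it would need checking against the data, since a blunt rule like this also touches possessives such as "the girls' room"):

```python
import re

def fix_spaced_contractions(text: str) -> str:
    """Collapse the stray space after an apostrophe inside contractions,
    e.g. "don' t" -> "don't", "i' m" -> "i'm"."""
    return re.sub(r"(\w)'\s+(\w)", r"\1'\2", text)

print(fix_spaced_contractions("you don' t look like enjoying this workout."))
# you don't look like enjoying this workout.
```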