jensroes / prowrite-mixture-models

A comparison of probability models of text writing process data
GNU General Public License v3.0

Describing data sets #4

Open jensroes opened 1 year ago

jensroes commented 1 year ago

I need to outsource some work :) I need help with the description of the data sets in the mixture models paper. Attached is my current draft. You can ignore everything else if you want. Just look at the data sets section.

manuscript.pdf

I've written a little bit about each data set, but you could help me here because you know the data sets better. I want to include at least information about the sample and the writing task. It should be clear what the participants had to do and who they are. I'm also wondering if we should add information about the aim of the study associated with each data set and its main findings, but that's probably going too far. What do you think?

I've mentioned your names (@Mark-Torrance and @RConijn) in the data set sections where I need your input.

Feel free to just paste your text in this issue; I can add it to the paper.

RConijn commented 1 year ago

The LIFT data are published in Vandermeulen et al. (2020a) and described in Vandermeulen et al. (2020b). The primary aim of this dataset was to create a national baseline on synthesis writing in Dutch secondary education, including students' text quality, writing process, and perspectives on writing. Within this national survey, a representative sample of Dutch students (N = 658, mean age = 16.95 years, 428 females and 230 males) in the three highest grades of pre-university education (grades 10, 11, and 12) in the Netherlands was drawn from 43 schools. The students first received instruction on synthesis writing, after which they were asked to complete two synthesis tasks with a small break in between. Students then had a longer lunch break, followed by a survey on writing perspectives and another two synthesis tasks with a small break in between. Across the four tasks, students were asked to write two argumentative and two informative texts on a laptop, one on each of four topics (food additives, self-driving cars, the human-wildlife conflict in Africa, and the pay gap), with order randomized per school. The students received 50 minutes for each task. Not all students completed all four tasks, resulting in a final sample of 2310 synthesis texts. During the synthesis tasks, keystroke data were captured using InputLog (Leijten & Van Waes, 2013; Van Waes et al., 2019, 2021). Afterwards, all texts were assessed on writing quality.

Vandermeulen, N., Van Steendam, E., & Rijlaarsdam, G. (2020a). DATASET - Baseline data LIFT Synthesis Writing project [Data set]. Zenodo. https://doi.org/10.5281/zenodo.3893538

Vandermeulen, N., De Maeyer, S., Van Steendam, E., Lesterhuis, M., Van den Bergh, H., & Rijlaarsdam, G. (2020b). Mapping synthesis writing in various levels of Dutch upper-secondary education: a national baseline study on text quality, writing process and students' perspectives on writing. Pedagogische studiën: tijdschrift voor onderwijskunde en opvoedkunde, 97(3), 187-236.

jensroes commented 1 year ago

Thanks :)

Vandermeulen et al. (2020) and described in Vandermeulen et al., (2020).

Are these the same papers?

Afterwards, all texts were assessed on writing quality.

What does this mean?

RConijn commented 1 year ago

Are these the same papers?

No sorry - I just updated this

What does this mean?

Not sure if we want to mention this, but they had a whole complex procedure for assessing all final texts (including multiple raters, comparative judgments, multiple aspects of writing quality, etc.). As we do not use the writing quality data, I did not want to write much about it (same goes for the writing perspectives).

jensroes commented 1 year ago

Thanks. No, I can't see why we need text quality information for this.

RConijn commented 1 year ago

Good. I only mentioned it because students were aware they were assessed (although it did not affect their grades in any way)

RConijn commented 1 year ago

The PLanTra (Plain Language Training for business content) data are published in Rossetti and Van Waes (2022a) and described in Rossetti and Van Waes (2022b). The primary aim of the research project was to investigate the impact of plain language instruction on business students' strategies to simplify business texts as well as on the comprehensibility of the produced texts. A total of 47 graduate students (mean age = 23 years, 38 females and 9 males, 45 native Dutch speakers) from the master's programme in Business and Economics participated. The study adopted a pre-test post-test design. As a pre-test, participants were asked to rewrite a given text (an extract of a corporate report on sustainability) to make it more engaging and easier to read for a lay audience. Thereafter, the experimental group received online instruction on how to apply plain language principles to sustainability content, while the control group received online instruction exclusively on the topic of sustainability. Participants were asked to spend at least 45 minutes on the instruction module. As a post-test, 2-3 days later, participants were asked to simplify another extract of a corporate report on sustainability. Both extracts were written in English (the participants' second language) and were similar in length (274-278 words) and readability. Participants received as much time as needed for each task. During the tasks, keystroke data were captured using InputLog (Leijten & Van Waes, 2013; Van Waes et al., 2019, 2021). Afterwards, readability scores were calculated for both texts.

Rossetti, A., & Van Waes, L. (2022a). Text simplification in second language: process and product data [Data set]. Zenodo. https://doi.org/10.5281/zenodo.6720290

Rossetti, A., & Van Waes, L. (2022b). It’s not just a phase: Investigating text simplification in a second language from a process and product perspective. Frontiers in Artificial Intelligence, 5. https://doi.org/10.3389/frai.2022.983008

jensroes commented 5 months ago

@Mark-Torrance can you provide some details for the Gunn dataset? What's the sample demographic, what did they have to do in terms of writing? I know that there was a masked and an unmasked condition but that's it really.

RConijn commented 5 months ago

@jensroes I commented on the intro just now (only some very minor things). I'll have a look at the remainder later today