Closed nalmadi closed 3 months ago
I implemented a better synthetic skipping-data generator that replicates the data from Brysbaert & Vitu (1998). The resulting skip probability looks like this (though the user can control it):
I have no idea how to do the same for regressions: I can't find a paper with a distribution or a figure that I can use.
If I understand it correctly, the method for generating word-skipping data uses a user-defined probability to skip words at random at a fixed rate. Given that word skipping rates are primarily influenced by word length, I wonder if it would be more beneficial to take word length into account when generating the skipping data. If skipping is based entirely on a random rate applied equally to all words, the skipping patterns are unlikely to be realistic (e.g., more skips on long words and fewer skips on short words than is typical). Brysbaert & Vitu (1998) have some helpful data on typical skipping probabilities across different word lengths.
The same thing is true for generating regression probabilities.
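To make the word-length suggestion concrete, here is a minimal sketch of length-dependent skipping. It is not the generator that was implemented for this issue. The function names, the `default` parameter, and the probability table are all hypothetical, and the numbers in the table are illustrative placeholders, not values taken from Brysbaert & Vitu (1998).

```python
import random

# Placeholder skip probabilities keyed by word length (in characters).
# These numbers are made up for illustration; they are NOT from
# Brysbaert & Vitu (1998). Replace them with values from the paper
# or from your own data.
EXAMPLE_SKIP_PROB_BY_LENGTH = {
    1: 0.60, 2: 0.50, 3: 0.40, 4: 0.30,
    5: 0.20, 6: 0.15, 7: 0.10, 8: 0.05,
}


def skip_probability(word, prob_by_length, default=0.05):
    """Return the skip probability for a word based on its length.

    Lengths missing from the table fall back to `default`.
    """
    return prob_by_length.get(len(word), default)


def generate_skips(words, prob_by_length, default=0.05, seed=None):
    """Return one boolean per word: True if the word is skipped.

    Each word is skipped independently with the probability that the
    table assigns to its length.
    """
    rng = random.Random(seed)  # local RNG so results are reproducible
    return [
        rng.random() < skip_probability(word, prob_by_length, default)
        for word in words
    ]


if __name__ == "__main__":
    sentence = "the quick brown fox jumps over a sleeping dog".split()
    skips = generate_skips(sentence, EXAMPLE_SKIP_PROB_BY_LENGTH, seed=1)
    for word, skipped in zip(sentence, skips):
        print(f"{word:10s} skipped={skipped}")
```

Passing a table with the same probability for every length reproduces the original uniform-rate behavior, so the user keeps full control over the rate. The same lookup-table shape could be reused for regressions once a suitable source distribution is found.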