rbroc opened this issue 1 month ago
I am very much on the fence about what to do here, and I could use some input. One of my concerns with the full feature set for TextDescriptives is that some features might be sensitive to systematic differences in the use of punctuation across human and AI-generated text. These differences are quite trivial, and we don't want our models to learn from them. Another thing I am in doubt about is whether to augment/replace the feature set we have with more "cognitive" features.
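To make the punctuation/casing worry concrete, this is roughly the sanity check I have in mind (just a sketch; it assumes TextDescriptives' `extract_metrics` helper plus the `en_core_web_sm` spaCy model, and the example string is made up): compute the same metrics on raw vs. punctuation/case-stripped text and see which features move.

```python
# Sketch of a robustness check (assumes TextDescriptives' extract_metrics and
# the en_core_web_sm spaCy model; example text is made up): features that shift
# a lot when punctuation/casing is stripped are candidates for exclusion.
import re
import textdescriptives as td

def strip_trivia(text: str) -> str:
    """Lowercase and drop punctuation, keeping only word characters and spaces."""
    return re.sub(r"[^\w\s]", "", text.lower())

text = "Sure! Here's a short answer: it depends, really..."

raw = td.extract_metrics(text=text, spacy_model="en_core_web_sm",
                         metrics=["descriptive_stats", "readability"])
stripped = td.extract_metrics(text=strip_trivia(text), spacy_model="en_core_web_sm",
                              metrics=["descriptive_stats", "readability"])

# Compare only the numeric columns; large absolute differences = suspect features.
diff = (raw.select_dtypes("number") - stripped.select_dtypes("number")).abs()
print(diff.T)
```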
Could you (@rdkm89, @yuri-bizzoni, and of course @MinaAlmasi :) ) help me decide which one of the following options is best here:
1. TextDescriptives only, excluding their Quality features (https://hlasse.github.io/TextDescriptives/quality.html) and any other feature you think might indirectly be affected by trivial things like punctuation and casing;
2. TextDescriptives (again, minus trivial features), augmented with SentSpace features, which include more "cognitive" features (that should not be influenced by trivial things like punctuation);
3. dropping TextDescriptives and going for SentSpace only, focusing on cognitive features (which could be nice for a CMCL workshop paper).

I might be leaning towards the third option, but am very interested in hearing what you think.
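For reference, options 1 and 2 would basically amount to something like this on the extraction side (rough sketch; it assumes the TextDescriptives v2 spaCy-component names, which may differ across versions): register only the components we trust and leave out the quality one.

```python
# Rough sketch of options 1/2 (assumes TextDescriptives v2 spaCy-component names,
# which may differ across versions): add only the components we trust and leave
# out "textdescriptives/quality" and anything punctuation/casing-driven.
import spacy
import textdescriptives as td

nlp = spacy.load("en_core_web_sm")
for component in [
    "textdescriptives/descriptive_stats",
    "textdescriptives/readability",
    "textdescriptives/dependency_distance",
    "textdescriptives/pos_proportions",
]:
    nlp.add_pipe(component)

doc = nlp("An example completion to featurize.")
features = td.extract_df(doc)  # one row of metrics per doc
print(features.T)
```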
I've become a little disenchanted with TextDescriptives recently. I think there is too much redundancy, some of the features are quite trivial, and I still think that some of the methods are not showing what they purport to show (first- and second-order coherence, for example). Moreover, with the exception of maybe mean dependency distance, I don't think there is any reason to assume that there will be significant variation between synthetic vs human completions - no more than the expected within-group variation for humans anyway.
However, I think it depends somewhat on the motivation. Is the goal to create the best-performing predictive model? Or is it to describe differences in ways which are meaningful from a psycholinguistic perspective? If the former, then I guess combining features and paring down will give the best results; if the latter, then SentSpace is probably the way to go.
I personally think that the feature set in SentSpace is more plausible, i.e. it models more closely what readers are sensitive to during sentence processing. So I'm also inclined towards the third option, because I think it's just more interesting in general.
Having spent quite a while looking at the features, I agree with many of the concerns that @rdkm89 points out when it comes to TextDescriptives (esp. redundancy).
Thanks a lot for the input @rdkm89! I think we'd stand strong in the paper if we could achieve somewhat competitive performance (e.g., compared to embedding or LLM-based models) with a very lightweight feature set + describing differences from a psycholinguistic perspective. There is relevant signal, though, in TextDescriptives features (i.e., the models Mina has been fitting using these features do show high classification performance), but I am not sure we can learn much from them in terms of explainability (and I am afraid they are picking up on trivial things).
I'd suggest we still extract both, but we run with SentSpace (or the like) only as a default, only adding TextDescriptives if we want to make a point about increasing performance.
I've noticed that the SentSpace repo has been archived by the authors, but I just got in touch with them to ask if it's just a maintenance thing or whether they believe there is something fundamentally outdated about their approach. It's likely the former, but if it turns out to be the latter we can maybe resort to aggregating dictionary-based word-level features as we do in pliers.
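In case it helps to picture that fallback, something like this (purely illustrative; the lexicon values below are placeholders, not real norms) would give us sentence-level aggregates of word-level dictionary features:

```python
# Illustrative fallback (placeholder lexicon, not real norms): aggregate
# word-level dictionary values (e.g., concreteness) into sentence-level features.
from statistics import mean

concreteness = {"dog": 4.9, "idea": 1.6, "run": 4.1, "freedom": 2.0}  # placeholders

def dictionary_features(sentence: str, lexicon: dict) -> dict:
    tokens = sentence.lower().split()
    values = [lexicon[t] for t in tokens if t in lexicon]
    return {
        "concreteness_mean": mean(values) if values else float("nan"),
        "concreteness_coverage": len(values) / len(tokens) if tokens else 0.0,
    }

print(dictionary_features("The dog had an idea about freedom", concreteness))
```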
More soon!
Regarding Sentspace (@rbroc) - it says "please use our updated fork" on their old readme:
Updated fork: https://github.com/sentspace/sentspace
Update: tried to play with sentspace; installation is hell (not pip-installable + requires Python 3.8). @MinaAlmasi is on it, but we might need to talk about this next week.