The current pipeline is displayed in the image below.
Some steps that may need to be reconsidered:
When extracting metrics (step 3) for both human and AI text, the AI text is lowercased and cleaned on the fly first. This could instead be done in a separate step, with the cleaned text saved. I haven't done this because storing the cleaned copies would make the repo somewhat larger.
When using the metrics for classification (step 4B), only at that point do I remove the few faulty generations that fall below the minimum length. Ideally they should be removed before steps 4A and 4B, so they cannot be accidentally included in other analysis work.
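As a rough illustration of the second point, the length filter could run once, before steps 4A and 4B, so every later analysis sees the same filtered set. This is only a sketch under assumptions: the record layout (dicts with a `text` key), the function name `drop_short_generations`, and the `MIN_WORDS` threshold are all hypothetical and not taken from the repo, and length is measured here in whitespace-separated words, which may differ from the pipeline's actual criterion.

```python
# Hypothetical threshold: the real minimum length used by the pipeline is not stated here.
MIN_WORDS = 50


def drop_short_generations(records, min_words=MIN_WORDS):
    """Split records into (kept, dropped) by whitespace-separated word count.

    Each record is assumed to be a dict with a "text" key (an assumption,
    not the repo's actual data format).
    """
    kept, dropped = [], []
    for record in records:
        n_words = len(record["text"].split())
        if n_words >= min_words:
            kept.append(record)
        else:
            dropped.append(record)
    return kept, dropped


if __name__ == "__main__":
    sample = [
        {"id": 1, "text": "a complete generation with enough words"},
        {"id": 2, "text": "too short"},
    ]
    kept, dropped = drop_short_generations(sample, min_words=5)
    print("kept ids:", [r["id"] for r in kept])
    print("dropped ids:", [r["id"] for r in dropped])
```

Returning the dropped records as well makes it easy to log how many generations were excluded and to check that the count matches what step 4B currently removes.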