Generally, fine-tuning a language model is not yet low-hanging fruit; better and cheaper techniques are needed.
The Creative articulator (CA) project can synthesize summary-to-original-text datasets, so a network can be trained to expand a short plan of a text into the full text. The project also contains a basic container that runs the training, similar to the setup in CoquiTTS.
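A minimal sketch of what such a plan-to-text dataset could look like once synthesized. The record structure, field names, and prompt wording below are assumptions for illustration, not CA's actual output format:

```python
import json

# Hypothetical (summary, full_text) pairs, as a CA-style summarization
# pass might produce them; the contents here are made-up placeholders.
pairs = [
    {
        "summary": "A knight meets a dragon and they argue about gold.",
        "text": "The knight lowered his visor. 'That gold,' he said, "
                "'belongs to the crown.' The dragon only laughed.",
    },
]

def to_training_examples(pairs, prompt_prefix="Expand the plan into full text:\n"):
    """Format plan->text pairs as prompt/completion records for fine-tuning."""
    return [
        {"prompt": prompt_prefix + p["summary"], "completion": p["text"]}
        for p in pairs
    ]

examples = to_training_examples(pairs)
# One JSON object per line, i.e. the usual JSONL shape for seq2seq training.
jsonl_lines = [json.dumps(ex) for ex in examples]
```

The same records could then be fed to whatever trainer the container wraps.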
It might be interesting to train the network on anime/movie dialogues to better capture the genre; whisper-x supports diarization.
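Diarization yields time-stamped segments with speaker labels, which would still need to be merged into dialogue turns before training. A toy post-processing sketch, where the segment field names are assumptions based on that general shape rather than whisper-x's actual API:

```python
# Hypothetical diarization output: consecutive segments with a speaker tag.
segments = [
    {"start": 0.0, "end": 2.1, "speaker": "SPK_0", "text": "Where were you?"},
    {"start": 2.3, "end": 3.0, "speaker": "SPK_1", "text": "Training."},
    {"start": 3.1, "end": 4.0, "speaker": "SPK_1", "text": "In the mountains."},
    {"start": 4.2, "end": 5.5, "speaker": "SPK_0", "text": "Again?"},
]

def to_dialogue_turns(segments):
    """Merge consecutive segments by the same speaker into single turns."""
    turns = []
    for seg in segments:
        if turns and turns[-1]["speaker"] == seg["speaker"]:
            turns[-1]["text"] += " " + seg["text"]
        else:
            turns.append({"speaker": seg["speaker"], "text": seg["text"]})
    return turns

turns = to_dialogue_turns(segments)
```

The resulting turn list is already close to the (context, reply) pairs a dialogue model would train on.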
The CA project also contains a pilot study on predicting speech modality from dialogue. Perhaps gestures and intonations can be predicted as well, so that in free conversation the character image would react naturally to the conversation's course. Along with diarization, emotions and gestures might also be extracted from the video.