This PR reworks the `TranscriptData` class used to process and load the caption data, as raised in #19.
It gives the class a structure similar to the classes used for Topic Modelling and Question Generation, including saving the processed data.
Saving the transcript data does not save much computation yet, but once sentence segmentation is implemented it will avoid reloading and reprocessing the data from scratch on every run.
The changes mainly affect the `configData.py` and `transcriptLoader.py` scripts. A new `.env` variable adds the option to overwrite and reprocess the transcription data.
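
For reviewers, the save-and-reload behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the helper names (`env_flag`, `process_captions`, `load_or_process`), the JSON cache format, and the variable name `OVERWRITE_TRANSCRIPT_DATA` are all assumptions made for the example.

```python
import json
import os
from pathlib import Path


def env_flag(name: str, default: bool = False) -> bool:
    """Read a boolean flag from the environment (hypothetical helper)."""
    value = os.environ.get(name)
    if value is None:
        return default
    return value.strip().lower() in {"1", "true", "yes", "on"}


def process_captions(raw_captions: list[str]) -> list[str]:
    """Placeholder processing step: strip whitespace and drop empty captions."""
    return [c.strip() for c in raw_captions if c.strip()]


def load_or_process(raw_captions: list[str], cache_path: Path, overwrite: bool) -> list[str]:
    """Return the saved data unless it is missing or overwrite is requested."""
    if cache_path.exists() and not overwrite:
        # Reuse the previously saved result instead of reprocessing.
        with cache_path.open(encoding="utf-8") as f:
            return json.load(f)
    processed = process_captions(raw_captions)
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("w", encoding="utf-8") as f:
        json.dump(processed, f)
    return processed
```

A caller would then pass something like `overwrite=env_flag("OVERWRITE_TRANSCRIPT_DATA")` (an assumed variable name), so that setting the flag in `.env` forces the data to be reprocessed and saved again.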
Fixes #19