rmusser01 / tldw

Too Long, Didn't Watch(TL/DW): Your Personal Research Multi-Tool - Open Source NotebookLM
Apache License 2.0

Improvement: Add user-defined token size chunking for summarization #15

Closed rmusser01 closed 1 month ago

rmusser01 commented 1 month ago

A user may have a transcription that is longer than 120k tokens. If so, they should not be expected to break up the transcription themselves. The script should automatically split the transcription into chunks and summarize every chunk, so that the whole transcription is covered.

The OpenAI cookbook demonstrates an example: https://github.com/openai/openai-cookbook/blob/main/examples/Summarizing_long_documents.ipynb

From Issue 24: As a user, I would like to be able to specify a token count. Chunks of that size are cut out of the transcription and summarized piece by piece. These summaries are then strung together, or re-summarized together as one.
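A rough sketch of the flow described above, not the project's actual implementation. Assumptions: whitespace-separated words stand in for real model tokens, and `summarize` is a hypothetical callable that wraps whichever summarization API is in use. The names `chunk_by_tokens` and `summarize_long_text` are made up for illustration.

```python
from typing import Callable, List


def chunk_by_tokens(text: str, chunk_tokens: int) -> List[str]:
    """Split text into pieces of at most chunk_tokens tokens.

    Assumption: whitespace-separated words stand in for real tokens.
    A real version would use the tokenizer of the target API.
    """
    if chunk_tokens < 1:
        raise ValueError("chunk_tokens must be at least 1")
    tokens = text.split()
    return [
        " ".join(tokens[i:i + chunk_tokens])
        for i in range(0, len(tokens), chunk_tokens)
    ]


def summarize_long_text(
    text: str,
    chunk_tokens: int,
    summarize: Callable[[str], str],  # hypothetical wrapper around an API call
    resummarize: bool = False,
) -> str:
    """Summarize each chunk, then join the partial summaries.

    With resummarize=True, the joined summaries are summarized once more
    so the result reads as a single summary.
    """
    partials = [summarize(chunk) for chunk in chunk_by_tokens(text, chunk_tokens)]
    combined = "\n\n".join(partials)
    if resummarize and len(partials) > 1:
        return summarize(combined)
    return combined
```

Passing the summarizer in as a callable keeps the chunking logic independent of any one API, which matters once more than one backend is supported.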

When using the CLI, I should be able to pass an argument so that summarization will occur based on the token count of the transcription, and not based on the entirety of the original transcription. CLI arg:

'--chunk-tokens' / '-ctokens'

The resulting 'chunks' should be user definable and determined through the following command:

'--summary_detail' / '-sd' - Token-count

If the '--token-count' / '-tc' arguments are passed, but the '--summary-detail' or '-sd' arguments are not, then a default detail of 1 is assumed and used instead.
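A minimal `argparse` sketch of how these options might be wired up. It is an illustration under assumptions, not the script's real parser: this issue spells the flags more than one way, so the sketch arbitrarily uses '--chunk-tokens' / '-ctokens' and '--summary-detail' / '-sd'. The float type for the detail value and the `build_parser` name are also assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the two chunking options (flag names assumed)."""
    parser = argparse.ArgumentParser(
        description="Sketch of the chunked-summarization options"
    )
    parser.add_argument(
        "--chunk-tokens", "-ctokens",
        type=int,
        default=None,  # None means: summarize the whole transcription at once
        help="Token count per chunk; enables chunked summarization",
    )
    parser.add_argument(
        "--summary-detail", "-sd",
        type=float,   # assumption: the issue does not state the value's type
        default=1,    # default detail of 1 when the flag is not passed
        help="Level of detail for the chunked summary",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.chunk_tokens, args.summary_detail)
```

With this sketch, passing only `--chunk-tokens 500` leaves `summary_detail` at its default of 1.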

Edit: Splitting time-based chunking off into its own issue vs token-size chunking.

rmusser01 commented 1 month ago

Need to also specify the API to be used for summarization, because each API has its own tokenization requirements. Also, Chesterton's fence. Will implement OpenAI first, based on their example. Plan to add different tokenizer support for different APIs.

rmusser01 commented 1 month ago

OpenAI is currently supported for chunked summarization. Need to implement the Hugging Face AutoTokenizer for use with other APIs.
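One possible shape for per-API tokenizer support is a small registry keyed by API name. This is only a sketch under assumptions: `register_tokenizer`, `count_tokens`, and `whitespace_tokenizer` are hypothetical names, and the whitespace tokenizer is a stand-in for a real one (for example, one built on OpenAI's tokenizer or on a Hugging Face AutoTokenizer).

```python
from typing import Callable, Dict, List

# A tokenizer here is any callable that turns text into a list of tokens.
Tokenizer = Callable[[str], List[str]]

# Hypothetical registry mapping a lowercase API name to its tokenizer.
TOKENIZERS: Dict[str, Tokenizer] = {}


def register_tokenizer(api_name: str, tokenizer: Tokenizer) -> None:
    """Associate a tokenizer with an API name (case-insensitive)."""
    TOKENIZERS[api_name.lower()] = tokenizer


def count_tokens(text: str, api_name: str) -> int:
    """Count tokens in text using the tokenizer registered for api_name."""
    try:
        tokenizer = TOKENIZERS[api_name.lower()]
    except KeyError:
        raise ValueError(f"No tokenizer registered for API '{api_name}'") from None
    return len(tokenizer(text))


def whitespace_tokenizer(text: str) -> List[str]:
    """Stand-in tokenizer: splits on whitespace, not real model tokens."""
    return text.split()


# Stand-in registration; a real setup would register the API's own tokenizer.
register_tokenizer("openai", whitespace_tokenizer)
```

Failing loudly on an unregistered API avoids silently chunking with the wrong tokenizer, which would produce chunks of the wrong size for that model.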

rmusser01 commented 1 month ago

Option for detailed summarization now enabled/configured in the UI. Hidden by default and will show if toggled.