simonw / ttok

Count and truncate text based on tokens
Apache License 2.0

Add split option to split large input into smaller parts instead of truncating #2

Open c4pt0r opened 1 year ago

c4pt0r commented 1 year ago

Sometimes I don't want to discard the remaining output, since it might be useful :) This pull request adds a new feature to the token-counting command-line interface: a -s / --split option that lets users split their input text into multiple files, each containing a specified number of tokens.

Changes

  1. A new option (-s, --split) has been added to the click.command() decorator in the cli function.
  2. The cli function has been updated to handle this new option. When the split and truncate options are used together, the script splits the input tokens into groups of the specified size and writes each group to a separate file.
  3. A new helper function grouper has been added to split an iterable into fixed-length chunks.
  4. The split function now decodes each chunk before writing to the respective output file.
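
A minimal sketch of how the grouper helper and the per-chunk decode step described in items 3 and 4 might look. This is not the code from the pull request: the whitespace-based encode and decode functions are hypothetical stand-ins for the real tokenizer, split_text is an illustrative name, and writing each part to its own file is left out.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def grouper(iterable: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most `size` items; the last may be shorter."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# Hypothetical stand-in tokenizer (assumption): one token per
# whitespace-separated word. The real tool uses a proper tokenizer.
def encode(text: str) -> List[str]:
    return text.split()


def decode(tokens: List[str]) -> str:
    return " ".join(tokens)


def split_text(text: str, size: int) -> List[str]:
    """Tokenize, group into chunks of `size` tokens, and decode each chunk."""
    return [decode(chunk) for chunk in grouper(encode(text), size)]


if __name__ == "__main__":
    for index, part in enumerate(split_text("one two three four five", 2)):
        print(index, part)
```

Grouping the tokens first and decoding each chunk afterwards keeps every part within the token limit, while the final part simply holds whatever tokens remain.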

c4pt0r commented 1 year ago

By the way, thank you for the great handy tools (llm / ttok / strip-tags).

simonw commented 1 year ago

I opened an issue to discuss this here: