Closed: chanind closed this 1 month ago
Attention: Patch coverage is 84.44444%, with 21 lines in your changes missing coverage. Please review.
Project coverage is 73.03%. Comparing base (085d04f) to head (3187391). Report is 1 commit behind head on main.
| Files | Patch % | Lines |
|---|---|---|
| sae_lens/training/pretokenize_runner.py | 73.33% | 15 Missing and 5 partials :warning: |
| sae_lens/training/batching.py | 97.36% | 0 Missing and 1 partial :warning: |
This looks great. Looking forward to playing with this.
Some minor notes (not critical now though), which would make a good follow-up PR:
I'm going to try to make some Gemma-2b and Gemma-7b tokenized datasets with this soon.
Description
This PR adds a pretokenize runner which can be used to pre-tokenize / chunk / shuffle datasets and upload them to Huggingface. I plan on adding a follow-up PR in the future to harmonize the `ActivationsStore` to use the same logic for generating batches as is used here, but decided to hold off on that for now to avoid merge conflicts and to keep this PR more easily reviewable.

This PR has several design considerations after discussion in the OSMI slack:
- Sequences are concatenated and separated by special tokens (e.g. `seq 1 <eos> <bos> seq 2`).

The main batching function, `concat_and_batch_sequences()`, is written as a Python generator, and thus is an iterator, so it can be used directly inside of `ActivationsStore`
in the future to generate batches of tokens.
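For illustration, here is a minimal sketch of what a generator with this shape might look like. This is not the implementation from `sae_lens/training/batching.py`; the signature and the omission of special-token handling are simplifying assumptions.

```python
from typing import Iterable, Iterator

import torch


def concat_and_batch_sequences(
    tokens_iterator: Iterable[torch.Tensor],
    context_size: int,
) -> Iterator[torch.Tensor]:
    """Illustrative sketch: concatenate 1D token tensors from an iterator and
    lazily yield contiguous chunks of length ``context_size``.

    The real function in this PR also handles inserting special tokens
    (e.g. <bos>/<eos>) between sequences; that detail is omitted here.
    """
    buffer = torch.empty(0, dtype=torch.long)
    for tokens in tokens_iterator:
        # Append the next tokenized sequence to the running buffer.
        buffer = torch.cat([buffer, tokens])
        # Emit as many full-length chunks as the buffer currently holds.
        while len(buffer) >= context_size:
            yield buffer[:context_size]
            buffer = buffer[context_size:]
```

Because this is a generator, chunks are produced lazily as the upstream dataset is iterated, which is what should make it straightforward to reuse from `ActivationsStore` later.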
This PR also uploads a `sae_lens.json` metadata file along with the dataset that specifies info about how the dataset was generated, what version of SAELens was used to generate it, special tokens used, context_size, etc. In the future, we can read this data in when loading a pretokenized dataset in `ActivationsStore`.

This PR also includes a tutorial notebook on how to use this runner.
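As a rough sketch of the kind of metadata this describes (the field names below are illustrative assumptions, not the actual schema written by the runner), the file might contain something like:

```python
import json

# Hypothetical contents for sae_lens.json; field names and values are
# illustrative only and may not match the runner's actual schema.
metadata = {
    "sae_lens_version": "x.y.z",                   # version of SAELens used for generation
    "tokenizer_name": "gpt2",                      # tokenizer used to pretokenize the dataset
    "original_dataset": "user/raw-text-dataset",   # hypothetical source dataset path
    "context_size": 128,                           # length of each pretokenized sequence
    "special_tokens": {"prepend_bos": True},       # special-token behaviour used when chunking
}

with open("sae_lens.json", "w") as f:
    json.dump(metadata, f, indent=2)
```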
NOTE: The special tokens config for this runner differs from the way our other configs work, as our current configs only have a `prepend_bos` token option, but not the option to customize this behavior to the extent in this PR. I'm not sure if the names I used for these special token params are ideal, but definitely open to feedback on this! Longer term, we should use the same special token options for the other configs as well.

A sample pretokenized dataset generated with this script can be found here, and the corresponding metadata json can be found here.
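To make the config difference concrete, a hedged sketch of invoking the runner might look like the following. The class and parameter names here follow the spirit of the PR but are assumptions, so the tutorial notebook should be treated as the source of truth:

```python
from sae_lens import PretokenizeRunner, PretokenizeRunnerConfig  # names assumed, not verified

cfg = PretokenizeRunnerConfig(
    tokenizer_name="gpt2",
    dataset_path="user/raw-text-dataset",  # hypothetical source dataset
    context_size=128,
    shuffle=True,
    # Special-token options: richer than the single `prepend_bos` flag used by
    # the other configs, per the note above. Names and values are illustrative.
    begin_batch_token="bos",
    sequence_separator_token="eos",
    # Hypothetical Huggingface repo to upload the pretokenized dataset to.
    hf_repo_id="your-username/your-pretokenized-dataset",
)

PretokenizeRunner(cfg).run()
```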
Fixes #34
Type of change
Checklist:
- You have tested formatting, typing and unit tests (acceptance tests not currently in use)
- You have run `make check-ci` to check format and linting. (You can run `make format` to format code if needed.)