jbloomAus / SAELens

Training Sparse Autoencoders on Language Models
https://jbloomaus.github.io/SAELens/
MIT License

Pretokenize runner #148

Closed chanind closed 1 month ago

chanind commented 1 month ago

Description

This PR adds a pretokenize runner which can be used to pre-tokenize, chunk, and shuffle datasets and upload them to Huggingface. I plan to add a follow-up PR to make ActivationsStore use the same batch-generation logic used here, but decided to hold off on that for now to avoid merge conflicts and keep this PR easier to review.
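For orientation, here is a hedged sketch of how such a runner might be invoked. The entry-point and config field names below are assumptions for illustration only; the tutorial notebook included in this PR shows the actual interface.

```python
# Hypothetical invocation -- the names and fields here are assumptions,
# not the PR's confirmed API; see the tutorial notebook for the real interface.
from sae_lens.training.pretokenize_runner import (
    PretokenizeRunnerConfig,
    pretokenize_runner,
)

cfg = PretokenizeRunnerConfig(
    tokenizer_name="gpt2",                  # tokenizer used to pretokenize
    dataset_path="NeelNanda/c4-10k",        # example source dataset on Huggingface
    context_size=128,                       # tokens per pretokenized row
    shuffle=True,                           # shuffle chunks before upload
    hf_repo_id="your-username/c4-10k-tokenized-gpt2",  # destination dataset repo
)

pretokenize_runner(cfg)
```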

This PR reflects several design considerations from discussion in the OSMI Slack:

The main batching function, concat_and_batch_sequences(), is written as a Python generator (and is therefore an iterator), so it can be used directly inside ActivationsStore in the future to generate batches of tokens.
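A minimal sketch of the generator idea (simplified, not the PR's exact implementation): concatenate incoming token sequences into a buffer and lazily yield fixed-length context_size chunks, carrying leftover tokens across sequence boundaries.

```python
from typing import Iterator

import torch


def concat_and_batch_sketch(
    tokens_iterator: Iterator[torch.Tensor],
    context_size: int,
) -> Iterator[torch.Tensor]:
    """Simplified illustration: yield fixed-length chunks of concatenated tokens."""
    buffer = torch.empty(0, dtype=torch.long)
    for tokens in tokens_iterator:
        buffer = torch.cat([buffer, tokens])
        # Emit as many full chunks as the buffer currently holds;
        # leftover tokens are carried into the next iteration.
        while len(buffer) >= context_size:
            yield buffer[:context_size]
            buffer = buffer[context_size:]
```

Because it is a generator, a consumer such as ActivationsStore can pull chunks lazily in a for-loop without materializing the whole tokenized dataset in memory.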

This PR also uploads a sae_lens.json metadata file along with the dataset that records how the dataset was generated: the version of SAELens used, the special tokens used, the context_size, and so on. In the future, we can read this metadata when loading a pretokenized dataset in ActivationsStore.
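As an illustration of the kind of fields such a metadata file might carry and how a consumer could read it back, here is a hedged sketch; the field names and schema below are assumptions, not the PR's actual format.

```python
import json

from huggingface_hub import hf_hub_download

# Hypothetical contents of sae_lens.json -- field names are assumptions.
example_metadata = {
    "sae_lens_version": "0.x.y",            # version of SAELens that generated the dataset
    "original_dataset": "...",              # source dataset path
    "context_size": 128,                    # tokens per row
    "special_tokens": {"prepend_bos": True},
}

# One way ActivationsStore could later read the metadata back from the Hub.
path = hf_hub_download(
    repo_id="your-username/tokenized-dataset",  # hypothetical dataset repo
    filename="sae_lens.json",
    repo_type="dataset",
)
with open(path) as f:
    metadata = json.load(f)
```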

This PR also includes a tutorial notebook on how to use this runner.

NOTE The special tokens config for this runner differs from the way our other configs work: our current configs only have a prepend_bos option, without the finer-grained control this PR adds. I'm not sure the names I used for these special token params are ideal, but I'm definitely open to feedback on this! Longer term, we should use the same special token options for the other configs as well.
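Purely as an illustration of the extra control pretokenization needs beyond a single prepend_bos flag, here is a hedged sketch of what such special-token options could look like; the field names are placeholders, not necessarily the names used in the PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpecialTokensSketch:
    """Illustrative only: finer-grained control than a single prepend_bos flag."""

    # Token to place at the start of every fixed-size chunk (e.g. "bos"), if any.
    begin_batch_token: Optional[str] = "bos"
    # Token to place at the start of each original sequence, if any.
    begin_sequence_token: Optional[str] = None
    # Token to insert between concatenated sequences within a chunk, if any.
    sequence_separator_token: Optional[str] = "bos"
```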

A sample pretokenized dataset generated with this script can be found here, and the corresponding metadata json can be found here

Fixes #34


Checklist:

You have run formatting, type checks, and unit tests (acceptance tests not currently in use)

codecov[bot] commented 1 month ago

Codecov Report

Attention: Patch coverage is 84.44444% with 21 lines in your changes missing coverage. Please review.

Project coverage is 73.03%. Comparing base (085d04f) to head (3187391). Report is 1 commit behind head on main.

| Files | Patch % | Lines |
|---|---|---|
| sae_lens/training/pretokenize_runner.py | 73.33% | 15 Missing and 5 partials :warning: |
| sae_lens/training/batching.py | 97.36% | 0 Missing and 1 partial :warning: |
Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main     #148      +/-   ##
==========================================
+ Coverage   72.20%   73.03%   +0.83%
==========================================
  Files          17       19       +2
  Lines        1813     1947     +134
  Branches      295      320      +25
==========================================
+ Hits         1309     1422     +113
- Misses        432      447      +15
- Partials       72       78       +6
```

:umbrella: View full report in Codecov by Sentry.

jbloomAus commented 1 month ago

This looks great. Looking forward to playing with this.

Some minor notes:

  1. I think we're getting to the point where the runners should be brought up a level to make the critical interfaces of the codebase clear.
  2. I think we should add a diagram showing which scripts to run, and in what order, to use the whole pipeline.

(not critical now though).

A good follow-up PR:

I'm going to try to make some Gemma-2b and Gemma-7b tokenized datasets with this soon.