jbloomAus / SAELens

Training Sparse Autoencoders on Language Models
https://jbloomaus.github.io/SAELens/
MIT License

Pretokenize runner #148

Closed chanind closed 1 month ago

chanind commented 1 month ago

Description

This PR adds a pretokenize runner which can be used to pre-tokenize, chunk, and shuffle datasets and upload them to Huggingface. I plan to add a follow-up PR to make ActivationsStore use the same batch-generation logic used here, but decided to hold off on that for now to avoid merge conflicts and keep this PR easier to review.
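For orientation, here is a hedged sketch of how such a runner might be invoked. The entry-point and config field names below are assumptions for illustration only; the tutorial notebook included in this PR shows the actual interface.

```python
# Hypothetical invocation -- the names and fields here are assumptions,
# not the PR's confirmed API; see the tutorial notebook for the real interface.
from sae_lens.training.pretokenize_runner import (
    PretokenizeRunnerConfig,
    pretokenize_runner,
)

cfg = PretokenizeRunnerConfig(
    tokenizer_name="gpt2",                  # tokenizer used to pretokenize
    dataset_path="NeelNanda/c4-10k",        # example source dataset on Huggingface
    context_size=128,                       # tokens per pretokenized row
    shuffle=True,                           # shuffle chunks before upload
    hf_repo_id="your-username/c4-10k-tokenized-gpt2",  # destination dataset repo
)

pretokenize_runner(cfg)
```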

This PR reflects several design considerations from discussion in the OSMI Slack:

The main batching function, concat_and_batch_sequences(), is written as a Python generator (and is therefore an iterator), so it can be used directly inside ActivationsStore in the future to generate batches of tokens.
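A minimal sketch of the generator idea (simplified, not the PR's exact implementation): concatenate incoming token sequences into a buffer and lazily yield fixed-length context_size chunks, carrying leftover tokens across sequence boundaries.

```python
from typing import Iterator

import torch


def concat_and_batch_sketch(
    tokens_iterator: Iterator[torch.Tensor],
    context_size: int,
) -> Iterator[torch.Tensor]:
    """Simplified illustration: yield fixed-length chunks of concatenated tokens."""
    buffer = torch.empty(0, dtype=torch.long)
    for tokens in tokens_iterator:
        buffer = torch.cat([buffer, tokens])
        # Emit as many full chunks as the buffer currently holds;
        # leftover tokens are carried into the next iteration.
        while len(buffer) >= context_size:
            yield buffer[:context_size]
            buffer = buffer[context_size:]
```

Because it is a generator, a consumer such as ActivationsStore can pull chunks lazily in a for-loop without materializing the whole tokenized dataset in memory.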

This PR also uploads a sae_lens.json metadata file along with the dataset that records how the dataset was generated: the version of SAELens used, the special tokens used, the context_size, and so on. In the future, we can read this metadata when loading a pretokenized dataset in ActivationsStore.
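As an illustration of the kind of fields such a metadata file might carry and how a consumer could read it back, here is a hedged sketch; the field names and schema below are assumptions, not the PR's actual format.

```python
import json

from huggingface_hub import hf_hub_download

# Hypothetical contents of sae_lens.json -- field names are assumptions.
example_metadata = {
    "sae_lens_version": "0.x.y",            # version of SAELens that generated the dataset
    "original_dataset": "...",              # source dataset path
    "context_size": 128,                    # tokens per row
    "special_tokens": {"prepend_bos": True},
}

# One way ActivationsStore could later read the metadata back from the Hub.
path = hf_hub_download(
    repo_id="your-username/tokenized-dataset",  # hypothetical dataset repo
    filename="sae_lens.json",
    repo_type="dataset",
)
with open(path) as f:
    metadata = json.load(f)
```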

This PR also includes a tutorial notebook on how to use this runner.

NOTE The special tokens config for this runner differs from the way our other configs work: our current configs only have a prepend_bos option, without the finer-grained control this PR adds. I'm not sure the names I used for these special token params are ideal, but I'm definitely open to feedback on this! Longer term, we should use the same special token options for the other configs as well.
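Purely as an illustration of the extra control pretokenization needs beyond a single prepend_bos flag, here is a hedged sketch of what such special-token options could look like; the field names are placeholders, not necessarily the names used in the PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpecialTokensSketch:
    """Illustrative only: finer-grained control than a single prepend_bos flag."""

    # Token to place at the start of every fixed-size chunk (e.g. "bos"), if any.
    begin_batch_token: Optional[str] = "bos"
    # Token to place at the start of each original sequence, if any.
    begin_sequence_token: Optional[str] = None
    # Token to insert between concatenated sequences within a chunk, if any.
    sequence_separator_token: Optional[str] = "bos"
```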

A sample pretokenized dataset generated with this script can be found here, and the corresponding metadata json can be found here

Fixes #34


Checklist:

You have run formatting, type checks, and unit tests (acceptance tests not currently in use)

codecov[bot] commented 1 month ago

Codecov Report

Attention: Patch coverage is 84.44444% with 21 lines in your changes missing coverage. Please review.

Project coverage is 73.03%. Comparing base (085d04f) to head (3187391). Report is 1 commit behind head on main.

| Files | Patch % | Lines |
|---|---|---|
| sae_lens/training/pretokenize_runner.py | 73.33% | 15 Missing and 5 partials :warning: |
| sae_lens/training/batching.py | 97.36% | 0 Missing and 1 partial :warning: |
Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main     #148      +/-   ##
==========================================
+ Coverage   72.20%   73.03%   +0.83%
==========================================
  Files          17       19       +2
  Lines        1813     1947     +134
  Branches      295      320      +25
==========================================
+ Hits         1309     1422     +113
- Misses        432      447      +15
- Partials       72       78       +6
```

:umbrella: View full report in Codecov by Sentry.

jbloomAus commented 1 month ago

This looks great. Looking forward to playing with this.

Some minor notes:

  1. I think we're getting to the point where the runners should be brought up a level to make the critical interfaces of the codebase clear.
  2. I think we should add a diagram showing which scripts to run, and in what order, to use the whole pipeline.

(not critical now though).

A good follow-up PR:

I'm going to try to make some Gemma-2b and Gemma-7b tokenized datasets with this soon.