jbloomAus / SAELens

Training Sparse Autoencoders on Language Models
https://jbloomaus.github.io/SAELens/

fix GPT2 sweep settings to use correct dataset #147

Closed · tomMcGrath closed this 4 months ago

tomMcGrath commented 4 months ago

Description

The current version of scripts/sweep-gpt2.py uses an incorrect dataset. The dataset was tokenised for a different model, so nonsense tokens were passed to the model, invalidating the earlier SAE training runs.

This PR switches to a training dataset that has been tokenised appropriately for GPT-2-S (apollo-research/Skylion007-openwebtext-tokenizer-gpt2) and also cleans up some minor nits in the sweep file.
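
A quick way to catch this class of bug is to decode a few rows of the dataset with the model's own tokenizer and eyeball the output: a dataset tokenised for a different model decodes to gibberish. A minimal sketch (illustrative, not part of this PR; it assumes the pre-tokenised column is named input_ids):

```python
# Sanity check: decode a few rows of the pre-tokenised dataset with GPT-2's
# own tokenizer. If the token ids came from a different tokenizer, the
# decoded text will read as nonsense.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
ds = load_dataset(
    "apollo-research/Skylion007-openwebtext-tokenizer-gpt2",
    split="train",
    streaming=True,  # stream instead of downloading the full dataset
)
for row in ds.take(3):
    # "input_ids" is the assumed column name for the pre-tokenised ids.
    print(tokenizer.decode(row["input_ids"][:32]))
```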

Checklist:

You have tested formatting, typing and unit tests (acceptance tests not currently in use)

codecov[bot] commented 4 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 72.20%. Comparing base (085d04f) to head (18e529c). Report is 1 commit behind head on main.

Additional details and impacted files

```diff
@@           Coverage Diff           @@
##             main     #147   +/-   ##
=======================================
  Coverage   72.20%   72.20%
=======================================
  Files          17       17
  Lines        1813     1813
  Branches      295      295
=======================================
  Hits         1309     1309
  Misses        432      432
  Partials       72       72
```


jbloomAus commented 4 months ago

Sorry for the delay.