jbloomAus / SAELens

Training Sparse Autoencoders on Language Models
https://jbloomaus.github.io/SAELens/

fix GPT2 sweep settings to use correct dataset #147

Closed · tomMcGrath closed this 4 months ago

tomMcGrath commented 4 months ago

Description

The current version of scripts/sweep-gpt2.py uses an incorrect dataset. The dataset was tokenised for a different model, so nonsense tokens were passed to the model, invalidating the earlier SAE training runs.

This PR switches to a training dataset that has been tokenised appropriately for GPT-2-S (apollo-research/Skylion007-openwebtext-tokenizer-gpt2) and also cleans up some minor nits in the sweep file.
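
A quick way to catch this class of bug is to decode a few rows of the dataset with the model's own tokenizer and eyeball the output: a dataset tokenised for a different model decodes to gibberish. A minimal sketch (illustrative, not part of this PR; it assumes the pre-tokenised column is named input_ids):

```python
# Sanity check: decode a few rows of the pre-tokenised dataset with GPT-2's
# own tokenizer. If the token ids came from a different tokenizer, the
# decoded text will read as nonsense.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
ds = load_dataset(
    "apollo-research/Skylion007-openwebtext-tokenizer-gpt2",
    split="train",
    streaming=True,  # stream instead of downloading the full dataset
)
for row in ds.take(3):
    # "input_ids" is the assumed column name for the pre-tokenised ids.
    print(tokenizer.decode(row["input_ids"][:32]))
```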

Checklist:

You have tested formatting, typing and unit tests (acceptance tests not currently in use)

codecov[bot] commented 4 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 72.20%. Comparing base (085d04f) to head (18e529c). Report is 1 commit behind head on main.

Additional details and impacted files

```diff
@@           Coverage Diff           @@
##             main     #147   +/-   ##
=======================================
  Coverage   72.20%   72.20%
=======================================
  Files          17       17
  Lines        1813     1813
  Branches      295      295
=======================================
  Hits         1309     1309
  Misses        432      432
  Partials       72       72
```


jbloomAus commented 4 months ago

Sorry for the delay.