Closed tomMcGrath closed 4 months ago
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 72.20%. Comparing base (085d04f) to head (18e529c). Report is 1 commit behind head on main.
:umbrella: View full report in Codecov by Sentry.
Sorry for the delay.
Description
The current version of `scripts/sweep-gpt2.py` used an incorrect dataset: the data had been tokenised for a different model, so nonsense tokens were passed to the model, invalidating earlier SAE training runs. This PR switches to a training dataset that has been tokenised appropriately for GPT-2-S (`apollo-research/Skylion007-openwebtext-tokenizer-gpt2`) and also cleans up some minor nits in the sweep file.

Fixes # (issue)
Type of change
Please delete options that are not relevant.
Checklist:
You have tested formatting, typing and unit tests (acceptance tests not currently in use): run `make check-ci` to check format and linting. (You can run `make format` to format code if needed.)