jbloomAus / SAELens

Training Sparse Autoencoders on Language Models
https://jbloomaus.github.io/SAELens/
MIT License
490 stars, 127 forks

chore: reduce test space usage in CI #336

Closed: chanind closed this 1 month ago

chanind commented 1 month ago

Description

CI has been failing since #320 was merged because it runs out of space. This appears to be caused by loading and processing large datasets (c4-tokenized-2b). This PR replaces that dataset with a tiny tokenized version of c4-10k: https://huggingface.co/datasets/chanind/c4-10k-mini-tokenized-16-ctx-gelu-1l-tests. It is a tokenized version of the first 1k rows of the c4-10k dataset. It's split into 64 pieces, and the total dataset size is only 250 KB (vs. 2 GB for c4-tokenized-2b).
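For readers unfamiliar with pretokenized test fixtures, here is a minimal sketch of the general idea: take the first few rows of a text dataset, tokenize them, and pack the tokens into fixed-length contexts. This is not the script used to build the dataset above. `toy_tokenize` and `pack_into_contexts` are hypothetical helpers, the whitespace tokenizer stands in for the real model tokenizer, and the context size of 16 is only inferred from the `16-ctx` part of the dataset name.

```python
from itertools import islice
from typing import Iterable, Iterator


def toy_tokenize(text: str, vocab: dict[str, int]) -> list[int]:
    """Hypothetical stand-in tokenizer: one integer id per whitespace-separated word."""
    # setdefault assigns the next free id the first time a word is seen.
    return [vocab.setdefault(word, len(vocab)) for word in text.split()]


def pack_into_contexts(
    rows: Iterable[str], max_rows: int, context_size: int
) -> Iterator[list[int]]:
    """Tokenize the first `max_rows` rows and yield fixed-length token contexts."""
    vocab: dict[str, int] = {}
    buffer: list[int] = []
    for text in islice(rows, max_rows):
        buffer.extend(toy_tokenize(text, vocab))
        # Emit as many full contexts as the buffer currently holds.
        while len(buffer) >= context_size:
            yield buffer[:context_size]
            buffer = buffer[context_size:]
    # Tokens left in the buffer (fewer than context_size) are dropped.


# Toy input: five identical rows, of which only the first four are used.
rows = ["the quick brown fox jumps over the lazy dog"] * 5
contexts = list(pack_into_contexts(rows, max_rows=4, context_size=16))
```

Capping the number of rows and keeping the context short is what keeps a fixture like this small enough for CI.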

codecov[bot] commented 1 month ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 63.97%. Comparing base (ff335f0) to head (a602f56). Report is 2 commits behind head on main.

Additional details and impacted files

```diff
@@           Coverage Diff           @@
##             main     #336   +/-   ##
=======================================
  Coverage   63.97%   63.97%
=======================================
  Files          25       25
  Lines        3223     3223
  Branches      408      408
=======================================
  Hits         2062     2062
  Misses       1052     1052
  Partials      109      109
```


chanind commented 1 month ago

merging as this should be uncontroversial and CI is currently failing due to space issues.