rhiever / sklearn-benchmarks

A centralized repository to report scikit-learn model performance across a variety of parameter settings and data sets.
MIT License

Dataset compression #2

Closed rasbt closed 8 years ago

rasbt commented 8 years ago

I would suggest compressing the CSV files to keep this repo from becoming too bloated. Tools like pandas can conveniently read gzipped files, and the difference for text data is typically huge, on the order of 20x to 50x smaller file sizes.

Edit: that is, if you use the --best flag in gzip
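A minimal sketch of the idea, assuming a throwaway CSV generated on the spot (the file name, column names, and row contents are hypothetical stand-ins, not files from this repo). It compresses the file with gzip --best and prints both sizes; the ratio you get depends entirely on how repetitive the data is.

```shell
#!/bin/bash
set -o errexit

# Hypothetical demo data: a throwaway CSV in a temporary directory.
workdir=$(mktemp -d)
csv="$workdir/example.csv"

# Write a header plus 5000 repetitive rows (stand-in for a real benchmark CSV).
{
    echo "dataset,classifier,accuracy"
    for i in $(seq 1 5000); do
        echo "dataset_$((i % 10)),RandomForestClassifier,0.$((i % 100))"
    done
} > "$csv"

# --best (equivalent to -9) trades speed for the highest compression level.
# -c writes to stdout, so the original file is left in place for comparison.
gzip --best -c "$csv" > "$csv.gz"

# Compare the byte counts of the original and the compressed file.
orig_size=$(wc -c < "$csv" | tr -d ' ')
gz_size=$(wc -c < "$csv.gz" | tr -d ' ')
echo "original: $orig_size bytes, gzipped: $gz_size bytes"
```

On the reading side, pandas.read_csv infers gzip compression from a .gz file extension by default, so the compressed file can usually be loaded without unpacking it first.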

rasbt commented 8 years ago

Cool. Btw, did you remove the old CSV files from the git history? If not, I just wanted to drop a handy script here:

#!/bin/bash
set -o errexit

# Author: David Underhill
# Script to permanently delete files/folders from your git repository.  To use 
# it, cd to your repository's root and then run the script with a list of paths
# you want to delete, e.g., git-delete-history path1 path2

if [ $# -eq 0 ]; then
    exit 0
fi

# make sure we're at the root of git repo
if [ ! -d .git ]; then
    echo "Error: must run this script from the root of a git repository"
    exit 1
fi

# remove all paths passed as arguments from the history of the repo
# (arguments are joined with spaces, so paths containing whitespace are not supported)
files="$*"
git filter-branch --index-filter "git rm -rf --cached --ignore-unmatch $files" HEAD

# remove the temporary history git-filter-branch otherwise leaves behind for a long time
rm -rf .git/refs/original/ && git reflog expire --expire=now --all && git gc --aggressive --prune=now
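
To check whether a path is really gone after a history rewrite, something like the following sketch may help. The helper name path_gone_from_history and the demo repository are hypothetical, made up for illustration. It asks git log whether any commit reachable from any ref still touches the path; the demo only shows the "still in history" case and does not run the rewrite itself.

```shell
#!/bin/bash
set -o errexit

# Hypothetical helper: succeeds if no commit reachable from any ref touches the path.
# --all includes backup refs such as refs/original/, so leftovers are detected too.
path_gone_from_history() {
    [ -z "$(git log --all --oneline -- "$1")" ]
}

# Demo in a throwaway repository (names and contents are made up).
repo=$(mktemp -d)
cd "$repo"
git init --quiet
git config user.email "demo@example.com"
git config user.name "Demo"
echo "a,b" > data.csv
git add data.csv
git commit --quiet -m "Add data.csv"

# data.csv was just committed, so it is expected to still be in the history.
if path_gone_from_history data.csv; then
    echo "data.csv: not in history"
else
    echo "data.csv: still in history"
fi
```

Note that this only inspects commits in the local clone; other clones and forks keep the old objects until they are rewritten or re-cloned.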

rhiever commented 8 years ago

Okay, I think that's done. Thanks!