First of all, thank you for your great work on this project. I didn't have access to a Slurm workload manager, but I was able to use these scripts to preprocess a sample of the GitHub dataset from BigQuery (which was exactly what I wanted to do!). Here are a couple of points that would improve the scripts for the next person:
The script scripts/github-prepare-download.sh mentioned in this README.md seems to be missing from the scripts directory.
The TARGET_DIR variable in the github-global-dedup-slurm.sbatch script should probably be ./data/github/processed_deduped instead of ./data/github_scratch/processed_deduped.
Similarly, the TARGET_DIR and DEDUPED_DIR variables in the github-run-filter-slurm.sbatch script should use github instead of github_scratch.
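In case it helps, here is a rough sketch of how the two path fixes could be applied in one go. It assumes the variables are set with plain `NAME=value` (or `export NAME=value`) lines and that both .sbatch files live in the scripts directory; `fix_github_paths` is a hypothetical helper name, not something from the repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): rewrite
# data/github_scratch to data/github, but only on lines that assign
# TARGET_DIR or DEDUPED_DIR. Assumes plain NAME=value assignments,
# optionally prefixed with "export".
fix_github_paths() {
  local f
  for f in "$@"; do
    # -i.bak edits in place and keeps a backup copy next to each file
    sed -i.bak -E \
      '/^[[:space:]]*(export[[:space:]]+)?(TARGET_DIR|DEDUPED_DIR)=/ s#data/github_scratch#data/github#' \
      "$f"
  done
}

# Example call (file locations are an assumption):
# fix_github_paths scripts/github-global-dedup-slurm.sbatch \
#                  scripts/github-run-filter-slurm.sbatch
```

Other lines that mention github_scratch are left alone, so it is worth checking the .bak copies with diff before deleting them.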
Thanks again for your work on this project.