Closed shayne-longpre closed 5 months ago
@blester125 could you take a look at this pull request when you get a chance? The only thing I was unable to test is the `to_dolma` call, as I'm unable to `pip install dolma` for some reason... any ideas?
Also, how do we run the default linter?
@shayne-longpre run pre-commit
Could you direct me on how to run that?
pip install pre-commit
pre-commit install
pre-commit run -a
Done. Thanks!
We should just set up pre-commit.ci here to auto-format PRs. I'd be happy to do it if I can get the proper permissions.
We already have a CI that blocks merges until the formatting is correct (it uses black and isort), so I don't think we need the CI to actually rewrite commits. The tools are pretty reliable, but I think using pre-commit locally (and thus having people spot-check formatting changes by adding them back to the git staging area) is less error prone.
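For reference, a minimal `.pre-commit-config.yaml` wiring up those two tools might look like the sketch below (the `rev` pins are placeholders, not the repo's actual versions):

```yaml
repos:
  - repo: https://github.com/psf/black
    rev: 24.4.2  # placeholder; pin to whatever the repo standardizes on
    hooks:
      - id: black
  - repo: https://github.com/PyCQA/isort
    rev: 5.13.2  # placeholder
    hooks:
      - id: isort
```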
@blester125 could you take a look at this pull request when you get a chance? The only thing I was unable to test is the `to_dolma` call, as I'm unable to `pip install dolma` for some reason... any ideas?
Still hitting this issue cc @blester125
:thinking: What version of things are you using? I'm on python3.11 and I was able to install both dolma 0.9.1 and dolma 1.0.3 on ubuntu 22.04 and 23.04 respectively.
Maybe try to see if you can bump your Python version so that there is a pre-built wheel for it? The list of wheels is here: https://pypi.org/project/dolma/#files
@blester125 I've reviewed the new licenses with Aviya and we trimmed them down further for an abundance of caution.
The number of included datasets is now ~340, as compared to ~500 before. Everything appears to be working!
I pulled the branch to double check that the dolma stuff is working, but when I run download with the include_test.csv file it seems like neither of the datasets has a `user_parent` key. Is there a step I missed? The README just says to run download.
Also, how much effort would it be to set things up so they run from the data_provenance dir instead of one level up, where we have to set the PYTHONPATH?
@blester125 sorry the new HuggingFace object key is "dataset" not "user_parent"! I just updated it -- thanks for catching.
As for PYTHONPATH, I'm actually not clear on what the changes would be? Python paths always confuse me.
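As background on why PYTHONPATH matters here: directories listed in PYTHONPATH are prepended to the interpreter's `sys.path`, which is why running from the repo root needed it set for the package imports to resolve. A small sketch demonstrating this (the directory name is made up and doesn't need to exist):

```python
import os
import subprocess
import sys

# Launch a child interpreter with a hypothetical directory on PYTHONPATH
# and check that it shows up on the child's sys.path.
env = dict(os.environ, PYTHONPATH="/tmp/fake_pkg_dir")
result = subprocess.run(
    [sys.executable, "-c", "import sys; print('/tmp/fake_pkg_dir' in sys.path)"],
    env=env,
    capture_output=True,
    text=True,
)
print(result.stdout.strip())  # → True
```

Running the scripts from inside the data_provenance dir sidesteps this entirely, since the current directory is already on `sys.path`.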
I just pushed a commit that fixes some issues I found while testing (the include_test.csv was missing the 'GitHub License'), and I updated it so the code should be run from the data_provenance dir instead of the repo root (which removes the need for the PYTHONPATH). Also, the datasets in the test csv seemed to be using `targets` instead of `labels`, so I updated the code to be able to handle both.
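The `targets`/`labels` compatibility fix could look something like the sketch below; the function name and exact behavior are hypothetical, and the actual code in the commit may differ:

```python
def get_labels(example: dict):
    """Return the label field whether the dataset uses 'labels' or 'targets'.

    Hypothetical sketch of the compatibility fix described above; prefers
    'labels' when both keys are present.
    """
    for key in ("labels", "targets"):
        if key in example:
            return example[key]
    raise KeyError("example has neither 'labels' nor 'targets'")


print(get_labels({"targets": ["a"]}))  # → ['a']
print(get_labels({"labels": ["b"]}))  # → ['b']
```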
I was able to run the download and to_dolma scripts with the new include_test.csv, and the results look good. I think it's ready to merge, but @shayne-longpre should take a quick look at my changes to make sure I didn't miss anything.
Adding the Data Provenance data.