Closed shayne-longpre closed 5 months ago
@blester125 could you take a look at this pull request when you get a chance? The only thing I was unable to test is the `to_dolma` call, as I'm unable to `pip install dolma` for some reason... any ideas?
Also, how do we run the default linter?
@shayne-longpre run pre-commit
Could you direct me on how to run that?
pip install pre-commit
pre-commit install
pre-commit run -a
Done. Thanks!
We should just set up pre-commit.ci here to auto-format PRs. I'd be happy to do it if I can get the proper permissions.
We already have a CI that blocks merges until the formatting is correct (it uses black and isort), so I don't think we need the CI to actually rewrite commits. The tools are pretty reliable, but I think using pre-commit locally (and thus having people spot-check formatting changes by adding them back to the git staging area) is less error prone.
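For reference, a minimal `.pre-commit-config.yaml` wiring up those two tools might look like the sketch below (the `rev` pins are placeholders, not the repo's actual versions):

```yaml
repos:
  - repo: https://github.com/psf/black
    rev: 24.4.2  # placeholder; pin to whatever the repo standardizes on
    hooks:
      - id: black
  - repo: https://github.com/PyCQA/isort
    rev: 5.13.2  # placeholder
    hooks:
      - id: isort
```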
@blester125 could you take a look at this pull request when you get a chance? The only thing I was unable to test is the `to_dolma` call, as I'm unable to `pip install dolma` for some reason... any ideas?
Still hitting this issue cc @blester125
:thinking: What version of things are you using? I'm on python3.11 and I was able to install both dolma 0.9.1 and dolma 1.0.3 on ubuntu 22.04 and 23.04 respectively.
Maybe try to see if you can bump your Python version so that there is a pre-built wheel for it? The list of wheels is here: https://pypi.org/project/dolma/#files
@blester125 I've reviewed the new licenses with Aviya and we trimmed them down further for an abundance of caution.
The number of included datasets is now ~340, as compared to ~500 before. Everything appears to be working!
I pulled the branch to double check that the dolma stuff is working, but when I run download with the include_test.csv file it seems like neither of the datasets has a `user_parent` key. Is there a step I missed? The README just says to run download.
Also, how much effort would it be to set things up so they run from the data_provenance dir instead of one level up, where we have to set the PYTHONPATH?
@blester125 sorry the new HuggingFace object key is "dataset" not "user_parent"! I just updated it -- thanks for catching.
As for PYTHONPATH, I'm actually not clear on what the changes would be? Python paths always confuse me.
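As background on why PYTHONPATH matters here: directories listed in PYTHONPATH are prepended to the interpreter's `sys.path`, which is why running from the repo root needed it set for the package imports to resolve. A small sketch demonstrating this (the directory name is made up and doesn't need to exist):

```python
import os
import subprocess
import sys

# Launch a child interpreter with a hypothetical directory on PYTHONPATH
# and check that it shows up on the child's sys.path.
env = dict(os.environ, PYTHONPATH="/tmp/fake_pkg_dir")
result = subprocess.run(
    [sys.executable, "-c", "import sys; print('/tmp/fake_pkg_dir' in sys.path)"],
    env=env,
    capture_output=True,
    text=True,
)
print(result.stdout.strip())  # → True
```

Running the scripts from inside the data_provenance dir sidesteps this entirely, since the current directory is already on `sys.path`.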
I just pushed a commit that fixes some issues I found while testing (the include_test.csv was missing the 'GitHub License'), and I updated it so the code should be run from the data_provenance dir instead of the repo root (which removes the need for the PYTHONPATH). Also, the datasets in the test csv seemed to be using `targets` instead of `labels`, so I updated the code to be able to handle both.
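The `targets`/`labels` compatibility fix could look something like the sketch below; the function name and exact behavior are hypothetical, and the actual code in the commit may differ:

```python
def get_labels(example: dict):
    """Return the label field whether the dataset uses 'labels' or 'targets'.

    Hypothetical sketch of the compatibility fix described above; prefers
    'labels' when both keys are present.
    """
    for key in ("labels", "targets"):
        if key in example:
            return example[key]
    raise KeyError("example has neither 'labels' nor 'targets'")


print(get_labels({"targets": ["a"]}))  # → ['a']
print(get_labels({"labels": ["b"]}))  # → ['b']
```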
I was able to run the download and to_dolma scripts with the new include_test.csv, and the results look good. I think it's ready to merge, but @shayne-longpre should take a quick look at my changes to make sure I didn't miss anything.
Adding the Data Provenance data.