Update get_data.sh - Githubissues

r-three / common-pile

Repo to hold code and track issues for the collection of permissively licensed data

MIT License


Update get_data.sh #81

conceptofmind closed 4 months ago

conceptofmind commented 4 months ago

Only the most recent dump should be used, which is now 2024-05-06. Otherwise you are making numerous duplicates of the same data. This was confirmed by Mike Lissner.

I will update a script to include the rest of the Spark code next.

conceptofmind commented 4 months ago

The previous dumps also contain data that was purposefully removed in the newer dumps, so the diffs should not be included.

blester125 commented 4 months ago

LGTM, let's fix the lint error and add a comment about why only the newest data is needed, so we don't forget and think that adding more dates is an easy way to get more data lol

conceptofmind commented 4 months ago

Will do. Other parts of the script need to be changed too but this is the most glaring issue. Will need to fix the rest of the text columns.

conceptofmind commented 4 months ago

Added comments:

# Only download the data from the most recent CL dump
# The newest dump contains the previous dumps' data
# Differences from the previous data should not be included
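
For anyone reading later, here is a rough sketch of what "only the newest dump" could look like in a shell script. It is not the actual contents of get_data.sh. The `latest_dump` helper, the list of dates (apart from 2024-05-06), the base URL, and the file name pattern are all made up for illustration. It assumes dump dates are written as YYYY-MM-DD, so a plain lexicographic sort is also chronological.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical list of available dump dates (YYYY-MM-DD).
# Only 2024-05-06 comes from this thread; the other dates are invented.
dump_dates=("2023-12-04" "2024-05-06" "2024-03-11")

# Hypothetical helper: print the most recent date among its arguments.
# ISO-style dates sort chronologically with a plain text sort.
latest_dump() {
  printf '%s\n' "$@" | sort | tail -n 1
}

# Only download the data from the most recent CL dump.
# The newest dump contains the previous dumps' data, and older dumps
# include records that were later removed on purpose.
latest="$(latest_dump "${dump_dates[@]}")"

# Placeholder base URL and file name pattern (assumptions, not the real ones).
base_url="https://example.invalid/bulk-data"
echo "Would download: ${base_url}/opinions-${latest}.csv.bz2"
```

Picking a single date in one place, rather than looping over every date, is what avoids both the duplicates and the purposely removed records.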

And the lint should be ok for the other file now. Used black/isort.