Closed: conceptofmind closed this 4 months ago
The previous dumps also contain data that was purposefully removed in the newer dumps, so the diffs should not be included.
LGTM, let's fix the lint error and add a comment about why only the newest data is needed so we don't forget and think that adding more dates is an easy way to get more data lol
Will do. Other parts of the script need to be changed too but this is the most glaring issue. Will need to fix the rest of the text columns.
Added comments:
# Only download the data from the most recent CL dump
# The newest dump contains the previous dumps' data
# Differences from the previous data should not be included
And the lint should be ok for the other file now. Used black/isort.
Only the most recent dump should be used, which is now 2024-05-06. Otherwise you are making numerous duplicates of the same data. This is confirmed by Mike Lissner. I will update a script to include the rest of the Spark code next.
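For reference, a minimal sketch of the "newest dump only" rule, not the actual script from this PR. The function name `latest_dump_date` and the list `CANDIDATE_DUMP_DATES` are hypothetical, and every date except 2024-05-06 is a made-up placeholder:

```python
from datetime import date

# Hypothetical candidate dates. Only 2024-05-06 comes from this thread;
# the other two are placeholders, not real CL dump dates.
CANDIDATE_DUMP_DATES = ["2023-01-01", "2024-05-06", "2024-01-01"]


def latest_dump_date(dates):
    """Return the most recent ISO-formatted (YYYY-MM-DD) date string.

    Only the most recent dump should be downloaded: it already contains
    the data from earlier dumps, and earlier dumps may include records
    that were purposefully removed later.
    """
    if not dates:
        raise ValueError("no dump dates given")
    # Parse each string as a date so the comparison is chronological.
    return max(dates, key=date.fromisoformat)


if __name__ == "__main__":
    print(latest_dump_date(CANDIDATE_DUMP_DATES))
```

Selecting a single date this way, instead of looping over every available date, avoids both the duplicates and the reintroduction of removed records.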