ian-whitestone / pyspark-vs-dask

[WIP] Comparing pyspark and dask for speed, memory/CPU usage, and ease of use

File cleanup #2

Closed ian-whitestone closed 5 years ago

ian-whitestone commented 5 years ago

While generating the fake data, the scripts started interfering with each other partway through because they were writing to the same filenames, so I cancelled the jobs and restarted them with new file prefixes.

Need to clean up the old files with the outdated prefixes.

ian-whitestone commented 5 years ago
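import re
import boto3

s3 = boto3.resource('s3')
my_bucket = s3.Bucket('dask-avro-data')

# Old files live under the outdated prefixes and are named '<digits>.avro'.
# Raw string so '\d' and '\.' reach the regex engine unchanged; the dot is
# escaped so it only matches a literal '.'.
old_key_pattern = re.compile(
    r'(application-data|fulfillment-data|scoring-data)/\d+\.avro'
)

# Collect the full listing before deleting anything
objects = list(my_bucket.objects.all())

for obj in objects:
    # fullmatch (rather than match) so keys that merely start with the
    # pattern, e.g. '.../1.avro.bak', are left alone
    if old_key_pattern.fullmatch(obj.key):
        obj.delete()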
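
Since the delete is irreversible, a dry run before the real pass may be worth having. Below is a minimal sketch of that idea, assuming the old keys all look like `<prefix>/<digits>.avro`. The helper names (`find_old_keys`, `batched`, `cleanup`) are made up for illustration and are not part of the repo. It uses the bucket's `delete_objects` call, which accepts at most 1000 keys per request, instead of deleting one object at a time.

```python
import re
from itertools import islice

# Assumption: old files sit under these three prefixes and are named '<digits>.avro'.
OLD_KEY_PATTERN = re.compile(
    r"(application-data|fulfillment-data|scoring-data)/\d+\.avro"
)


def find_old_keys(keys):
    """Return the keys that fully match the old naming pattern, in input order."""
    return [key for key in keys if OLD_KEY_PATTERN.fullmatch(key)]


def batched(items, size=1000):
    """Yield lists of up to `size` items (S3 allows 1000 keys per delete request)."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def cleanup(bucket, dry_run=True):
    """List the bucket, pick out old keys, and delete them unless dry_run is set.

    `bucket` is expected to behave like a boto3 Bucket resource: it needs
    `objects.all()` and `delete_objects(Delete=...)`.
    Returns the list of matching keys either way.
    """
    keys = find_old_keys(obj.key for obj in bucket.objects.all())
    if dry_run:
        return keys
    for chunk in batched(keys):
        bucket.delete_objects(Delete={"Objects": [{"Key": key} for key in chunk]})
    return keys
```

With the bucket from the snippet above, `cleanup(my_bucket)` only returns the keys that would be removed, and `cleanup(my_bucket, dry_run=False)` performs the deletion once the list looks right.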