spotify / spark-bigquery

Google BigQuery support for Spark, SQL, and DataFrames
Apache License 2.0

Cleanup from Dataproc bucket #9

Closed mosheeshel closed 8 years ago

mosheeshel commented 8 years ago

Thanks for an excellent library, saved me a ton of work!

It seems that running saveAsBigQueryTable creates many files in the Dataproc bucket. They look like tmp files from the process, but they do not get cleaned up when the cluster is deleted, and there is no mention of a cleanup action anywhere. I only found them by accident while doing some manual digging in the bucket.

Just a tiny sample (there were many, many thousands):

/hadoop/tmp/spark-bigquery/spark-bigquery-1470089007094=105594505/_temporary/0/_temporary/attempt_201608012203_0012_m_000931_0/#1470089014836000...
/hadoop/tmp/spark-bigquery/spark-bigquery-1470142403992=2044817433/part-r-01462-db5b41ac-a0ac-4a52-9e8b-f2b368c97cd6.avro#1470142479116000...

These might also be left over from aborted contexts; it's hard to know. In that case you're off the hook :smile:, but I would appreciate a mention of this in the docs.
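In the meantime, a minimal sketch for purging these by hand with the Hadoop FileSystem API (the bucket name `my-dataproc-bucket` is a placeholder, the prefix is taken from the paths above, and this assumes the GCS connector is on the classpath, as it is on Dataproc):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Placeholder bucket name; the prefix matches the stray files listed above.
val tmpPath = new Path("gs://my-dataproc-bucket/hadoop/tmp/spark-bigquery")
val fs = FileSystem.get(tmpPath.toUri, new Configuration())

// Recursively delete everything under the spark-bigquery tmp prefix.
if (fs.exists(tmpPath)) {
  fs.delete(tmpPath, true) // true = recursive delete
}
```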

nevillelyh commented 8 years ago

Thanks for reporting. Yes, I'm aware of this but don't have a good solution for it. We export BQ tables to Avro on GCS when reading them into Spark, but we can't delete those files right away since they may not yet be cached in memory. This shouldn't be a problem if you let Dataproc manage the default bucket, but otherwise you'll have to clean up the tmp files yourself later. I'll update the README.
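For reference, the read path being discussed looks roughly like this (a sketch from the README as I recall it; the setter method names and placeholder project/bucket values are assumptions and may vary by version):

```scala
import com.spotify.spark.bigquery._

// Placeholder project and bucket; the bucket configured here is used
// as the GCS staging area for BQ -> Avro exports.
sqlContext.setBigQueryProjectId("my-billing-project")
sqlContext.setBigQueryGcsBucket("my-dataproc-bucket")

// Reading a table goes through that GCS staging step first;
// the staged Avro files are the ones that can linger.
val df = sqlContext.bigQueryTable("bigquery-public-data:samples.shakespeare")
```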

nevillelyh commented 8 years ago

Actually, the BQ -> DataFrame part is handled by the BigQuery Hadoop connector and should be fine. For the DF -> BQ part we save the DF as Avro on GCS and then load it into BigQuery. I'll see if we can clean up right away.
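The shape of that write path is roughly the following. This is a simplified sketch, not the actual implementation; the bucket/path and the helper name `stageAndLoad` are placeholders:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.DataFrame

def stageAndLoad(df: DataFrame): Unit = {
  // 1. Stage the DataFrame as Avro files on GCS (placeholder bucket/path).
  val gcsTmp =
    s"gs://my-dataproc-bucket/hadoop/tmp/spark-bigquery/spark-bigquery-${System.currentTimeMillis()}"
  df.write.format("com.databricks.spark.avro").save(gcsTmp)

  // 2. Run a BigQuery load job pointing at the staged Avro files
  //    (elided here; the real code drives this through the BigQuery API).

  // 3. The cleanup in question: delete the staged files once the load
  //    job completes, instead of leaving them behind in the bucket.
  val tmpPath = new Path(gcsTmp)
  val fs = FileSystem.get(tmpPath.toUri, new Configuration())
  fs.delete(tmpPath, true)
}
```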

nevillelyh commented 8 years ago

@mosheeshel fixed in c3123d2. You wanna give it a try before I release?

mosheeshel commented 8 years ago

Sure, give me a minute.

mosheeshel commented 8 years ago

Sorry @nevillelyh, I forgot I don't have a local environment for this and am working from the published artifact... I looked at the code fix, and it looks like it should do what we need, though it could increase the step time significantly.

nevillelyh commented 8 years ago

It works for me locally with a credential JSON file. It does add a little extra time, but that shouldn't be a problem unless your DF is extremely large and ends up as many Avro files.
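For anyone testing locally the same way, the connectors can pick up a service account JSON key via Hadoop configuration. A sketch, with the caveat that the exact key names below are the GCS/BigQuery connector settings as I recall them and may vary by connector version:

```scala
// `sc` is the SparkContext; paths are placeholders.
val hadoopConf = sc.hadoopConfiguration

// GCS connector: authenticate with a service account JSON key file.
hadoopConf.set("google.cloud.auth.service.account.enable", "true")
hadoopConf.set("google.cloud.auth.service.account.json.keyfile", "/path/to/key.json")

// BigQuery Hadoop connector equivalent (key name assumed).
hadoopConf.set("mapred.bq.auth.service.account.json.keyfile", "/path/to/key.json")
```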

I'll close this and go ahead with the release.