Closed: mosheeshel closed this issue 8 years ago
Thanks for reporting. Yes, I'm aware of this but don't have a good solution for it. We export BQ tables to Avro files on GCS when reading them into Spark, but we can't delete those files right away since the data may not be cached in memory yet. This shouldn't be a problem if you let Dataproc manage the default bucket, but otherwise you'll have to clean up the tmp files later. I'll update the README.
Actually the BQ -> DataFrame part is handled by the BigQuery Hadoop connector and should be fine. For the DF -> BQ part we save the DataFrame as Avro on GCS and then load it into BigQuery. I'll see if we can clean up right away.
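For anyone following along, here's a minimal sketch of what "clean up right away" could look like on the DF -> BQ path: stage the DataFrame as Avro under a temporary GCS directory, run the BigQuery load, then delete the staging directory with the Hadoop FileSystem API. The helper name, the staging path, and the load step are illustrative, not the library's actual code.

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper sketching immediate cleanup of the GCS staging dir.
def saveViaGcsStaging(spark: SparkSession, df: DataFrame, stagingDir: String): Unit = {
  // 1. Stage the DataFrame as Avro files on GCS.
  //    ("avro" is built in since Spark 2.4; older versions used
  //    "com.databricks.spark.avro" from the spark-avro package.)
  df.write.format("avro").save(stagingDir)

  // 2. ...submit a BigQuery load job pointing at stagingDir and wait for it...

  // 3. Delete the staged Avro files right away so they don't pile up
  //    in the Dataproc bucket.
  val path = new Path(stagingDir)
  val fs = path.getFileSystem(spark.sparkContext.hadoopConfiguration)
  fs.delete(path, true) // recursive delete
}
```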
@mosheeshel fixed in c3123d2. Want to give it a try before I release?
sure, give me a minute
sorry @nevillelyh, I forgot I don't have a local environment for this and I'm working from the published artifact... I looked at the code fix and it looks like it should do what we need, though it could increase the step time significantly
It works for me locally with a credential JSON file. It does add a little extra time but shouldn't be a problem unless your DF is extremely large and ends up with many Avro files.
I'll close this and go ahead with the release.
Thanks for an excellent library, saved me a ton of work!
It seems that running saveAsBigQueryTable creates many files in the Dataproc bucket. These look like tmp files from the process, but they do not get cleaned up when the cluster is deleted, and there is no reference to any cleanup action. I found them by accident while doing some manual digging in the bucket.
Just a tiny sample (there were many, many thousands):
/hadoop/tmp/spark-bigquery/spark-bigquery-1470089007094=105594505/_temporary/0/_temporary/attempt_201608012203_0012_m_000931_0/#1470089014836000...
/hadoop/tmp/spark-bigquery/spark-bigquery-1470142403992=2044817433/part-r-01462-db5b41ac-a0ac-4a52-9e8b-f2b368c97cd6.avro#1470142479116000...
These might also be left over from aborted contexts, it's hard to know; in that case you're off the hook :smile: but I would appreciate a mention in the docs about this.
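If anyone else ends up with leftover files like the above, a one-off cleanup of the tmp prefix is straightforward. Here is a rough sketch that recursively deletes everything under the spark-bigquery tmp directory via the Hadoop FileSystem API; the bucket name is a placeholder, and running gsutil rm -r on the same prefix would do the same job.

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// One-off cleanup sketch: recursively delete leftover spark-bigquery tmp
// files from the Dataproc bucket. "my-dataproc-bucket" is a placeholder.
val tmpPrefix = "gs://my-dataproc-bucket/hadoop/tmp/spark-bigquery/"

// Assumes the GCS connector is on the classpath (it is on Dataproc clusters).
val conf = new Configuration()
val fs = FileSystem.get(new URI(tmpPrefix), conf)
fs.delete(new Path(tmpPrefix), true) // true = recursive delete of the whole prefix
```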