Closed: iconara closed this issue 5 years ago.
Hey, first of all, thanks for putting up these scripts, they make my life easier trying to migrate some data from Redshift, though I'm still struggling with limitations at the moment. Did you manage to find a workaround for this issue in particular? The problem is that if the data is not gzipped I hit the other issue with the transfer limitations, since it easily goes over 20 files. Otherwise I'll probably need to think of some decompression script on a VM. Thanks
No, sorry, this error is unresolved. I have a workaround where I transfer the files compressed to GCS, then spin up a 32-core GCP instance to decompress and re-upload the files to GCS, and then I run the BigQuery load.
I basically run this in a startup script:
gsutil ls "${source_prefix}*" | parallel "gsutil cp {} - | gzip -d | gsutil cp - ${destination_prefix}\$(basename {})"
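Filled in with placeholder bucket and table names, the whole startup script looks roughly like the sketch below. The bq invocation at the end is only an illustration of the load step (BigShift normally runs the load itself, and the exact flags depend on how the dump was produced), and this version also strips the .gz suffix since the re-uploaded copies are no longer compressed:

#!/bin/bash
set -euo pipefail

# Placeholders; point these at your own buckets/paths.
source_prefix="gs://my-bucket/dump/"                  # compressed Redshift dump, already transferred to GCS
destination_prefix="gs://my-bucket/dump-uncompressed/"

# Stream every object through gzip -d and re-upload it uncompressed,
# so nothing has to fit on the instance's local disk.
gsutil ls "${source_prefix}*" | parallel "gsutil cp {} - | gzip -d | gsutil cp - ${destination_prefix}\$(basename {} .gz)"

# Load everything under the uncompressed prefix in a single job with one wildcard URI.
# Assumes my_dataset.my_table already exists with the right schema.
bq load --source_format=CSV my_dataset.my_table "${destination_prefix}*"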
The transfer limitation issue shouldn't be a problem anymore; it was fixed in #3. The 20-file limitation is on how many prefixes are specified in the job, but I've changed BigShift to specify only one prefix now. I see that I haven't closed that issue, though, so I'll go ahead and do that.
OK, thanks for sharing the workaround, I'll try it once I have to deal with this again. I wasn't sure whether the transfer had this problem, but if it doesn't, then the uncompressed option should work.
When trying to load gzip'ed dumps into BigQuery I get this error:
Not compressing the files during the transfer is not an attractive option; it would cost too much. Is there some way to tell Redshift to produce smaller files?
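One possibility, if the Redshift version supports it, is the MAXFILESIZE option of UNLOAD, which caps the size of each output file; below is only a rough sketch with made-up host, table, bucket, and IAM role names, and BigShift would need to pass the option through itself:

# Run against the Redshift cluster; every name below is a placeholder.
PGPASSWORD="$REDSHIFT_PASSWORD" psql \
  --host my-cluster.abc123.us-east-1.redshift.amazonaws.com \
  --port 5439 --username myuser --dbname mydb <<'SQL'
UNLOAD ('SELECT * FROM my_table')
TO 's3://my-bucket/dump/part_'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-unload-role'
GZIP
MAXFILESIZE 256 MB;
SQL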