googleapis / nodejs-storage

Node.js client for Google Cloud Storage: unified object storage for developers and enterprises, from live data serving to data analytics/ML to data archiving.
https://cloud.google.com/storage/
Apache License 2.0

Uploading big files in chunks with resumable upload #2369

Closed mx-wndl-zb closed 7 months ago

mx-wndl-zb commented 8 months ago

I am trying to upload big files into the google cloud storage by creating a resumable upload and appending every chunk to the original file. I can guarantee that the chunks are transmitted in the correct order and the next chunk is only sent when the previous one is finished. If I understand the documentation correctly this should be possible. I'm using version 7.6.0 of the library.

The Node server is a microservice, so I can't guarantee that all chunks are uploaded from the same instance, which is why I am storing the resumableUploadUri and the resumeCRC32C in my database together with a UUID and appending chunks to the file via POST /upload?uuid={fileUuid}.

My issue is that I can't get the upload to work when trying a resumable upload in chunks. If I upload a file in one single request it works like a charm.

A somewhat minimal example:

// Assume this is the input chunk as a file stream
let fileStream;
// Assume this is the metadata from the input file/chunk
let fileInfo;
// Comes from a config file
let chunkSize;

const [ start, end, totalFileSize ] = req.headers['content-range'].match(/\d+/g).map(num => parseInt(num));
let uuid = req.query.uuid;
let resumableUrl, savePath, gcsFile, dbFile;

if(uuid) {
  // Load from database
  dbFile = await this.db.findOne({uuid: uuid});

  if(!dbFile) {
    throw new Error('not found');
  }

  ({resumableUrl, savePath} = dbFile);
  gcsFile = gcsBucket.file(savePath);
} else {
  uuid = uuidv4();
  savePath = createSavePath();
  gcsFile = gcsBucket.file(savePath);

  resumableUrl = (await gcsFile.createResumableUpload())[0];
}

const writeStream = gcsFile.createWriteStream({
  resumable: true,
  uri: resumableUrl,
  offset: start || 0,
  metadata: {
    contentType: fileInfo.mimeType,
  }
});

fileStream.pipe(writeStream);

writeStream.on('finish', async () => {
  await this.db.updateOne( {uuid: uuid}, {resumableUrl: resumableUrl} );
  resolve(); // Resolve uploading promise and go back to 'main loop'
});

With this configuration I get The CRC32C is missing for the final portion of a resumed upload, which is required for validation. Please provide 'resumeCRC32C' if validation is required, or disable 'validation'.

I understand that, so first I tried with validation: false:

const writeStream = gcsFile.createWriteStream({
  resumable: true,
  uri: resumableUrl,
  offset: start || 0,
  metadata: {
    contentType: fileInfo.mimeType,
  },

  validation: false,
});

Then the first chunk seems to be uploaded successfully. On the second chunk I get the following error: Error: Retry limit exceeded - Invalid request. According to the Content-Range header, the upload offset is 3145728 byte(s), which exceeds already uploaded size of 0 byte(s).
The offset matches my chunk size for the particular test file and should be correct assuming the first chunk was uploaded successfully. If I check the storage in the Google Cloud Console the file exists and it also has the size of the first chunk, which is 3145728 byte(s).
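
(Editorial sketch, not part of the original report: one way to check whether the first chunk really landed is to ask the resumable session itself for its persisted offset. The resumable-upload protocol answers an empty PUT with "Content-Range: bytes */<total>" with a 308 and a Range header. This assumes Node 18+ for the global fetch and reuses resumableUrl and totalFileSize from the snippets above.)

// Sketch: query the resumable session for the number of bytes it has stored.
async function getPersistedOffset(resumableUrl, totalFileSize) {
  const res = await fetch(resumableUrl, {
    method: 'PUT',
    headers: {'Content-Range': `bytes */${totalFileSize}`},
  });

  // 308 "Resume Incomplete": the Range header (e.g. "bytes=0-3145727")
  // reports the last byte the server has persisted so far.
  if (res.status === 308) {
    const range = res.headers.get('range');
    return range ? parseInt(range.split('-')[1], 10) + 1 : 0;
  }

  // 200/201 means the session has already been finalized.
  return totalFileSize;
}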

After that I tried including validation, also saving the crc32c from the previous chunk in the database:

const writeStream = gcsFile.createWriteStream({
  resumable: true,
  uri: resumableUrl,
  offset: start || 0,
  metadata: {
    contentType: fileInfo.mimeType,
  },

  resumeCRC32C: resumeCRC32C,
  validation: 'crc32c',
});

The error is the same as the first one: The CRC32C is missing for the final portion of a resumed upload, which is required for validation. Please provide 'resumeCRC32C' if validation is required, or disable 'validation'.

By looking inside the code of the library I found out that there is also a flag isPartialUpload, which needs to be true, otherwise the error above is thrown. The flag also requires a chunkSize, which I added as well:

const writeStream = gcsFile.createWriteStream({
  resumable: true,
  uri: resumableUrl,
  offset: start || 0,
  metadata: {
    contentType: fileInfo.mimeType,
  },

  resumeCRC32C: resumeCRC32C,
  validation: 'crc32c',
  isPartialUpload: true,
  chunkSize: opts.chunkSize,
});

Then I'm back to the same error as with validation: false: Error: Retry limit exceeded - Invalid request. According to the Content-Range header, the upload offset is 3145728 byte(s), which exceeds already uploaded size of 0 byte(s).

What am I missing here?

danielbankhead commented 7 months ago

Hey @mx-wndl-zb,

Your samples look very close to the appropriate setup; however, you'd want to use isPartialUpload on all uploads that are not the final portion of the object. Also note that 'partial' uploads must be a multiple of 256 KiB (256 x 1024 bytes). Here's a complete sample in our integration tests for reference:

https://github.com/googleapis/nodejs-storage/blob/7a96ce6f764076a14f0961623e2dec2ce8893dd7/system-test/kitchen.ts#L255-L307
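
(Editorial sketch, not from the linked test: condensing the advice above into the reporter's snippet. isLastChunk and the accumulated resumeCRC32C are placeholders for state the application would need to track itself, e.g. in the database alongside the resumableUrl.)

const writeStream = gcsFile.createWriteStream({
  resumable: true,
  uri: resumableUrl,
  offset: start || 0,
  metadata: {
    contentType: fileInfo.mimeType,
  },

  // Every chunk except the last one is a partial upload; its size must be
  // a multiple of 256 KiB (256 * 1024 bytes).
  ...(!isLastChunk
    ? {isPartialUpload: true, chunkSize: opts.chunkSize}
    // The final chunk completes the object and carries the CRC32C
    // accumulated over the previous chunks for validation.
    : {resumeCRC32C: resumeCRC32C, validation: 'crc32c'}),
});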

mx-wndl-zb commented 7 months ago

Sorry for the late response. Thank you so much @danielbankhead! This totally fixed the errors.

As I could not find this information in the docs, it might be a nice addition so that other people don't run into this error.