Mitigate document loading issues caused by Google Drive API #export method's unreliable size limit

Morred commented 1 year ago

Problem Description

Not exactly a feature request, but this is the template that worked best because it's not really a bug in the features of this app itself.

TL;DR: This issue is related to the behavior of the files.export method of the Google Drive API library, whose export size limit is apparently subject to change without notice. This has been causing failures to load documents that previously worked perfectly fine and haven't changed in the meantime.

Details: We've recently started getting 500 responses when loading certain pages/documents, caused by files.export (https://developers.google.com/drive/api/reference/rest/v3/files/export) returning 403 because the exported content is supposedly too large. According to the documentation, the exported content is limited to 10MB - however, our affected pages/documents had not changed in size, nor were they larger than 10MB, when this problem started happening.

When digging into this a bit deeper, an old issue on the Google issue tracker came up (https://issuetracker.google.com/issues/36761333) that describes a similar problem. One comment states the following:

I reached out to the engineering team, and this error is working as expected. There is a limit as to how large of a file the files.export() endpoint can handle, and this error is thrown when files are larger than limit. In my testing I'm seeing the limit closer to 10MB, but it's worth noting that the limit is subject to change without warning.

As a workaround, you can use the Drive v2 API's files.get() endpoint to retrieve the exportLinks for the file and fetch that URL instead. From my testing that URL does not have the same limit as files.export() and more closely matches the behavior seen in the Google Docs UI's "File > Download as" menu item.

Seeing as we didn't make any changes to our documents, it looks like the limit was in fact changed without warning, which leads to some documents being unable to load.

Feature

One way to address the problem, as mentioned in the thread on the Google issue tracker (see the quote above in the Problem Description section), is to use the exportLinks of the file instead of the #export method. This is of course less convenient, but it doesn't seem to have a size limit. In practice it could work something like this:

get the file's exportLinks (https://developers.google.com/drive/api/reference/rest/v3/files#exportLinks) aka direct download links, by including it in the fields here: https://github.com/nytimes/library/blob/main/server/list.js#L106
pass it through to here: https://github.com/nytimes/library/blob/main/server/docs.js#L43
directly make an HTTP call to one of them instead of using the #export method
alternatively, the regular #export method could be called first, and if it fails with that 403 error response, then a direct HTTP call could be made to one of the export links
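
As a rough illustration of the "make an HTTP call to one of them" step, here is a minimal sketch of picking the right link out of exportLinks. It assumes exportLinks is the map keyed by MIME type that the Drive API returns in the file metadata, and that presentations need text/plain while everything else uses text/html. The name pickExportLink and the example.com URLs in the usage note are hypothetical and not part of the repo.

```javascript
// Hypothetical helper (not in the repo): choose the export link for the
// format we want to render. `exportLinks` is assumed to be the object from
// the file metadata, keyed by MIME type.
function pickExportLink(exportLinks, resourceType) {
  // Assumption: slideshows have no text/html export, so use text/plain.
  const mimeType = resourceType === 'presentation' ? 'text/plain' : 'text/html'
  const link = exportLinks && exportLinks[mimeType]

  if (!link) {
    // Fail loudly instead of requesting `undefined` later on.
    throw new Error(`No export link available for ${mimeType}`)
  }

  return {mimeType, link}
}
```

For example, pickExportLink({'text/html': 'https://example.com/html'}, 'document') would return the text/html entry, and a missing entry raises an error rather than silently continuing.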

I have a semi-complete demo PR on our fork of this repo, which I'd be happy to share once it's done. If it works well, we'll most likely add this or a similar change to our fork, but we thought it would be good if we could coordinate this upstream as well.

Another option, if these changes don't sound desirable, would be to leave things as they are, but at least mention this size limit and how it might arbitrarily change as a known limitation in the Readme, in case others run into it and experience the same problems.

Additional Information

I'd be happy to get some feedback on this, and I'm open to alternative options or approaches.

Morred commented 1 year ago

This is the error message that bubbles up:

2023-06-16T12:09:02.777904+00:00 app[web.1]: {
2023-06-16T12:09:02.777905+00:00 app[web.1]: message: 'This file is too large to be exported.',
2023-06-16T12:09:02.777905+00:00 app[web.1]: stack: 'Error: This file is too large to be exported.\n' +
2023-06-16T12:09:02.777906+00:00 app[web.1]: '    at Gaxios._request (/app/node_modules/gaxios/build/src/gaxios.js:129:23)\n' +
2023-06-16T12:09:02.777907+00:00 app[web.1]: '    at runMicrotasks (<anonymous>)\n' +
2023-06-16T12:09:02.777907+00:00 app[web.1]: '    at processTicksAndRejections (node:internal/process/task_queues:96:5)\n' +
2023-06-16T12:09:02.777907+00:00 app[web.1]: '    at async JWT.requestAsync (/app/node_modules/google-auth-library/build/src/auth/oauth2client.js:343:18)\n' +
2023-06-16T12:09:02.777908+00:00 app[web.1]: '    at async fetchHTMLForId (/app/server/docs.js:69:18)\n' +
2023-06-16T12:09:02.777908+00:00 app[web.1]: '    at async Promise.all (index 0)\n' +
2023-06-16T12:09:02.777908+00:00 app[web.1]: '    at async fetch (/app/server/docs.js:94:24)\n' +
2023-06-16T12:09:02.777908+00:00 app[web.1]: '    at async exports.fetchDoc (/app/server/docs.js:41:20)\n' +
2023-06-16T12:09:02.777909+00:00 app[web.1]: '    at async handleCategory (/app/server/routes/categories.js:68:47)',
2023-06-16T12:09:02.777909+00:00 app[web.1]: response: {
2023-06-16T12:09:02.777909+00:00 app[web.1]: config: {
2023-06-16T12:09:02.777912+00:00 app[web.1]: url: 'https://www.googleapis.com/drive/v3/files/1BqyfQAGbelprPuN8kOXUKpsPfg4qsLoorXAPwdM4Slc/export?mimeType=text%2Fhtml',
2023-06-16T12:09:02.777913+00:00 app[web.1]: method: 'GET',
2023-06-16T12:09:02.777913+00:00 app[web.1]: paramsSerializer: [Function (anonymous)],
2023-06-16T12:09:02.777913+00:00 app[web.1]: headers: {
2023-06-16T12:09:02.777913+00:00 app[web.1]: 'x-goog-api-client': 'gdcl/3.2.2 gl-node/16.20.0 auth/6.1.6',
2023-06-16T12:09:02.777913+00:00 app[web.1]: 'Accept-Encoding': 'gzip',
2023-06-16T12:09:02.777913+00:00 app[web.1]: 'User-Agent': 'google-api-nodejs-client/3.2.2 (gzip)'

afischer commented 1 year ago

Hey @Morred, thanks for the issue. We've recently been seeing the same issue on a few of our documents as well, and have been looking into workarounds. If you have a working proof of concept fix you can share or are able to make a PR, that would be much appreciated!

Morred commented 1 year ago

Will do once I have something that works!

rupertdance commented 1 year ago

Very interesting, we are seeing this exact issue as well. Attempting fixes by splitting many documents in half, re-sizing images etc. A cleaner fix would be desirable however!

Morred commented 1 year ago

It seems like Google has fixed things on their end since yesterday or so, and all the pages that weren't loading before for us are now loading again. Can anyone else here confirm that it's the same for them?

That said, who knows when it will break again and for how long 😬 So I'm going to share what I've looked into, what has worked and what hasn't so far.

The best option I've found so far (with significant caveats described later) was using the file's export link as a fallback method if calling the Google Drive #export endpoint fails with 403 - File too large to export. I'll copy out the most relevant parts below, but can provide a full PR if so desired.

Put this https://github.com/nytimes/library/blob/main/server/docs.js#L56 into a try/catch block and fall back to exporting the data via export link:

try {
  const {data} = await drive.files.export({
    fileId: id,
    // text/html exports are not supported for slideshows
    mimeType: resourceType === 'presentation' ? 'text/plain' : 'text/html'
  })

  return data
} catch (e) {
  // e.response is missing for network-level failures, so guard the lookup
  const errorResponse = e.response?.data?.error
  // If the Google Drive API returns 403 because the file is too large,
  // we fall back to using the export link directly
  if (errorResponse?.code === 403 && errorResponse.message === 'This file is too large to be exported.') {
    console.log('falling back to using the export link...')
    return fetchManually(resourceType, exportLinks)
  }
  // Anything else is a genuine failure and should bubble up unchanged
  throw e
}
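
The check in the catch block could also be pulled out into a small predicate, which makes it easier to test and avoids a TypeError when the error has no response attached (for example on a network failure). This is only a sketch: the name isExportTooLarge is hypothetical, and the error shape (response.data.error with code and message) is assumed from the snippet above rather than from any documented guarantee.

```javascript
// Message text as seen in the logs earlier in this thread.
const TOO_LARGE_MESSAGE = 'This file is too large to be exported.'

// Hypothetical helper: returns true only for the "too large" 403 from #export.
function isExportTooLarge(err) {
  // Walk the nested fields defensively; any missing level means "not our error".
  const apiError = err && err.response && err.response.data && err.response.data.error
  if (!apiError) return false

  return apiError.code === 403 && apiError.message === TOO_LARGE_MESSAGE
}
```

Matching on the message string is brittle if Google ever rewords it, so checking only the 403 code might be a reasonable alternative, at the cost of also falling back on permission errors.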

Here's the function that does the manual exporting:

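const axios = require('axios')

async function fetchManually(resourceType, exportLinks) {
  // getAccessToken() is our own helper and is not shown here
  const accessToken = await getAccessToken()
  // text/html exports are not supported for slideshows, so pick the matching link
  const mimeType = resourceType === 'presentation' ? 'text/plain' : 'text/html'
  const exportLink = exportLinks[mimeType]
  if (!exportLink) throw new Error(`No export link found for ${mimeType}`)

  try {
    const response = await axios({
      url: exportLink,
      method: 'GET',
      // axios expects a response type such as 'text' here, not a MIME type
      responseType: 'text',
      headers: {Authorization: `Bearer ${accessToken}`}
    })
    return response.data
  } catch (err) {
    console.error('Error downloading file:', err)
    // Rethrow so the caller doesn't receive undefined as the document body
    throw err
  }
}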

This works locally, but there are two quite significant downsides:

It takes a really long time (30+ seconds). When I was testing, it took the #export endpoint quite a while to respond with 403 in the first place (occasionally up to 10-20 seconds). Once it falls back to using the export link, that will then take its sweet time as well because it's literally downloading the file. It works locally, but when I tested in our staging environment hosted on Heroku, it would routinely trigger the Heroku router's 30 second timeout. And of course it's also not exactly user friendly to have to wait for half a minute or more until your page loads.
This leads to the second downside, which is the fact that using the export link directly isn't optimized for exporting the file's contents in a nice way, the way the #export endpoint is. From what I can see, it literally downloads the whole thing into memory, which as I mentioned before is not particularly fast, and also kind of a resource hog.

It's probably possible to improve the performance of this, for example by cutting out the call to #export completely and only using the download link (whether that's desirable is another question), and by seeing if there's a reasonable way to stream and chunk-process the response. That would become pretty involved though, and would probably need quite a few changes compared to how things are done now.
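
A less invasive middle ground might be to remember which documents have already hit the size error and skip #export for them for a while, so that only the first load pays for the slow 403. The sketch below is purely illustrative: the names rememberOversized and shouldSkipExport, the in-memory Map, and the one-hour retry window are all assumptions and not something that exists in the repo.

```javascript
// Assumed retry window: after this long, try #export again in case
// Google has raised the limit in the meantime.
const RETRY_AFTER_MS = 60 * 60 * 1000

// Maps a document ID to the time its #export call last failed as "too large".
// In-memory only, so it is per-process and resets on restart.
const oversizedSince = new Map()

// Call this from the catch block when the "too large" 403 is detected.
function rememberOversized(id, now = Date.now()) {
  oversizedSince.set(id, now)
}

// Call this before #export; if it returns true, go straight to the export link.
function shouldSkipExport(id, now = Date.now()) {
  const since = oversizedSince.get(id)
  if (since === undefined) return false

  if (now - since >= RETRY_AFTER_MS) {
    // Entry expired: forget it so the next request tries #export again.
    oversizedSince.delete(id)
    return false
  }

  return true
}
```

This doesn't make the export link itself any faster, so the memory use and the Heroku timeout concerns described above would still apply to the fallback path.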

One small thing we could do right away in the meantime is to add some information to the Readme, specifically

There is a size limit for each page/document, which according to the Google Drive API documentation is 10MB under normal circumstances
The size limit may be changed without announcement from Google's side, which can result in pages failing to load that were previously working.

At least that would help people figure out what's going on when this happens, even if there's no working solution yet.
