jimmywarting / StreamSaver.js

StreamSaver writes streams to the filesystem directly and asynchronously
https://jimmywarting.github.io/StreamSaver.js/example.html
MIT License

RAM ballooning on downloading large zip files [OOM] #293

Open mexicantexan opened 1 year ago

mexicantexan commented 1 year ago

Heyo,

Been tinkering with different configs/imports/ways of calling it, and going through older issues related to my problem, for over a day now, but I can't seem to find where my implementation is messed up. On a 10GB zip download, the script errors out after about 2GB. I'm seeing two peculiarities with how I implemented this library:

  1. Post Message Error - Uncaught TypeError: [StreamSaver] You didn't send a messageChannel at onMessage (mitm.html?version=2.0.0:66:11)
  2. RAM usage seems to grow in step with the amount downloaded (so if I'm downloading 1GB, RAM will increase by 1GB), but the heap size stays very low at about 20MB.

For the post message error, I get about 7 of those when the download first starts, and then another at a regular interval after that, which makes sense with what you're doing under the hood. I'm just not sure why it's erroring, and wondering if it's down to how I am calling StreamSaver. Here's the code that I've got so far:

import streamSaver from 'streamsaver';

export const downloadZip = (payload) => async (_dispatch, getState) => {
  try {
    const {
      userLogin: {userInfo},
    } = getState();

    const controller = new AbortController();
    const timeoutId = setTimeout(() => {
      console.log('setTimeout fired')
      controller.abort()
    }, 3000);

    const res = await fetch(`${DOMAIN}/api/v1/processes/download/`, {
      method: 'POST',
      headers: {
        'Content-type': 'application/json',
        Authorization: `Bearer ${userInfo.access}`,
      },
      body: JSON.stringify({zips: payload, requestingUserId: userInfo.id}),
      signal: controller.signal
    });
    clearTimeout(timeoutId)
    const contentLength = Number(res.headers.get("content-length"))

    console.log(`response length: ${Number(res.headers.get("content-length")) > 0}`)
    console.log(`${Number(res.headers.get("content-length"))}`)

    const fileStream = streamSaver.createWriteStream(`Batch_Download_${new Date().getTime()}.zip`, {
      size: contentLength,
      writableStrategy: new ByteLengthQueuingStrategy({highWaterMark: 1}),
      readableStrategy: new ByteLengthQueuingStrategy({highWaterMark: 1}),
    });
    console.log("returning")
    const usePipeToBool = true
    if (usePipeToBool) {
      return res.body.pipeTo(fileStream).catch(err => {
        return err
      })
    }

    // fallback path (only reached if usePipeToBool is false):
    // manually pump the response body into the StreamSaver writer
    const reader = res.body.getReader()
    const writer = fileStream.getWriter()
    const pump = async () => {
      const { done, value } = await reader.read()
      return done ? writer.close() : writer.write(value).then(pump)
    }
    pump()
      .then(() => console.log('Closed the stream, Done writing'))
      .catch((err) => console.error(`Error downloading: ${err}`));

    return {};
  } catch (error) {
    return error.response && error.response.data.detail ? error.response.data.detail : error.message;
  }
};

Environment: NPM 8.17; streamsaver 2.0.6; react 17.0.2; working with HTTP, not HTTPS

Edit 1: Backend info - The backend is Django, hosted on a separate computer that I do have control over but can't open up directly for the end user to download from. The response from Django is a StreamingHttpResponse.

Definitely a me problem and not a you problem. Just seeing if what I coded is wrong in some way?

jimmywarting commented 1 year ago

If you do have control over the server, then I recommend trying to save the file the way servers usually tell the browser to save files: by responding with a Content-Disposition: attachment header and a filename, possibly a Content-Length, and a Content-Type set to something like application/octet-stream, so the browser doesn't know how to handle the response and resorts to saving the file instead.

All StreamSaver actually does is mimic how a server saves files (the good old-fashioned way).
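For example (a hypothetical Node/Express handler, just to illustrate the headers; the file path and filename are placeholders, and your Django StreamingHttpResponse can set the same headers):

const express = require('express');
const fs = require('fs');
const app = express();

app.post('/api/v1/processes/download/', (req, res) => {
  // these headers make the browser save the response to disk on its own,
  // no StreamSaver needed on the client
  res.setHeader('Content-Type', 'application/octet-stream');
  res.setHeader('Content-Disposition', 'attachment; filename="Batch_Download.zip"');
  // Content-Length is optional, but lets the browser show real progress

  // stream the zip (from disk or from whatever produces it) straight to the client
  fs.createReadStream('/tmp/Batch_Download.zip').pipe(res);
});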

For reference: a download cannot be initiated with an HTTP call to the server via fetch or XMLHttpRequest; it has to be a navigation.

I notice that you are sending a payload and some authentication headers, so a normal GET request (which is what a navigation almost always needs) would not work, except in cases where you can create a <form> and submit it to include things such as files or other kinds of data.

But you are also sending an authentication request header, which makes it a bit trickier, because you can't send custom request headers with forms.

So my recommendation is that you create another way of submitting a form (with the payload you need to send to the server) without the request headers: either by using cookies, by creating some one-time / expiring URL, or by including some kind of token/API key inside the form or in the URL, so it does not have to be sent as a request header.

Then the browser will be able to download files without any need for StreamSaver.
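Roughly along these lines (just a sketch; the field names, endpoint, and token mechanism are up to you, they are not part of StreamSaver):

function startNativeDownload(downloadUrl, zips, oneTimeToken) {
  const form = document.createElement('form');
  form.method = 'POST';
  form.action = downloadUrl;

  // the token goes in the form body (or the URL) instead of an Authorization
  // header, since form submissions can't carry custom request headers
  const tokenField = document.createElement('input');
  tokenField.type = 'hidden';
  tokenField.name = 'token';
  tokenField.value = oneTimeToken;

  const zipsField = document.createElement('input');
  zipsField.type = 'hidden';
  zipsField.name = 'zips';
  zipsField.value = JSON.stringify(zips);

  form.append(tokenField, zipsField);
  document.body.appendChild(form);
  form.submit(); // this is a navigation, so the browser streams the response to disk itself
  form.remove();
}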

danielRicaud commented 1 year ago

Could the following Chromium issue be a possible root cause? It appears that Chromium first writes files to a sandboxed filesystem in order to perform a security scan on them, before fully flushing the file to the filesystem.

It would be great to brainstorm a possible workaround for this, as I'm transforming/decrypting a file as it's downloaded and then piping it to the filesystem with StreamSaver. Using content-disposition or a <form> element wouldn't work in my case, since I don't just want a raw download of the file.
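For context, a pipeline like the one I'm describing looks roughly like this (a rough sketch, not my actual code; decryptChunk stands in for the real per-chunk transform):

import streamSaver from 'streamsaver';

async function downloadAndTransform(url, filename, decryptChunk) {
  const res = await fetch(url);
  const fileStream = streamSaver.createWriteStream(filename);

  const decryptStream = new TransformStream({
    async transform(chunk, controller) {
      // transform/decrypt each chunk as it arrives
      controller.enqueue(await decryptChunk(chunk));
    },
  });

  // backpressure propagates through the chain, so chunks should be written to
  // disk roughly as fast as they arrive instead of piling up in memory
  await res.body.pipeThrough(decryptStream).pipeTo(fileStream);
}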

I'm experiencing a similar issue to the OP's, where RAM usage balloons under a process called Google Chrome Helper (Renderer), but the heap size stays small at about 9 MB. For reference, the test file that I'm downloading and piping is 6 GB in size.

https://bugs.chromium.org/p/chromium/issues/detail?id=1168715

edit: I see you're already involved in that thread 🙏

danielRicaud commented 1 year ago

I wanted to come back and leave an update on my progress. I've solved my high RAM usage.

Deep in my code there was a closure that was causing memory to balloon. Something similar to this:

function getChunk () {
    // the nested function formed a closure over getChunk's scope,
    // keeping the data referenced there alive between calls
    function fetchChunk() {
    }
}

I also changed my pump() logic from .then chaining to an async/await approach to avoid creating extra promises/closures.

const pump = async (): Promise<void> => {
  const res = await reader.read();

  if (res.done) {
    return await writer.close();
  } else {
    await writer.write(res.value);
    return pump();
  }
};

instead of:

const pump = (): Promise<void> =>
    reader.read().then((res) => (res.done ? writer.close() : writer.write(res.value).then(pump)));
pump();

JYbill commented 3 months ago

Because an HTTP download can't be paused from code, my solution is to use HTTP range requests to download the file in chunks and use StreamSaver.js to save them.

code: https://github.com/JYbill/xqv-solution/blob/91b25358eb157f965c76b6e2c1aa0bca11d7ec85/packages/oversize-file-download/src/assets/index.html#L149

By the way, using this approach is best practice.
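A rough sketch of that range-request approach (the chunk size, HEAD request, and function name are illustrative; see the linked code for the real implementation):

import streamSaver from 'streamsaver';

async function rangeDownload(url, filename, chunkSize = 8 * 1024 * 1024) {
  // ask the server for the total size first (assumes it answers HEAD requests)
  const head = await fetch(url, { method: 'HEAD' });
  const total = Number(head.headers.get('content-length'));

  const fileStream = streamSaver.createWriteStream(filename, { size: total });
  const writer = fileStream.getWriter();

  for (let start = 0; start < total; start += chunkSize) {
    const end = Math.min(start + chunkSize, total) - 1;
    // requires the server to honor Range requests (206 Partial Content)
    const res = await fetch(url, { headers: { Range: `bytes=${start}-${end}` } });
    // only one chunk is held in memory at a time before being written to disk
    await writer.write(new Uint8Array(await res.arrayBuffer()));
  }
  await writer.close();
}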