lookit / lookit-api

Codebase for Lookit v2 and Experimenter v2. Includes an API. Docs: http://lookit.readthedocs.io/
https://lookit.mit.edu/
MIT License

Handle large single video downloads #1408

Closed: becky-gilbert closed this 2 months ago

becky-gilbert commented 3 months ago

Fixes #1404

UPDATE

I've re-implemented this with an entirely different solution - see the "Implementation update" section in #1404.

Summary

This PR fixes the problem with attempted downloads of very large single videos causing out-of-memory errors. Instead of downloading the whole file onto the server and then delivering it as our response to the client, we now just redirect the client to the S3 download URL. These pre-signed time-limited URLs are already available to researchers.
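At a high level, the new behaviour is just a redirect. A minimal sketch (the view name and wiring here are illustrative, not necessarily what the PR does):

```python
from django.http import HttpResponseRedirect

def download_video(request, video):
    # Send the client straight to the pre-signed, time-limited S3 URL instead of
    # proxying the file through the server (so it's never loaded into memory).
    return HttpResponseRedirect(video.download_url)
```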

Implementation notes

To get the video file to download automatically (rather than opening in a browser tab), we need to set a header value at the point when we request the pre-signed URL, which happens in the get_download_url function in attachment_helpers.py. Previously this function always returned a single kind of URL, so I've added a parameter that lets it return either a download URL or a view URL. I've also renamed it to get_url to reflect this, and therefore updated the code in the few places where it is used.
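In rough terms, the idea is the following (a sketch assuming boto3; the bucket, key attribute, and exact signature are illustrative, not necessarily what attachment_helpers.py does):

```python
import boto3

S3_CLIENT = boto3.client("s3")

def get_url(video, download=True, expires_in=3600):
    """Return a pre-signed, time-limited S3 URL for the video.

    With download=True, the URL forces a file download via the
    Content-Disposition response header; otherwise the browser may
    play the video in a tab (the "view" URL).
    """
    params = {"Bucket": "BUCKET_NAME", "Key": video.full_name}  # illustrative names
    if download:
        params["ResponseContentDisposition"] = f'attachment; filename="{video.filename}"'
    return S3_CLIENT.generate_presigned_url(
        "get_object", Params=params, ExpiresIn=expires_in
    )
```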

Also, we had a video property that stored the download URL value that was retrieved from the get_download_url function. So I've changed that to only store the actual download URL, and added a new property that gets the view URL. I've also therefore had to change the code in a few places where we were referencing the video's download_url property but where we actually need the view_url.
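On the model side, the split looks roughly like this (property names are from the description above; the bodies are assumptions):

```python
from django.db import models

class Video(models.Model):
    # ... existing fields ...

    @property
    def download_url(self):
        # Pre-signed URL with a Content-Disposition: attachment header; forces a download
        return get_url(self, download=True)

    @property
    def view_url(self):
        # Pre-signed URL without the download header; suitable for in-browser playback
        return get_url(self, download=False)
```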

Testing

This has been tested locally. The following actions need testing on staging/production, on Chrome and Firefox:

OLD VERSION - for reference

Summary

This PR fixes the problem with attempted downloads of very large single videos causing out-of-memory errors. In the video download request, I added a check for the file's content length. If the content length is absent or too large, then it keeps the user on the Individual Responses page and returns a banner error message:

The file size for {video.filename} could not be verified. Please contact the CHS team on Slack or at childrenhelpingscience@gmail.com.

The file {video.filename} is too large to download. Please contact the CHS team on Slack or at childrenhelpingscience@gmail.com.

If neither of those conditions are true, then the file is downloaded.
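Roughly, the old check looked something like this (a sketch only: the view wiring, the MAX_FILESIZE_BYTES value, and the get_content_length helper are illustrative; the helper is sketched under "Implementation notes" below):

```python
from django.contrib import messages
from django.shortcuts import redirect

MAX_FILESIZE_BYTES = 2 * 1024**3  # illustrative value; see "Questions" below

def serve_video_download(request, video, download_url):
    # get_content_length is a hypothetical helper; see the sketch under "Implementation notes"
    content_length = get_content_length(download_url)
    if content_length is None:
        messages.error(request, f"The file size for {video.filename} could not be verified. ...")
        return redirect(request.META.get("HTTP_REFERER", "/"))  # stay on the responses page
    if content_length > MAX_FILESIZE_BYTES:
        messages.error(request, f"The file {video.filename} is too large to download. ...")
        return redirect(request.META.get("HTTP_REFERER", "/"))
    # Otherwise, fetch the file from S3 and return it as the response (as before)
    ...
```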

Questions

  1. What should the file size limit be?
  2. Instead of just presenting an error message, do we want to try to handle the downloading of a large file with an offline celery process to email a zip to the researcher (similar to how the 'Download all videos' option works)? If so, this will take longer to implement.

Implementation notes

I first tried to get the content length with a HEAD request (requests.head(url)), but AWS doesn't return the content-length in response to a HEAD request. So instead I used requests.get(url, stream=True), which gets the headers for a GET request but doesn't actually download the content until it is specifically requested. However, using stream=True means the connection stays open until it is explicitly closed, which is why it's now wrapped in a with statement.
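Concretely, the header retrieval could be factored like this (using the same hypothetical get_content_length helper as in the sketch above):

```python
import requests

def get_content_length(download_url):
    # AWS didn't return Content-Length for a plain HEAD request on the pre-signed
    # URL, so open a streaming GET instead: the headers come back immediately,
    # the body is only fetched if read, and the with-block closes the connection.
    with requests.get(download_url, stream=True) as r:
        value = r.headers.get("Content-Length")
    return int(value) if value is not None else None
```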

You can test the file size limit locally by reducing the MAX_FILESIZE_BYTES value and trying to download a video file that you know is larger. You will probably want to test this with existing response/video data (e.g. downloaded from the staging site), or you can run through a local study using an older Pipe version of the experiment runner, because the newer RecordRTC system does not create new Video objects in a local database (yet).

To test the absence of the content-length info, you can change with requests.get(download_url, stream=True) as r: to with requests.head(download_url) as r:.

becky-gilbert commented 3 months ago

@ianchandlercampbell @okaycj this is a draft because I'm still writing tests, but in the meantime it might be useful to get your feedback.

ianchandlercampbell commented 3 months ago

@becky-gilbert This seems like good progress! I think the copy for the warning messages makes sense. My question for now is: Do we have limitations on whether we can verify file size on RecordRTC? We can talk more about this next week, and I'm happy to do some user-level testing then as well.

becky-gilbert commented 3 months ago

My question for now is: Do we have limitations on whether we can verify file size on RecordRTC? We can talk more about this next week, and I'm happy to do some user-level testing then as well.

Thanks Ian, good question. There's no need to do any separate testing with RecordRTC, because the videos are stored the same way and the download process works the same for Pipe and RecordRTC. I only wrote the note about that because, in order to test this code locally, you need to have a video object in your database (which you then try to download), and you can't add new video objects with a local RecordRTC study. So the RecordRTC vs Pipe issue here is just about generating the local data that's needed to test this, rather than any differences that affect the way this download code works. I hope that makes sense. But yes, of course we can talk next week. And user testing on the staging site would be helpful!

sonarcloud[bot] commented 2 months ago

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

becky-gilbert commented 2 months ago

@okaycj thanks for the review!

I responded to your comments and modified the code style for the S3 request here: https://github.com/lookit/lookit-api/pull/1408/commits/93a626e945487806ab57a3f081963ff3c5abcab0. I agree this is much cleaner, and it clarifies the only difference between the two conditions - thanks!!