tensorflow / datasets

TFDS is a collection of datasets ready to use with TensorFlow, Jax, ...
https://www.tensorflow.org/datasets
Apache License 2.0

Custom video dataset encoding/serialize uses all memory, process killed. How to fix? #5499

Open cleong110 opened 2 months ago

cleong110 commented 2 months ago

What I need help with / What I was wondering

I want to load a dataset containing these videos without this happening (Colab notebook for replicating).

...How can I edit my dataset loader to use less memory when encoding videos?

Background: I am trying to load a custom dataset with a Video feature. When I try to tfds.load() it, or even just download_and_prepare, RAM usage goes up very high and then the process gets killed. For example this notebook will crash if allowed to run, though with a High-RAM instance it may not. It seems it is using over 30GB of memory to encode one or two 10 MB videos. I would like to know how to edit/update this custom dataset so that it will not use so much memory.
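For context, a minimal sketch of the kind of builder involved (not the actual dataset's code; the class name, paths, and frame shape are placeholders):

```python
# Minimal sketch of a custom TFDS dataset with a Video feature.
import tensorflow_datasets as tfds

VIDEO_PATHS = ["/path/to/some_sign_language_video.mp4"]  # hypothetical paths


class MyVideoDataset(tfds.core.GeneratorBasedBuilder):
  """Hypothetical custom video dataset."""

  VERSION = tfds.core.Version("1.0.0")

  def _info(self) -> tfds.core.DatasetInfo:
    return tfds.core.DatasetInfo(
        builder=self,
        features=tfds.features.FeaturesDict({
            # (num_frames, height, width, channels); None = variable frame count.
            "video": tfds.features.Video(shape=(None, 480, 640, 3)),
            "id": tfds.features.Text(),
        }),
    )

  def _split_generators(self, dl_manager):
    return {"train": self._generate_examples(VIDEO_PATHS)}

  def _generate_examples(self, paths):
    for i, path in enumerate(paths):
      # Handing the Video feature a file path makes it decode all frames with
      # ffmpeg (PNG by default) before the example is serialized, which is
      # where the memory described above goes.
      yield i, {"video": path, "id": f"video_{i}"}
```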

What I've tried so far

I did a bunch of debugging and tracing of the problem with memray, etc. See this notebook and this issue for detailed analysis including a copy of the memray report.

I tried various ideas in the notebook, including loading just a slice, editing the buffer size, and switching from .load() to download_and_prepare().
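For reference, roughly what those attempts look like (a sketch only; the dataset name is a placeholder, and the buffer-size tweak is omitted because the exact knob isn't shown here):

```python
import tensorflow_datasets as tfds

# Loading only a slice of the split does not help: the memory is spent while
# *preparing* (decoding, re-encoding, serializing) the data, not while reading it.
ds = tfds.load("my_video_dataset", split="train[:1]")

# Preparing directly, instead of going through tfds.load(), hits the same
# encoding step and the same RAM spike.
builder = tfds.builder("my_video_dataset")
builder.download_and_prepare()
```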

Finally I traced the problem to the serializing and encoding steps (see this comment), which were allocating many GiB of memory to encode even one 10MB video.

I discovered that even one 10MB video was extracted to over 13k video frames, taking up nearly 5GiB of space. Then the serializing step would take up 14-15 GiB, the encoding step another 14-15 GiB, and so the process would be killed.

Relevant items:

It would be nice if...

Environment information: I've tested this on Colab and a few other Ubuntu workstations. High-RAM Colab instances seem to have enough memory to get past this.

tomvdw commented 2 months ago

Hey,

Thanks for your question. Those are some cool datasets! I'm very sorry to hear that you're running into these problems.

We brainstormed a bit and came up with a couple of ideas:

  1. 14-15GB for 13k frames means that each frame takes up ~1MB. IIUC ffmpeg extracts frames as PNG files. Switching to JPG could maybe bring ~5x savings. However, you'd still end up with ~3GB for a 10MB video. Not great.
  2. Store the encoded video in the dataset. This means that the video will stay 10MB, but that the decoding needs to happen when you use the data. I'm not sure if using ffmpeg to decode when training would be a good solution (i.e. running a separate tool that writes 14-15 GB to disk, then reading those 14-15 GB from disk). Alternatively, there seem to be Python libraries that can read videos, e.g. OpenCV. (A rough sketch of both ideas follows below.)
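For illustration, hedged sketches of both ideas; the shapes and feature names are placeholders, and idea 2 assumes a plain string feature can hold the raw bytes:

```python
import tensorflow_datasets as tfds

# Idea 1: keep the Video feature but store frames as JPEG rather than the
# default PNG, via the feature's `encoding_format` argument (lossy, roughly 5x smaller).
video_as_jpeg = tfds.features.Video(
    shape=(None, 480, 640, 3),   # placeholder shape
    encoding_format="jpeg",
)

# Idea 2: skip decoding at prepare time entirely. Store the raw .mp4 bytes in
# a string feature and decode them only when the data is read.
features_with_raw_video = tfds.features.FeaturesDict({
    "video_bytes": tfds.features.Text(),  # holds the encoded .mp4 bytes as-is
    "id": tfds.features.Text(),
})
```

With idea 2 the trade-off Tom describes next still applies: the decoding cost moves from prepare time to read time.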

Even if we make storing encoded videos work, I'm worried that the problem would just be moved to when the dataset is used. Namely, reading a single example would still require 14-15 GB of memory.

After the dataset has been prepared, how are you expecting that it will be used? Would it make sense to lower the FPS (it's 50 now right)? Will users only use chunks of the video? If so, perhaps you can store the chunks instead of the entire video.
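If chunking fits the use case, a sketch of what the generator side might look like. CHUNK_SIZE, the feature keys, and the OpenCV-based lazy frame reader are all assumptions, not the actual builder's code:

```python
import cv2  # assumes opencv-python is installed

CHUNK_SIZE = 64  # frames per example; an assumed window length


def iter_frames(path):
  """Yield frames one at a time so the whole video is never held in memory."""
  cap = cv2.VideoCapture(path)
  try:
    while True:
      ok, frame = cap.read()
      if not ok:
        return
      yield frame[..., ::-1]  # OpenCV returns BGR; flip to RGB
  finally:
    cap.release()


def generate_chunked_examples(paths):
  """Yield one (key, example) pair per CHUNK_SIZE-frame window of each video."""
  for path in paths:
    chunk, chunk_idx = [], 0
    for frame in iter_frames(path):
      chunk.append(frame)
      if len(chunk) == CHUNK_SIZE:
        yield f"{path}_{chunk_idx}", {"video": chunk, "id": path}
        chunk, chunk_idx = [], chunk_idx + 1
    if chunk:  # trailing partial chunk
      yield f"{path}_{chunk_idx}", {"video": chunk, "id": path}
```

Each example then serializes a fixed, small number of frames instead of the full 13k-frame video.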

Kind regards, Tom

cleong110 commented 2 months ago

Tom,

Thank you very much for your reply, and those ideas!

How will they be used:

I'm just getting into Sign Language Processing research, so I'm still not quite sure how I want to use these, but potentially for training translation models for signed language videos to spoken-language text, or for pretraining a vision transformer, or a bunch of other things. A few use-cases follow:

test out models on real data

I figured I'd start learning by at least running some inference pipelines with already-trained models, and got stuck on this step. I expected running a model to take significant memory, but didn't expect that loading the video would be the issue. I guess I'm successfully learning things! Specifically, I'd like to load in some videos and run this demo of a segmentation+recognition pipeline.

replicate other research on github

I went looking for examples of people using these, and it seems that not many use the video option, perhaps for this very reason: loading them is too cumbersome.

replicate WMT results, or at least re-run their models

One thing I wanted to do was replicate results from the WMT Sign Language Translation contests, which provide data in a number of formats including video, and a number of the submissions do use video as inputs instead of poses.

At least load the videos and then run pose estimation on them

Another thing I wanted to do was to be able to load the videos, run a pose estimator on them, and then use that, in order to potentially improve that part of the pipeline. A number of sign language translation models take pose keypoints as inputs, and I'd like to try those out.

At the very least I'd like to be able to do this! And then the pose methods may take less compute from there.

cleong110 commented 2 months ago

Regarding the suggestions:

  1. seems pretty easy to test, worth a shot!
  2. I admit I'm pretty ignorant about this: what is the encoding/decoding even doing, exactly? What would it mean to store the encoded video, decode later, etc.? I read about it a bit, and I think I understand that encoding compresses the frames into a video format, and decoding expands them back out into frames...? If so, is there a way to load in only a limited number of frames at a time? And why does the dataset need to encode when it's already encoded as an .mp4? (A sketch of the "limited number of frames" idea follows below.)
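To make the "limited number of frames" question concrete, a hedged sketch of reading only the first N frames from stored .mp4 bytes with OpenCV. OpenCV wants a file path, so the bytes are spilled to a temporary file first; the function name and defaults are made up:

```python
import tempfile

import cv2  # assumes opencv-python is installed
import numpy as np


def decode_first_frames(raw_mp4_bytes, max_frames=32):
  """Decode at most `max_frames` frames from raw .mp4 bytes, never the whole video."""
  with tempfile.NamedTemporaryFile(suffix=".mp4") as tmp:
    tmp.write(raw_mp4_bytes)
    tmp.flush()
    cap = cv2.VideoCapture(tmp.name)
    frames = []
    try:
      while len(frames) < max_frames:
        ok, frame = cap.read()
        if not ok:
          break
        frames.append(frame[..., ::-1])  # BGR -> RGB
    finally:
      cap.release()
  return np.stack(frames) if frames else np.empty((0, 0, 0, 3), dtype=np.uint8)
```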

I guess I'd like to be able to, and I don't know if any of this is feasible, but:

Did some further Googling, and I found a few things:

cleong110 commented 2 months ago

FPS lowering: that's another good idea. I think there might be a method in there to set that already; maybe tweaking that would reduce memory usage. I can try.
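A guess at what that might look like, assuming the knob is the Video feature's ffmpeg_extra_args; the exact ffmpeg flag, target FPS, and shape are assumptions:

```python
import tensorflow_datasets as tfds

# Ask ffmpeg to resample the video down to 10 fps while extracting frames, so
# a 50 fps clip produces roughly 5x fewer frames to encode and serialize.
video_low_fps = tfds.features.Video(
    shape=(None, 480, 640, 3),       # placeholder shape
    encoding_format="jpeg",          # combine with the JPEG idea from above
    ffmpeg_extra_args=("-r", "10"),  # assumed flag: output frame rate of 10
)
```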