You may want to try the CachedContentFactory, cf. https://github.com/vitrivr/vitrivr-engine/wiki/Documentation#content-factory
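Assuming the same context layout as the configurations later in this thread, that's a one-line switch in the context block (a sketch; everything else stays as in your configuration):

"context": {
  "contentFactory": "CachedContentFactory",
  ...
}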
Other than that, I guess @net-cscience-raphael could investigate possible issues related to the MemoryControlledFileSystemEnumerator.
Can you provide some more information for further investigation? If there is an error somewhere in the pipeline that results in not all memory being freed, this is expected behavior.
So I have run a few experiments of my own, extracting a video collection using two features (averagecolor and clip). The pipeline is pretty straightforward, without any special cases: it branches off after decoding and extracts features in parallel.
{
  "schema": "vitrivr",
  "context": {
    "contentFactory": "InMemoryContentFactory",
    "resolverName": "disk",
    "local": {
      "enumerator": {
        "path": "/Volumes/VBS24/html/media/V3C",
        "depth": "1"
      },
      "decoder": {
        "timeWindowMs": "1000"
      },
      "thumbs": {
        "maxSideResolution": "500",
        "mimeType": "JPG"
      },
      "filter": {
        "type": "SOURCE:VIDEO"
      }
    }
  },
  "operators": {
    "enumerator": {
      "type": "ENUMERATOR",
      "factory": "FileSystemEnumerator",
      "mediaTypes": ["VIDEO"]
    },
    "decoder": {
      "type": "DECODER",
      "factory": "VideoDecoder"
    },
    "selector": {
      "type": "TRANSFORMER",
      "factory": "MiddleContentAggregator"
    },
    "avgColor": {
      "type": "EXTRACTOR",
      "fieldName": "averagecolor"
    },
    "clip": {
      "type": "EXTRACTOR",
      "fieldName": "clip"
    },
    "file_metadata": {
      "type": "EXTRACTOR",
      "fieldName": "file"
    },
    "time_metadata": {
      "type": "EXTRACTOR",
      "fieldName": "time"
    },
    "video_metadata": {
      "type": "EXTRACTOR",
      "fieldName": "video"
    },
    "thumbs": {
      "type": "EXPORTER",
      "exporterName": "thumbnail"
    },
    "filter": {
      "type": "TRANSFORMER",
      "factory": "TypeFilterTransformer"
    }
  },
  "operations": {
    "enumerator": {"operator": "enumerator"},
    "decoder": {"operator": "decoder", "inputs": ["enumerator"]},
    "selector": {"operator": "selector", "inputs": ["decoder"]},
    "averagecolor": {"operator": "avgColor", "inputs": ["selector"]},
    "clip": {"operator": "clip", "inputs": ["selector"]},
    "thumbnails": {"operator": "thumbs", "inputs": ["selector"]},
    "time_metadata": {"operator": "time_metadata", "inputs": ["selector"]},
    "filter": {"operator": "filter", "inputs": ["averagecolor", "clip", "thumbnails", "time_metadata"], "merge": "COMBINE"},
    "video_metadata": {"operator": "video_metadata", "inputs": ["filter"]},
    "file_metadata": {"operator": "file_metadata", "inputs": ["video_metadata"]}
  },
  "output": ["file_metadata"]
}
Here are the key insights: Fundamentally, I don't think there is a memory leak or memory allocation problem in vitrivr-engine. At least I couldn't spot one. However, one must be conscious of how the pipeline works and what the consequences of certain pipeline design decisions are.

- Most of the memory is consumed by InMemoryImageContent (the frames) and FloatVectorDescriptor (the extracted features). These objects require a lot of space.
- That memory is only freed once a Retrievable reaches the PersistingSink. In a typical video extraction scenario, that's the case when a single video has been processed completely.

This basic behaviour is illustrated by the graphics.
Now there are several knobs to tune the memory consumption of the extraction pipeline:

- Use CachedImageContent instead of InMemoryContent. These will swap data to disk if memory pressure builds.
- Reduce the number of Retrievables generated per video. For the VideoDecoder, this can be adjusted using the timeWindowMs parameter, which governs the time covered by a single Retrievable. A higher value will lead to fewer Retrievables, each covering a larger portion of the video. Consequently, fewer features are being generated.
- Reduce the content kept per Retrievable. Instead of keeping all the frames of a retrievable around (and in memory), just keep the one required using FirstContentAggregator, LastContentAggregator or MiddleContentAggregator (or some implementation of your own).

(See the sketch below for how these knobs map onto a configuration.) That being said: in order to be able to debug your issue, we really need your extraction pipeline configuration.
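For illustration, a minimal sketch of the relevant configuration fragments (the 10_000 ms window is just an example value; the rest of the configuration is omitted):

"context": {
  "contentFactory": "CachedContentFactory",
  "local": {
    "decoder": {
      "timeWindowMs": "10_000"
    }
  }
},
"operators": {
  "selector": {
    "type": "TRANSFORMER",
    "factory": "MiddleContentAggregator"
  }
}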
One additional comment, just to illustrate my point: if I remove the MiddleContentAggregator from the above configuration, memory starts to become an issue as well, because for every second of video, 25 frames are kept in memory and features are extracted for all 25 frames. This leads to 50 Descriptors and 25 ContentElements per Retrievable, which are kept around in memory until the entire video has been processed.
The video is 5 minutes long and 8GB are not enough to handle this. But this is not an application issue; it's instructing the engine to do something it does not have the resources for.
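To put rough numbers on this, assuming 25 fps and timeWindowMs = 1000 as above: a 5-minute video produces 300 Retrievables, each holding 25 frames and 50 descriptors, i.e. 300 × 25 = 7,500 decoded frames and 300 × 50 = 15,000 descriptors that stay in memory until the whole video has been processed.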
Thanks for the great insight, tests and tips. This is the old pipeline config I used (similar to the one used in the example): video-pipeline.json. Here is a small sample of our dataset with our shortest (3:33 min) and longest (18:43 min) video: example_dataset link
I'm going to start a new extraction now and try different tweaks, like using the CachedContentFactory or MiddleContentAggregator, or adjusting the timeWindowMs. Afterwards I will share my results here.
My apologies for the late answer; I only work one day a week on this project.
Hey! Any update on this?
I played with your configuration myself and noticed that the extraction definition is incorrect. Here is a better version, which I successfully used to extract all three videos using 12GB of RAM:
{
  "schema": "vitrivr",
  "context": {
    "contentFactory": "CachedContentFactory",
    "resolverName": "disk",
    "local": {
      "enumerator": {
        "path": "/Users/rgasser/Downloads/example_dataset",
        "depth": "3",
        "skip": "0",
        "limit": "20"
      },
      "decoder": {
        "timeWindowMs": "30_000"
      },
      "filter": {
        "type": "SOURCE:VIDEO"
      }
    }
  },
  "operators": {
    "enumerator": {
      "type": "ENUMERATOR",
      "factory": "FileSystemEnumerator",
      "mediaTypes": ["VIDEO"]
    },
    "decoder": {
      "type": "DECODER",
      "factory": "VideoDecoder"
    },
    "selector": {
      "type": "TRANSFORMER",
      "factory": "LastContentAggregator"
    },
    "averagecolor": {
      "type": "EXTRACTOR",
      "fieldName": "averagecolor"
    },
    "clip": {
      "type": "EXTRACTOR",
      "fieldName": "clip"
    },
    "dino": {
      "type": "EXTRACTOR",
      "fieldName": "dino"
    },
    "whisper": {
      "type": "EXTRACTOR",
      "fieldName": "whisper"
    },
    "ocr": {
      "type": "EXTRACTOR",
      "fieldName": "ocr"
    },
    "meta-file": {
      "type": "EXTRACTOR",
      "fieldName": "file"
    },
    "meta-video": {
      "type": "EXTRACTOR",
      "fieldName": "video"
    },
    "meta-time": {
      "type": "EXTRACTOR",
      "fieldName": "time"
    },
    "thumbnail": {
      "type": "EXPORTER",
      "exporterName": "thumbnail"
    },
    "filter": {
      "type": "TRANSFORMER",
      "factory": "TypeFilterTransformer"
    }
  },
  "operations": {
    "stage-0-0": {"operator": "enumerator"},
    "stage-1-0": {"operator": "decoder", "inputs": ["stage-0-0"]},
    "stage-2-0": {"operator": "selector", "inputs": ["stage-1-0"]},
    "stage-3-0": {"operator": "clip", "inputs": ["stage-2-0"]},
    "stage-3-1": {"operator": "dino", "inputs": ["stage-2-0"]},
    "stage-3-2": {"operator": "ocr", "inputs": ["stage-2-0"]},
    "stage-3-3": {"operator": "averagecolor", "inputs": ["stage-2-0"]},
    "stage-3-4": {"operator": "thumbnail", "inputs": ["stage-2-0"]},
    "stage-3-5": {"operator": "meta-time", "inputs": ["stage-2-0"]},
    "stage-3-6": {"operator": "whisper", "inputs": ["stage-2-0"]},
    "stage-4-0": {"operator": "filter", "inputs": ["stage-3-6", "stage-3-5", "stage-3-4", "stage-3-3", "stage-3-2", "stage-3-1", "stage-3-0"], "merge": "COMBINE"},
    "stage-5-0": {"operator": "meta-file", "inputs": ["stage-4-0"]},
    "stage-6-0": {"operator": "meta-video", "inputs": ["stage-5-0"]}
  },
  "output": ["stage-6-0"]
}
I tried different configs as well and was able to extract all videos by increasing the timeWindowMs. The PC I'm using has 64GB of RAM, so that should not be a problem. I will do more tests tomorrow and try your config on the whole dataset of 19 videos, then update here and most likely close this issue. Thanks for the help and guidance.
Thanks to your configuration I was able to fully extract all videos without any error. The process created 19,379 thumbnails; before, it would stop at around 9,000 (with the same timeWindowMs). While testing different configurations I didn't come across any other major issues or notable takeaways for this issue.
Could you highlight which part of the extraction definition was incorrect before closing this issue?
Glad to hear.
Well, "incorrect" is a bit of a misnomer. Let's say "not ideal". Fundamentally, there are two (somewhat contradictory) paradigms that are used during extraction and that one needs to be aware of:

- Retrievables are objects that describe part of a media file (e.g., a segment or the entire file itself). They contain all the Descriptors and potentially Relationships to other Retrievables. Typically, Retrievables are shared between operators; that is, different operators may see and edit the same instance at different points in time.
- The extraction pipeline itself processes a stream of Retrievables (that can branch and merge). That stream can be shaped using certain operators (e.g., filters). That is, the entire object graph generated during an extraction is kept in memory. Every Retrievable that reaches the end is persisted with ALL its relationships, descriptors etc.
Since the extraction process for certain media types defines explicit relationships between Retrievables (e.g., video segment to video file), one might end up in a situation where a Retrievable is persisted twice because of the relationships between them. Hence, when designing the pipeline, one must shape the stream such that only the desired Retrievables make it to the end. In the case of a video, it makes sense to persist on a per-file basis.
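In the configurations above, this shaping is done by the TypeFilterTransformer together with its type setting in the context; as I read it, only Retrievables matching SOURCE:VIDEO, i.e. the per-file ones, pass this filter:

"filter": {
  "type": "TRANSFORMER",
  "factory": "TypeFilterTransformer"
}

with the matching context entry:

"filter": {
  "type": "SOURCE:VIDEO"
}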
When designing a pipeline, one should therefore try to think in terms of inputs and outputs:

- The VideoDecoder generates one Retrievable per temporal segment. Those contain all the ImageContent and AudioContent. Once a video file has been processed completely, the VideoDecoder emits one Retrievable for the file (without any content). This Retrievable holds a relationship to all its temporal segments.
- averagecolor, clip, dino, ocr and whisper only operate on Retrievables with Content.
- meta-file and meta-video only process this last (per-file) Retrievable emitted at the end.
- We only want this per-file Retrievable to reach the end. Since it holds references to all the segments, the entire graph will be persisted exactly once.

Therefore, the following setup makes sense in this case:

- We apply averagecolor, clip, dino, ocr and whisper right after the aggregation. These operations can go in parallel, which is why they share a single source.
- We then filter for the per-file Retrievable. In this step we also aggregate the different branches with the COMBINE logic, which makes sure that a Retrievable is only emitted downstream once it has been received on all the inputs.
- meta-file and meta-video come after this filter step, since they're only interested in the per-file Retrievable anyway. One could parallelise these, but it's hardly worth it given that these features are very low-effort to generate.
- The output stage therefore only sees the desired per-file Retrievables.
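Concretely, these are the corresponding operations from the configuration above: the COMBINE filter gates the seven parallel branches, and the per-file metadata extractors follow it in series.

"stage-4-0": {"operator": "filter", "inputs": ["stage-3-6", "stage-3-5", "stage-3-4", "stage-3-3", "stage-3-2", "stage-3-1", "stage-3-0"], "merge": "COMBINE"},
"stage-5-0": {"operator": "meta-file", "inputs": ["stage-4-0"]},
"stage-6-0": {"operator": "meta-video", "inputs": ["stage-5-0"]}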
I hope this makes sense.
Thanks for the detailed explanation! The pipeline is a lot clearer to me now. I removed my example dataset because I'm not able to share it permanently.
I get an OutOfMemoryError while trying to extract multiple videos. Their durations range from 3 minutes to 17 minutes. I also tried the extraction with the MemoryControlledFileSystemEnumerator. The following screenshot shows the original error with the FileSystemEnumerator:
When I use the MemoryControlledFileSystemEnumerator, the extraction still stops around the same point (same number of thumbnails), but shows the following log:
I also checked the memory limit of my FES Docker container (and CottontailDB), but it looks like there is plenty of space left: