tnc-ca-geo / animl-base

Application deployed on field computers to integrate Buckeye X80 wireless camera traps with Animl

Figure out strategy for managing memory and disk space #20

Open nathanielrindlaub opened 3 years ago

nathanielrindlaub commented 3 years ago

We could either (a) delete images immediately once they're uploaded to S3 or (b) set a storage threshold, and once that's reached, delete the oldest files as new ones come in.
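
A minimal sketch of option (a), assuming the AWS SDK v3 S3 client and hypothetical bucket/region/paths (animl-base's actual upload code isn't shown in this thread); the local file is only deleted once the upload has resolved:

```js
// Option (a) sketch: delete the local image only after a confirmed S3 upload.
// The bucket name, region, and paths are assumptions.
const { readFile, unlink } = require('node:fs/promises');
const path = require('node:path');
const { S3Client, PutObjectCommand } = require('@aws-sdk/client-s3');

const s3 = new S3Client({ region: 'us-west-2' });

async function uploadAndDelete(filePath, bucket = 'animl-images') {
  const Body = await readFile(filePath);
  const Key = path.basename(filePath);

  // Only remove the local copy once the PUT has succeeded.
  await s3.send(new PutObjectCommand({ Bucket: bucket, Key, Body }));
  await unlink(filePath);
}
```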

nathanielrindlaub commented 2 months ago

It appears that we run into memory issues if the directory being watched for new images (/home/animl/data/<base name>/cameras/) contains too many images. On the Diablo computer, we started maxing out the original PM2 max_memory_restart threshold of 1GB once there were ~200k images (totaling 38GB). Once I removed all of the image files, memory usage dropped to almost nothing.
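
For reference, a minimal PM2 ecosystem file sketch showing where that memory-restart threshold lives (the app name and script path are assumptions; the real config isn't shown here):

```js
// ecosystem.config.js -- minimal sketch; name and script path are assumptions.
module.exports = {
  apps: [
    {
      name: 'animl-base',
      script: './src/index.js',
      // PM2 restarts the process if its memory use exceeds this value.
      max_memory_restart: '1G',
    },
  ],
};
```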

nathanielrindlaub commented 2 months ago

Instead of deleting images, we could also just move them to another directory that's not being watched. That would solve the memory issue but would eventually exhaust disk space.

postfalk commented 2 months ago

That is an interesting problem. It raises some questions:

  1. What is the purpose of keeping the images? Backup? Caching? Etc.
  2. Moving them into a different directory would be a good idea.
  3. I assume that the stability of the system is more important than keeping old images, but keeping the images would be nice. What is often done in such cases is to delete files randomly: if your disk usage reaches, say, 90%, you might just randomly delete one out of five images older than a certain date.
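
A rough sketch of that idea, assuming Node ≥ 18.15 (for fs.promises.statfs) and hypothetical paths, threshold, and cutoff date:

```js
// Sketch: when disk usage crosses a threshold, randomly delete ~1 in 5 images
// older than a cutoff date. Paths, threshold, and cutoff are all assumptions.
const { readdir, stat, statfs, unlink } = require('node:fs/promises');
const path = require('node:path');

const IMAGE_DIR = '/home/animl/data/cameras';          // hypothetical
const USAGE_THRESHOLD = 0.9;                            // 90% of the disk
const CUTOFF = new Date(Date.now() - 30 * 24 * 3600e3); // e.g. older than 30 days

async function cullRandomly() {
  const fsStats = await statfs(IMAGE_DIR);
  const used = 1 - fsStats.bavail / fsStats.blocks;
  if (used < USAGE_THRESHOLD) return;

  for (const name of await readdir(IMAGE_DIR)) {
    const filePath = path.join(IMAGE_DIR, name);
    const { mtime } = await stat(filePath);
    // Delete roughly one out of five images older than the cutoff.
    if (mtime < CUTOFF && Math.random() < 0.2) await unlink(filePath);
  }
}
```
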
nathanielrindlaub commented 2 months ago

Good questions @postfalk. The only purpose for retaining them would be for backup I suppose, and if we are certain they made it to S3, I don't know how important that is.

I like the idea of random deletion above some threshold, but once the backup is incomplete anyway, and we're confident that any images queued for deletion are already in Animl, it's a little hard to picture a scenario in which those backup images would come in handy.

A more plausible scenario is that a base station goes offline so it can't upload images for a long time but is still receiving them. In that case we would want to make sure that there is ample disk space and memory to handle a long internet outage, so I guess maintaining a lot of headroom on both counts would be the best strategy.

postfalk commented 2 months ago

Agreed. However, a good design would do something sensible when we hit the boundary, and at that point it really comes down to a decision: do we want an increasingly blurry picture of the past, or do we just throw it out in favor of new incoming data?

nathanielrindlaub commented 2 months ago

Ok, so given that we're going to keep the threshold pretty low (say 25k images), I'm leaning towards just deleting the oldest images once we reach that threshold. If we were instead to randomly remove 1 out of every 5 images, we could retain a somewhat longer record of the data (keeping 25k images at 80% density covers 25k / 0.8 ≈ 31k images' worth of time, i.e. roughly 25% longer) at the cost of that data being 20% blurrier, right?

I don't have strong feelings, but making a hard cutoff and retaining an accurate backup of the 25k most recent images that were successfully uploaded seems simplest and is a pretty reasonable strategy. What do you think?

postfalk commented 2 months ago

Sure. One useful consideration in the math might be that we usually shoot more than one image of the same animal, so the information we retain would still be more precise than if the images were entirely random. BUT I think deleting the oldest ones is sensible as well.

nathanielrindlaub commented 1 month ago

Ok, after a bit more thought, I think this is the path forward I'm going to pursue. One important thing to note is that there are two separate but related problems here: the first is that chokidar consumes a lot of memory as the number of watched files grows, and the second is how to manage available disk space both during normal operation (in which images are getting uploaded but we may want to retain a backup of uploaded images) and during internet outages (in which images will pile up on the drive and eventually exhaust the disk space).

I think the following solution would address both:

  1. similarly to how we manage images in S3 once they reach the cloud, the image files on the base station would live in one of three directories:
    • an /ingestion directory to which new images get written and that's being watched for new files
    • a /queue directory to which files get moved as soon as they are detected in /ingestion (this would solve the memory issue by keeping the number of watched files very low)
    • a /backup directory to which images that were successfully uploaded get moved
  2. both the /queue and the /backup directories should have some combined maximum storage threshold, which we'll check on some schedule (perhaps every 6 hours; see the cleanup sketch after this list).
  3. if they have maxed out the allocated space and there are images in /backup, remove the oldest images in /backup until we're back below the threshold
  4. if they have maxed out the allocated space and there are NO images in /backup, that likely means there's a long-lasting internet outage and the /queue is using all of the available space, so we need to start culling the images in the /queue at random until we're back below the threshold.
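
A sketch of that cleanup pass (steps 2–4 above), assuming Node.js, hypothetical paths, and an arbitrary shared storage budget; a real implementation would also need to handle subdirectories and errors:

```js
// Sketch of the scheduled cleanup pass (steps 2-4 above). Paths, the combined
// storage threshold, and the check interval are all assumptions.
const { readdir, stat, unlink } = require('node:fs/promises');
const path = require('node:path');

const QUEUE_DIR = '/home/animl/data/queue';    // hypothetical
const BACKUP_DIR = '/home/animl/data/backup';  // hypothetical
const MAX_BYTES = 40 * 1024 ** 3;              // e.g. 40GB shared budget
const CHECK_INTERVAL_MS = 6 * 60 * 60 * 1000;  // every 6 hours

// List files in a directory with their size and modification time.
async function listFiles(dir) {
  const names = await readdir(dir);
  return Promise.all(
    names.map(async (name) => {
      const filePath = path.join(dir, name);
      const { size, mtimeMs } = await stat(filePath);
      return { filePath, size, mtimeMs };
    })
  );
}

async function cleanup() {
  const queue = await listFiles(QUEUE_DIR);
  const backup = await listFiles(BACKUP_DIR);
  let total = [...queue, ...backup].reduce((sum, f) => sum + f.size, 0);

  // Step 3: delete the oldest backed-up (already uploaded) images first.
  backup.sort((a, b) => a.mtimeMs - b.mtimeMs);
  while (total > MAX_BYTES && backup.length > 0) {
    const oldest = backup.shift();
    await unlink(oldest.filePath);
    total -= oldest.size;
  }

  // Step 4: if the backup is exhausted (likely a long outage), cull the
  // un-uploaded queue at random until we're back under the threshold.
  while (total > MAX_BYTES && queue.length > 0) {
    const i = Math.floor(Math.random() * queue.length);
    const [victim] = queue.splice(i, 1);
    await unlink(victim.filePath);
    total -= victim.size;
  }
}

setInterval(() => cleanup().catch(console.error), CHECK_INTERVAL_MS);
```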

Basically, have some fixed amount of disk space shared between the un-uploaded images in /queue and the already-uploaded images in /backup. During normal operation we'd be using pretty much all of that space to back up the most recent images, but if the base goes offline, the queue starts to build up, and we need to make more space, we'd prioritize deleting the oldest files in /backup until we've exhausted all backed-up images, then delete images from /queue at random.
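
For step 1 and the memory side of this, a minimal chokidar sketch, assuming hypothetical directory names: the watcher only ever sees /ingestion, and each new file is immediately moved into the unwatched /queue.

```js
// Sketch of step 1: watch only /ingestion and immediately move new files to
// the unwatched /queue, keeping the watched file count (and memory) low.
// Directory names are assumptions.
const chokidar = require('chokidar');
const { rename } = require('node:fs/promises');
const path = require('node:path');

const INGESTION_DIR = '/home/animl/data/ingestion'; // hypothetical
const QUEUE_DIR = '/home/animl/data/queue';         // hypothetical

const watcher = chokidar.watch(INGESTION_DIR, {
  ignoreInitial: false,
  awaitWriteFinish: true, // wait until the camera finishes writing the file
});

watcher.on('add', async (filePath) => {
  const dest = path.join(QUEUE_DIR, path.basename(filePath));
  try {
    // Moving the file out of the watched directory removes it from
    // chokidar's internal bookkeeping, so memory stays bounded.
    await rename(filePath, dest);
    // ...enqueue `dest` for upload to S3 here...
  } catch (err) {
    console.error(`Failed to move ${filePath}:`, err);
  }
});
```

In practice, the upload worker would then pull from /queue and move files into /backup after a confirmed S3 upload, which is where the cleanup pass sketched above takes over.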