tnc-ca-geo / animl-base

Application deployed on field computers to integrate Buckeye X80 wireless camera traps with Animl

Figure out strategy for purging images from watched directory #20

Open nathanielrindlaub opened 3 years ago

nathanielrindlaub commented 3 years ago

We could either (a) delete images immediately once they're uploaded to S3, or (b) set a storage threshold and, once that's reached, delete the oldest files as new ones come in.
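For reference, a minimal sketch of option (a), assuming the uploader can tell us when a file has landed in S3 (the hook name is hypothetical):

```typescript
import { unlink } from 'fs/promises';

// Hypothetical hook called by the uploader once S3 confirms it received a file.
// Deleting only after a confirmed upload avoids losing images that never made it up.
async function onUploadConfirmed(localPath: string): Promise<void> {
  try {
    await unlink(localPath); // option (a): purge immediately once it's safely in S3
  } catch (err) {
    console.error(`failed to delete ${localPath}`, err);
  }
}
```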

nathanielrindlaub commented 6 days ago

It appears as though we run into memory issues if the directory being watched for new images (/home/animl/data/<base name>/cameras/) contains too many images. On the Diablo computer, we started maxing out the original PM2 max_memory_restart threshold of 1 GB once there were ~200k images (totaling 38 GB). Once I removed all of the image files, the memory usage dropped to almost nothing.

nathanielrindlaub commented 6 days ago

Instead of deleting images, we could also just move them to another directory that's not being watched. That would solve the memory issue, but it would eventually exhaust disk space.
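A rough sketch of that idea (the archive path is just a placeholder):

```typescript
import { mkdir, rename } from 'fs/promises';
import * as path from 'path';

// Placeholder archive location outside the watched tree.
const ARCHIVE_DIR = '/home/animl/data/archive';

// Move an already-uploaded image out of the watched directory instead of
// deleting it. The watcher's file set stays small, but disk usage still grows.
async function archiveImage(watchedPath: string): Promise<void> {
  await mkdir(ARCHIVE_DIR, { recursive: true });
  await rename(watchedPath, path.join(ARCHIVE_DIR, path.basename(watchedPath)));
}
```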

postfalk commented 6 days ago

That is an interesting problem. It raises some questions:

  1. What is the purpose of keeping the images? Backup? Caching? Etc.
  2. Moving them into a different directory would be a good idea.
  3. I assume that the stability of the system is more important than keeping old images, but keeping the images would be nice. What is often done in such cases is to delete files randomly: if your disk usage reaches, say, 90%, you might randomly delete 1 out of every 5 images older than a certain date (see the sketch below).
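A rough sketch of that random-thinning idea; the directory, cutoff date, and keep ratio are placeholders, and the disk-usage check is omitted:

```typescript
import { readdir, stat, unlink } from 'fs/promises';
import * as path from 'path';

// Once disk usage crosses a threshold (check not shown), randomly delete
// roughly 1 in 5 images that are older than a cutoff date.
async function thinOldImages(dir: string, cutoff: Date, keepRatio = 0.8): Promise<void> {
  for (const name of await readdir(dir)) {
    const filePath = path.join(dir, name);
    const info = await stat(filePath);
    if (info.isFile() && info.mtime < cutoff && Math.random() > keepRatio) {
      await unlink(filePath); // drop ~20% of the older images at random
    }
  }
}
```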
nathanielrindlaub commented 6 days ago

Good questions @postfalk. The only purpose for retaining them would be for backup I suppose, and if we are certain they made it to S3, I don't know how important that is.

I like the idea of random deletion above some threshold, but once the backup is incomplete anyway, and we're confident that any images queued for deletion are already in Animl, it's a little hard to picture the scenario in which those backup images would come in handy.

A more plausible scenario is that a base station goes offline so it can't upload images for a long time but is still receiving them. In that case we would want to make sure that there is ample disk space and memory to handle a long internet outage, so I guess maintaining a lot of headroom on both counts would be the best strategy.

postfalk commented 6 days ago

Agreed. However, a good design would do something sensible when we hit the boundary. In that case it is really a decision: do we want an increasingly blurry picture of the past, or do we just throw it out in favor of new incoming data?

nathanielrindlaub commented 5 days ago

Ok, so given that we're going to keep the threshold pretty low (say, 25k images), I'm leaning towards just deleting the oldest images once we reach that threshold. If we were instead to randomly remove 1 out of every 5 images, we could retain a somewhat longer record of the data (i.e., the time extent of the data would be ~25% longer) at the cost of that data being 20% blurrier, right?

I don't have strong feelings, but making a hard cutoff and retaining an accurate backup of the 25k most recent images that were successfully uploaded seems simplest and is a pretty reasonable strategy. Something like the sketch below. What do you think?
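A minimal sketch of the hard-cutoff approach, assuming everything in the watched directory has already made it to S3 (the threshold constant and function name are placeholders):

```typescript
import { readdir, stat, unlink } from 'fs/promises';
import * as path from 'path';

const MAX_IMAGES = 25_000; // proposed threshold, still to be confirmed

// Keep only the newest MAX_IMAGES files in the watched directory and delete
// the oldest beyond that. Assumes every file here was already uploaded to S3.
async function purgeOldest(dir: string): Promise<void> {
  const names = await readdir(dir);
  const entries = await Promise.all(
    names.map(async (name) => {
      const filePath = path.join(dir, name);
      return { filePath, info: await stat(filePath) };
    })
  );
  const files = entries
    .filter((e) => e.info.isFile())
    .sort((a, b) => b.info.mtimeMs - a.info.mtimeMs); // newest first
  for (const { filePath } of files.slice(MAX_IMAGES)) {
    await unlink(filePath);
  }
}
```

This could run on a timer or be triggered after each batch of uploads, whichever fits the existing process better.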

postfalk commented 5 days ago

Sure. One useful consideration in the math might be that we usually shoot more than one image of the same animal, so the information we retain after random thinning would still be more complete than if each image were entirely independent. BUT I think deleting the oldest ones is sensible as well.