nathanielrindlaub opened 3 years ago
It appears as though we run into memory issues if the directory being watched for new images (/home/animl/data/<base name>/cameras/) contains too many images. On the Diablo computer, we started maxing out the original PM2 max_memory_restart threshold of 1GB once there were ~200k images (totaling 38GB). Once I removed all of the image files, the memory usage dropped to almost nothing.
We could either (a) delete images immediately once they're uploaded to S3, or (b) set a storage threshold and, once that's reached, delete the oldest files as new ones come in. Instead of deleting images we could also just copy them to another directory that's not being watched, which would solve the memory issue but would eventually exhaust disk space.
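For reference, the 1GB ceiling mentioned above corresponds to PM2's `max_memory_restart` option. A minimal ecosystem-file sketch is below; the app name and script path are placeholders, not the actual animl-base config:

```js
// ecosystem.config.js — rough sketch only; the app name, script path, and the 1G
// limit here are assumptions taken from this thread, not the real animl-base setup.
module.exports = {
  apps: [
    {
      name: 'animl-base',        // hypothetical process name
      script: './src/index.js',  // hypothetical entry point
      max_memory_restart: '1G',  // PM2 restarts the process if its memory exceeds this
    },
  ],
};
```

Raising that limit only hides the symptom, though; the underlying issue is the number of files chokidar has to track.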
That is an interesting problem, and it raises some questions:
Good questions @postfalk. The only purpose for retaining them would be for backup, I suppose, and if we are certain they made it to S3, I don't know how important that is.
I like the idea of random deletion above some threshold, but once the backup is incomplete anyway, and we're confident that any images queued for deletion are already in Animl, it's a little hard to picture the scenario in which those backup images would come in handy.
A more plausible scenario is that a base station goes offline so it can't upload images for a long time but is still receiving them. In that case we would want to make sure that there is ample disk space and memory to handle a long internet outage, so I guess maintaining a lot of headroom on both counts would be the best strategy.
Agreed. However, a good design would be one that does something sensible when we hit the boundary, at which point it really is a decision: do we want an increasingly blurry picture of the past, or do we throw the past out in favor of new incoming data?
Ok so given that we're going to keep the threshold pretty low (maybe say 25k images), I'm leaning towards just deleting the oldest images once we reach that threshold. If we were instead to randomly remove 1 out of every 5 images, that would mean we could retain a somewhat longer record of the data (i.e. the time extent of the data would be ~25% longer) at the cost of that data being 20% blurrier, right?
I don't have strong feelings, but making a hard cutoff and retaining an accurate backup of the 25k most recent images that were successfully uploaded seems simplest and is a pretty reasonable strategy. What do you think?
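As a quick sanity check on that arithmetic (assuming a roughly constant capture rate):

$$
\frac{25{,}000 \text{ images}}{4/5 \text{ density}} = 31{,}250 \text{ images' worth of time} \;\approx\; 25\% \text{ longer}
$$

So 1-in-5 random thinning stretches the retained window by about 25%, in exchange for dropping 20% of the frames within it.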
Sure. One useful consideration in the math might be that we usually shoot more than one image of the same animal, so the information we retain would still be more precise than if the images were entirely random. But I think deleting the oldest ones is sensible as well.
Ok, after a bit more thought, I think this is the path forward I am going to pursue. One important thing to note is that there are two separate but related problems here: the first is that chokidar consumes a lot of memory as the number of watched files grows, and the second is how to manage available disk space both during normal operation (in which images are getting uploaded but we may want to retain a backup of uploaded images) and during internet outages (in which images will pile up on the drive and eventually exhaust the disk space).
I think the following solution would address both:
- `/ingestion` – directory to which new images get written and that's being watched for new files
- `/queue` – directory which files get moved to as soon as they are detected in `/ingestion` (this would solve the memory issue by keeping the number of watched files very low)
- `/backup` – directory to which images that were successfully uploaded get moved
- the `/queue` and the `/backup` directories should have some combined maximum storage threshold, which we'll check on some schedule (perhaps every 6 hours)
- if we're over that threshold and there are still images in `/backup`, remove the oldest images in `/backup` until we're back below the threshold
- if there are no images left in `/backup`, that likely means there's a long-lasting internet outage and the `/queue` is using all of the available space, so we need to start culling the images in the `/queue` at random until we're back below the threshold

Basically, have some fixed amount of disk space shared between the un-uploaded images in the `/queue` and the already-uploaded images in the `/backup`. During normal operation we'd be using pretty much all of that space for backing up the most recent images, but if the base goes offline and the queue starts to build up and we need to make more space, prioritize deleting the oldest files in `/backup` until we've exhausted all backed-up images, then delete images from the `/queue` at random.
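To visualize the two moving parts described above, here's a very rough Node sketch: a watcher that immediately moves files out of `/ingestion`, and a scheduled job that enforces a combined size cap across `/queue` and `/backup`. The absolute paths, the 20 GB cap, and the 6-hour interval are placeholders assumed for illustration, not actual animl-base values, and the upload step (moving files from `/queue` to `/backup` after a successful S3 upload) is omitted.

```js
// Sketch only — directory names and thresholds are assumptions from this thread.
const chokidar = require('chokidar');
const fs = require('node:fs/promises');
const path = require('node:path');

const INGESTION_DIR = '/home/animl/data/ingestion'; // hypothetical paths
const QUEUE_DIR = '/home/animl/data/queue';
const BACKUP_DIR = '/home/animl/data/backup';
const MAX_COMBINED_BYTES = 20 * 1024 ** 3; // e.g. 20 GB shared by /queue + /backup

// 1. Keep the watched directory nearly empty: move new files into /queue immediately.
chokidar
  .watch(INGESTION_DIR, { awaitWriteFinish: true })
  .on('add', async (filePath) => {
    await fs.rename(filePath, path.join(QUEUE_DIR, path.basename(filePath)));
  });

// Helper: list files in a directory with their size and modification time.
async function listFiles(dir) {
  const names = await fs.readdir(dir);
  return Promise.all(
    names.map(async (name) => {
      const filePath = path.join(dir, name);
      const { size, mtimeMs } = await fs.stat(filePath);
      return { filePath, size, mtimeMs };
    })
  );
}

// 2. On a schedule, enforce the combined storage threshold: delete the oldest
//    files in /backup first, then cull /queue at random if /backup is empty.
async function enforceThreshold() {
  const backup = (await listFiles(BACKUP_DIR)).sort((a, b) => a.mtimeMs - b.mtimeMs);
  const queue = await listFiles(QUEUE_DIR);
  let total = [...backup, ...queue].reduce((sum, f) => sum + f.size, 0);

  // Oldest-first from /backup (already uploaded, so safe to drop).
  while (total > MAX_COMBINED_BYTES && backup.length) {
    const oldest = backup.shift();
    await fs.unlink(oldest.filePath);
    total -= oldest.size;
  }

  // /backup exhausted but still over the cap: likely a long outage, so cull /queue at random.
  while (total > MAX_COMBINED_BYTES && queue.length) {
    const i = Math.floor(Math.random() * queue.length);
    const [victim] = queue.splice(i, 1);
    await fs.unlink(victim.filePath);
    total -= victim.size;
  }
}

setInterval(enforceThreshold, 6 * 60 * 60 * 1000); // every 6 hours
```

Moving files out of the watched directory as soon as they're detected keeps the set of files chokidar has to track tiny, which is what addresses the memory growth; the threshold job then only has to reason about disk space.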