dnigh closed this issue 3 years ago
I don't exactly understand how it is supposed to work or what problem it solves; nonetheless, it's an interesting question.
In one of my workflows I have a batch processor that iterates over a list of files and checks for a lock file (usually filename.lock) in a temp directory. If it finds a lock file, it moves on to the next file in the list until it reaches one without a lock file, creates a lock file for it, and starts processing. This lets me run multiple instances of the batch processor on the same machine when processing loads are low, as well as instances on other machines that share the same workspace via sshfs. The concurrent instances know nothing about what the other instances are doing except through the lock files. Av1an already does something like this at a local scale; expanding it to track what's in progress outside the initial instance would allow multiple machines to work on a single file. I haven't dug into av1an too deeply, but I could take a crack at it if time allows, although I'm not particularly well versed in Python or Rust.
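A minimal sketch of that claim-next-unlocked-file loop, assuming locks live as `<name>.lock` files in a shared directory (the function name and layout here are illustrative, not av1an's API):

```python
import os

def next_unlocked_file(file_list, lock_dir):
    """Return the first file without a lock, creating its lock file to claim it."""
    for path in file_list:
        lock_path = os.path.join(lock_dir, os.path.basename(path) + ".lock")
        try:
            # O_CREAT | O_EXCL makes the check-and-create atomic on a local
            # filesystem; over sshfs/NFS this guarantee can be weaker, so two
            # instances racing on the same file is still possible in theory.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.close(fd)
            return path
        except FileExistsError:
            continue  # another instance holds this file; try the next one
    return None  # everything is locked (or the list is empty)
```

Using exclusive creation instead of "check then create" closes the window where two instances both see no lock and both start working on the same file.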
So av1an would need to check the temp folder, look through all the split folders for a file that does not end in .lock, and once one is found, append .lock to it and attempt to encode it. Would that be the workflow for something like this? We would need to remove the .lock whenever encoding fails, before exiting. I'm not sure how we would handle a .lock being left behind if something kills av1an outright, since there's no way to validate whether the encode was actually completed.
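One common mitigation for orphaned locks is to write the holder's PID and a start timestamp into the lock file, then treat locks past some age as stale. A hedged sketch (the 6-hour threshold and the function names are assumptions for illustration):

```python
import json
import os
import time

# Assumption for illustration: no single chunk takes longer than this to
# encode, so an older lock probably belongs to a killed instance.
STALE_AFTER_SECONDS = 6 * 3600

def claim_chunk(chunk_path):
    """Try to claim a chunk by creating chunk_path + '.lock'; True on success."""
    try:
        fd = os.open(chunk_path + ".lock", os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        with os.fdopen(fd, "w") as f:
            # Record who holds the lock and since when, so stale locks
            # left by a killed process can be detected later.
            json.dump({"pid": os.getpid(), "started": time.time()}, f)
        return True
    except FileExistsError:
        return False

def release_chunk(chunk_path):
    """Remove the lock, e.g. in a finally: block after encoding or on error."""
    try:
        os.remove(chunk_path + ".lock")
    except FileNotFoundError:
        pass

def lock_is_stale(chunk_path):
    """Heuristic: a lock older than STALE_AFTER_SECONDS was likely orphaned."""
    try:
        with open(chunk_path + ".lock") as f:
            started = json.load(f).get("started", 0)
    except (OSError, ValueError):
        return True  # unreadable or corrupt lock; treat as stale
    return time.time() - started > STALE_AFTER_SECONDS
```

This only bounds the damage of a killed instance rather than detecting it exactly; verifying the encode itself (e.g. probing the chunk output) would still be needed before trusting a reclaimed chunk.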
I took a cursory look at Manager.py and Queue.py; might we be able to piggyback on done.json? We could add another object that gets an element added at the start of each chunk encode and removed when the chunk finishes. That doesn't address interruptions by itself, but I can see the primary instance clearing the processing object (for lack of a better name) on resume, and secondary instances having a launch option telling them not to clear that object on launch and not to merge the chunks when the queue is empty. The primary instance would also need to wait idle if a secondary instance is working on the last chunk, so I guess a loop that polls the processing object until it's empty? I'm testing the processing object in a fork right now but haven't had a chance to iron out the wrinkles.
To clarify my earlier workflow: the file being worked on doesn't get renamed; instead, an empty file with the same name plus a .lock suffix is created when work starts. Both test1.txt and test1.txt.lock exist while test1.txt is being worked on, so when another instance goes to work on test1.txt, it first checks whether test1.txt.lock exists; if so, it moves on to test2.txt. Temp copies were also made for the actual work to be done on and shuffled back into place when finished, while failure cases were automatically moved to another location for manual remediation. Not the most elegant approach, but it got the job done.
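The stage-then-shuffle part of that workflow can be sketched as follows (a minimal illustration, with `work_fn` standing in for whatever processing step runs on the staged copy; assumes the scratch dir and the source share a filesystem so the rename back is atomic):

```python
import os
import shutil

def process_with_staging(src, work_dir, fail_dir, work_fn):
    """Copy src into a scratch dir, run work_fn on the copy, then shuffle the
    result back into place; failures are quarantined for manual remediation."""
    staged = os.path.join(work_dir, os.path.basename(src))
    shutil.copy2(src, staged)
    try:
        work_fn(staged)
        os.replace(staged, src)  # atomic only if work_dir and src share a filesystem
    except Exception:
        # Move the failed copy aside so a human can look at it later.
        shutil.move(staged, os.path.join(fail_dir, os.path.basename(src)))
        raise
```

Working on a staged copy means an interrupted or failed run never corrupts the original file, which is part of why the inelegant scheme was still reliable.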
Would it be possible to implement distributed encoding using the resume function and filesystem hints via lock files for actively encoding scenes on a mounted network filesystem (sshfs, etc.)?