in03 / proxima

Transcode source media directly from DaVinci Resolve using multiple machines for encoding. Great for creating proxies quickly.
MIT License
50 stars 3 forks source link

refactor: Job dict object as Class #234

Closed github-actions[bot] closed 1 year ago

github-actions[bot] commented 1 year ago
Original issue description # Problem We've been living dangerously for the past couple years since our humble beginings, and we're still passing a dictionary around as our job object. This isn't great. We don't get any type-validation or intellisense that comes with classes. And, ignoring the FP vs OOP debate, we don't get any instance methods or shared state between **very** related functions. Most of our handlers can join the job class. We're iterating the same jobs over and over again to filter, validate and prompt based on different criteria of the same data, which is very inefficient. We could do most of this on `__init__`. ## Dataclasses Dataclasses do away with the `__init__` section to make working with lots of variables easy. They also implement some common class methods like `_repr_` and `__eq__` to allow comparing classes. They keep data-centric classes tidy, helping keep bugs to a minimum. Implementing dataclasses will win us: - static type safety - a single object to structure our data in instead of a loosely defined dict, meaning: - easier comparisons - more solid definitions Unfortunately we have some hurdles to make them work. ### Mutability Currently, jobs have data that can be treated as immutable, mostly in the form of clip properties and project metadata. Some of this metadata we manipulate to reflect user choices and detected paths, notably: - `proxy` - Set to None to rerender offline - `proxy_media_path` - set to determined output path We use the existing keys to save us from extra complications later – these are big dictionaries, and potentially a lot of them to iterate through! Unfortunately this is slow 🐌 since we're constantly handling mutable data, and there's no static type safety. Just `KeyError`, `TypeError` or worst of all `NoneType has no attribute ___` ### Parsing We need to parse certain values before we use them in certain places. That's exactly what the `__init__` section is for, handling the instance variables... but that's a lot of handling! We don't need to handle them all. To avoid handling them all, we could update only the ones we need using the `__post_init__` dunder. Clip properties form the main 'chunk' of a job. ```python clip_properties = {     "clip_name": cp["Clip Name"],     "file_name": cp["File Name"],     "file_path": cp["File Path"],     "duration": cp["Duration"],     "resolution": list(cp["Resolution"]).split("x"),     "frames": int(cp["Frames"]),     "fps": float(cp["FPS"]),     "h_flip": True if cp["H-FLIP"] == "On" else False,     "v_flip": True if cp["H-FLIP"] == "On" else False,     "proxy_dir": proxy_dir,     "start": int(cp["Start"]),     "end": int(cp["End"]),     "start_tc": cp["Start TC"],     "proxy_status": cp["Proxy"],     "proxy_media_path": cp["Proxy Media Path"]     if not len(cp["Proxy Media Path"])     else cp["Proxy Media Path"],     "end_tc": cp["End TC"],     "media_pool_item": media_pool_item, } ``` Then there's project properties that we add later. ```python         project=project_name,         timeline=timeline_name,         proxy_settings=settings["proxy"],         paths_settings=settings["paths"],     ) ``` ## Proposal A traditional class, `Job`, with immutable, `SourceMetadata` and `ProjectMetadata` dataclasses; and dynamic variables to easily update `mediapoolitem` , proxy status and any changing paths. All types are set to target types, like `bool` for `h_flip` and `v_flip`, list for horizontal and vertical `resolution`, instead of `str` for everything. This means the worker gets what it expects, pre-parsed. Instead of passing proxy and path settings, we pass the entire settings object, then we can validate the object as type `SettingsManager` Any handlers that don't prompt the user can be become class methods, like getting a collision-free output path. ```python from typing import Any, Union @dataclass(frozen=True) class SourceMetadata:     clip_name: str     file_name: str     file_path: str     duration: str     resolution: list     frames: int     fps: float     h_flip: bool     v_flip: bool     proxy_dir: str     start: int     end int     start_tc: str     proxy_status: str     proxy_media_path: str end_tc: str     @dataclass(frozen=True) class ProjectMetadata: project_name: str timeline_name: str [etc...] # Create a list of MediaPoolItems that stay in the queuer. # We look up our items for post-encode link using file-name. class MediaPoolRefs(): def __init__(self): self.media_pool_items = list() def add_ref(self, filename:str, media_pool_item:PyRemoteObj): """filename should reference source media media_pool_item belongs to""" if filename not in self.media_pool_items: if media_pool_item not in self.media_pool_items: self.media_pool_items.append({filename, media_pool_item}) return raise ValueError(f"{media_pool_item} already registered!") raise ValueError(f"{filename} already registered!") def get_ref(self, filename:str, all:bool=False): if all: return dict(self.media_pool_items) mpi = self.media_pool_items.get(filename) if mpi == None: raise ValueError(f"{filename} not registered!") return mpi def drop_refs(self): self.media_pool_items = [] class Job(): def __init__( self, src_meta: SourceMetadata, proj_meta: ProjectMetadata, settings: SettingsManager, ): # Get data self.source_metadata = source_metadata self.project_metadata = project_metadata # Get dynamic vars self.output_path = self._get_output_path() self.online = _is_online() self.orphan = self._is_orphan() self.linkable = self._is_linkable() def is_online(self): status = self.source_metadata.proxy_status switch = { "Offline": None, "None": False, } # Status is resolution (depends on source-res) when online return switch.get(status, True) # Refactor 'handle_file_collisions' def _get_output_path(self): # increment_file_if_exist_blah... return output_path def is_orphan(self) if not self.proxy_status: # check if orphaned blah... return True media_pool_refs = MediaPoolRefs() # approximately like so def get_proxy_jobs(r_, media_pool_ref, settings): for mpi in mpis: cp = mpi.GetClipProperty() # bunch of filtering... project_metadata = ProjectMetadata(r_.project, r_.timeline) # Some picking and parsing, need to get hands dirty source_metadata = SourceMetadata(             clip_name = cp["Clip Name"],             file_name = cp["File Name"],             file_path = cp["File Path"],             duration = cp["Duration"],             resolution = list(cp["Resolution"]).split("x"),             frames = int(cp["Frames"]),             fps = float(cp["FPS"]),             h_flip = True if cp["H-FLIP"] == "On" else False,             v_flip = True if cp["H-FLIP"] == "On" else False,             proxy_status = cp["Proxy"],             proxy_media_path = cp["Proxy Media Path"]             proxy_dir = proxy_dir,             start = int(cp["Start"]),             end = int(cp["End"]),             start_tc = cp["Start TC"],             end_tc = cp["End TC"]                    ) # Define job job = Job(project_metadata, source_metadata, settings) # Add media pool item reference media_pool_ref.add_mpi(source_metadata.file_name, mpi) ``` ## Moving forward from here At some point we'll want to introduce chunking. Since the `resolve` module actually instantiates all the `Jobs` with `get_proxy_jobs()`, it returns them to the queuer class. If we want them to be chunked, we need to re instantiate each job, since every job should be a chunk. We don't want to re-init everything again by making new classes... We can merge `project_metadata` dataclass with `source_metadata`. That way we don't have to iterate each `source_metadata` object and add the same `project_metadata` data. We can make `MediaPoolRefs` a Singleton class. That way we can register all the `media_pool_items` from the `Resolve` class and just import `MediaPoolRefs` in the `queuer` module. The reason why we can't ship them with `source_metadata` is that they are unhashable, referencing a pointer in memory to a proprietary API, and will fail transport in Celery. We also don't want to remove them, and we want easy access to them from other modules. A singleton provides that. There seems to be some discouragement of singletons in Python. Many call them an anti-pattern. We may be able to substitute them for a plain module? All global code and functions. We'd just have to be careful that we are indeed importing the same module instance whenever we do. If that works out well for us, it may be an idea to re-evaluate SettingsManager as a plain module. Perhaps instead of returning `jobs` when we call `get_proxy_jobs`, we can add `get_proxy_jobs` and other sister functions to the `Resolve` class. Then we can save any variables we need to that `self` and call them like so: ```python # job module class Job(): # All that job stuff from the # Job class code block before def construct_jobs( self, settings:SettingsManager, source_metadata: SourceMetadata, chunky:bool=True ) if chunky: sm = sm.i_like_em_chunky(seconds=30) jobs = [Job(x, settings) for x in sm] self.jobs = jobs return jobs ``` ```python from resolve import Resolve from SettingsManager import settings from job import Job, MediaPoolRefs r = Resolve(settings) sm = r.get_source_metadata(make_mpi_refs=True) jobs = job.construct_jobs(settings, sm, chunky=True) mpr = MediaPoolRefs() all_refs = mpr.get_ref(all=True) ```

closes #202