in03 / proxima

Transcode source media directly from DaVinci Resolve using multiple machines for encoding. Great for creating proxies quickly.
MIT License

refactor: Job dict object as Class #202

Closed in03 closed 1 year ago

in03 commented 2 years ago

Problem

We've been living dangerously for the past couple of years since our humble beginnings, and we're still passing a dictionary around as our job object. This isn't great. We don't get any of the type validation or IntelliSense that comes with classes. And, ignoring the FP vs OOP debate, we don't get instance methods or shared state between closely related functions. Most of our handlers could join the job class. We're iterating the same jobs over and over again to filter, validate and prompt based on different criteria of the same data, which is very inefficient. We could do most of this once, on __init__, as sketched below.
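As a rough sketch of that last point, a hypothetical Job class could compute these answers once in __init__ instead of every handler re-scanning the dict (field names here are illustrative only):

class Job:
    def __init__(self, properties: dict):
        # Filter/validate once up front; handlers then just read attributes
        self.clip_name = properties["Clip Name"]
        self.proxy_status = properties["Proxy"]
        self.online = self.proxy_status not in ("Offline", "None")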

Dataclasses

Dataclasses do away with the __init__ boilerplate to make working with lots of variables easy. They also implement common dunder methods like __repr__ and __eq__, so instances print nicely and can be compared. They keep data-centric classes tidy, helping keep bugs to a minimum. Implementing dataclasses will win us all of the above, as the snippet below shows.
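A quick, generic demonstration of what the @dataclass decorator generates for free (nothing proxima-specific here):

from dataclasses import dataclass

@dataclass
class Point:
    x: int
    y: int

p = Point(1, 2)
print(p)                 # __repr__ for free: Point(x=1, y=2)
print(p == Point(1, 2))  # __eq__ for free: True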

Unfortunately we have some hurdles to make them work.

Mutability

Currently, jobs have data that can be treated as immutable, mostly in the form of clip properties and project metadata. Some of this metadata we manipulate to reflect user choices and detected paths, notably the proxy status and any changing proxy paths.

We use the existing keys to save us from extra complications later – these are big dictionaries, and potentially a lot of them to iterate through!

Unfortunately this is slow 🐌 since we're constantly handling mutable data, and there's no static type safety. Just KeyError, TypeError or, worst of all, 'NoneType' object has no attribute ___
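For contrast, a frozen dataclass fails loudly at the point of mutation instead of somewhere downstream (a toy example, not proxima code):

from dataclasses import FrozenInstanceError, dataclass

@dataclass(frozen=True)
class ClipInfo:
    fps: float

info = ClipInfo(fps=24.0)
try:
    info.fps = 25.0  # frozen dataclasses refuse mutation outright
except FrozenInstanceError as err:
    print(err)  # cannot assign to field 'fps'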

Parsing

We need to parse certain values before we use them in certain places. That's exactly what the __init__ method is for, handling the instance variables... but that's a lot of handling! We don't need to handle them all. To avoid that, we can update only the ones we need using the __post_init__ dunder, as sketched below.
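Something like this, with a hypothetical ParsedClip just to show the __post_init__ pattern:

from dataclasses import dataclass, field

@dataclass
class ParsedClip:
    resolution_raw: str  # "1920x1080" straight from Resolve
    fps_raw: str         # "24.0" straight from Resolve
    resolution: list = field(init=False)
    fps: float = field(init=False)

    def __post_init__(self):
        # Parse only the fields that need it; leave the rest alone
        self.resolution = [int(x) for x in self.resolution_raw.split("x")]
        self.fps = float(self.fps_raw)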

Clip properties form the main 'chunk' of a job.

clip_properties = {
    "clip_name": cp["Clip Name"],
    "file_name": cp["File Name"],
    "file_path": cp["File Path"],
    "duration": cp["Duration"],
    "resolution": cp["Resolution"].split("x"),
    "frames": int(cp["Frames"]),
    "fps": float(cp["FPS"]),
    "h_flip": cp["H-FLIP"] == "On",
    "v_flip": cp["V-FLIP"] == "On",
    "proxy_dir": proxy_dir,
    "start": int(cp["Start"]),
    "end": int(cp["End"]),
    "start_tc": cp["Start TC"],
    "proxy_status": cp["Proxy"],
    "proxy_media_path": cp["Proxy Media Path"],
    "end_tc": cp["End TC"],
    "media_pool_item": media_pool_item,
}

Then there are the project properties that we add later.

# ...which get merged into the job dict, roughly:
job = dict(
    clip_properties,
    project=project_name,
    timeline=timeline_name,
    proxy_settings=settings["proxy"],
    paths_settings=settings["paths"],
)

Proposal

A traditional class, Job, with immutable SourceMetadata and ProjectMetadata dataclasses, plus dynamic variables to easily update media_pool_item, proxy status and any changing paths.

All fields are set to their target types, like bool for h_flip and v_flip and a list for the horizontal and vertical resolution, instead of str for everything. This means the worker gets what it expects, pre-parsed.

Instead of passing proxy and path settings separately, we pass the entire settings object, which we can then validate as type SettingsManager.
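That validation could be as simple as this (a sketch; the import path for SettingsManager is assumed):

from settings import SettingsManager  # hypothetical import path

def validate_settings(settings) -> SettingsManager:
    # Fail fast if we were handed anything other than a SettingsManager
    if not isinstance(settings, SettingsManager):
        raise TypeError(f"Expected SettingsManager, got {type(settings).__name__}")
    return settings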

Any handlers that don't prompt the user can become class methods, like getting a collision-free output path (see the sketch below).
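For example, the collision-free output path handler might look roughly like this as a private method (a sketch; field names follow the dataclasses below):

from pathlib import Path

def _get_output_path(self) -> str:
    # Increment a numeric suffix until the candidate path is free
    candidate = Path(self.source_metadata.proxy_dir) / self.source_metadata.file_name
    stem, suffix = candidate.stem, candidate.suffix
    count = 1
    while candidate.exists():
        candidate = candidate.with_name(f"{stem}_{count}{suffix}")
        count += 1
    return str(candidate)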


from dataclasses import dataclass

@dataclass(frozen=True)
class SourceMetadata:

    clip_name: str
    file_name: str
    file_path: str
    duration: str
    resolution: list
    frames: int
    fps: float
    h_flip: bool
    v_flip: bool
    proxy_dir: str
    start: int
    end: int
    start_tc: str
    proxy_status: str
    proxy_media_path: str
    end_tc: str

@dataclass(frozen=True)
class ProjectMetadata:
    project_name: str
    timeline_name: str
    # etc...

# Create a registry of MediaPoolItems that stay in the queuer.
# We look up our items for post-encode link using file name.
class MediaPoolRefs:
    def __init__(self):
        self.media_pool_items: dict = {}

    def add_ref(self, filename: str, media_pool_item: PyRemoteObj):
        """filename should reference the source media that media_pool_item belongs to"""
        if filename in self.media_pool_items:
            raise ValueError(f"{filename} already registered!")
        if media_pool_item in self.media_pool_items.values():
            raise ValueError(f"{media_pool_item} already registered!")
        self.media_pool_items[filename] = media_pool_item

    def get_ref(self, filename: str = "", all: bool = False):

        if all:
            return dict(self.media_pool_items)

        mpi = self.media_pool_items.get(filename)
        if mpi is None:
            raise ValueError(f"{filename} not registered!")
        return mpi

    def drop_refs(self):
        self.media_pool_items = {}

class Job:
    def __init__(
        self,
        source_metadata: SourceMetadata,
        project_metadata: ProjectMetadata,
        settings: SettingsManager,
    ):

        # Get data
        self.source_metadata = source_metadata
        self.project_metadata = project_metadata
        self.settings = settings

        # Get dynamic vars
        self.output_path = self._get_output_path()
        self.online = self._is_online()
        self.orphan = self._is_orphan()
        self.linkable = self._is_linkable()

    def _is_online(self):
        status = self.source_metadata.proxy_status
        switch = {
            "Offline": None,
            "None": False,
        }
        # Status is a resolution (depends on source res) when online
        return switch.get(status, True)

    # Refactor 'handle_file_collisions'
    def _get_output_path(self):
        # increment_file_if_exists, blah...
        ...

    def _is_orphan(self):
        if not self.source_metadata.proxy_status:
            # check if orphaned, blah...
            return True
        return False

    def _is_linkable(self):
        # check the proxy exists and matches, blah...
        ...

media_pool_refs = MediaPoolRefs()

# approximately like so
def get_proxy_jobs(r_, media_pool_ref, settings):

    jobs = []

    for mpi in mpis:  # mpis: media pool items gathered from the Resolve API
        cp = mpi.GetClipProperty()
        # bunch of filtering...

        project_metadata = ProjectMetadata(r_.project, r_.timeline)

        # Some picking and parsing, need to get hands dirty
        source_metadata = SourceMetadata(
            clip_name=cp["Clip Name"],
            file_name=cp["File Name"],
            file_path=cp["File Path"],
            duration=cp["Duration"],
            resolution=cp["Resolution"].split("x"),
            frames=int(cp["Frames"]),
            fps=float(cp["FPS"]),
            h_flip=cp["H-FLIP"] == "On",
            v_flip=cp["V-FLIP"] == "On",
            proxy_status=cp["Proxy"],
            proxy_media_path=cp["Proxy Media Path"],
            proxy_dir=proxy_dir,
            start=int(cp["Start"]),
            end=int(cp["End"]),
            start_tc=cp["Start TC"],
            end_tc=cp["End TC"],
        )

        # Define job
        job = Job(source_metadata, project_metadata, settings)
        jobs.append(job)

        # Add media pool item reference
        media_pool_ref.add_ref(source_metadata.file_name, mpi)

    return jobs

Moving forward from here

At some point we'll want to introduce chunking. Since the resolve module actually instantiates all the Jobs with get_proxy_jobs(), it returns them to the queuer class. If we want them chunked, we need to re-instantiate each job, since every job should be a chunk. We don't want to re-init everything from scratch by building brand-new classes... (see the sketch below).
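One option, assuming SourceMetadata stays a frozen dataclass: dataclasses.replace copies an instance while overriding only the fields that change, so each chunk is a cheap re-instantiation rather than a from-scratch rebuild. chunk_source here is hypothetical:

from dataclasses import replace

def chunk_source(sm: SourceMetadata, seconds: int = 30):
    """Yield one SourceMetadata per roughly seconds-long chunk of the clip."""
    chunk_len = max(1, int(sm.fps * seconds))
    for start in range(sm.start, sm.end, chunk_len):
        # Only start/end differ per chunk; every other field is copied
        yield replace(sm, start=start, end=min(start + chunk_len, sm.end))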

We can merge the project_metadata dataclass with source_metadata. That way we don't have to iterate over each source_metadata object just to add the same project_metadata to all of them.
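Roughly, a hypothetical merged dataclass:

@dataclass(frozen=True)
class JobMetadata:
    # Source fields
    file_name: str
    file_path: str
    # Project fields folded in, so each job is self-contained
    project_name: str
    timeline_name: str
    # etc...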

We can make MediaPoolRefs a singleton class. That way we can register all the media_pool_items from the Resolve class and just import MediaPoolRefs in the queuer module. The reason we can't ship them with source_metadata is that they're unhashable: each one references a pointer in memory to a proprietary API object and will fail transport in Celery. We also don't want to discard them, and we want easy access to them from other modules. A singleton provides that.
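A classic Python singleton sketch (one of several well-known ways to do it):

class MediaPoolRefs:
    _instance = None

    def __new__(cls):
        # Every instantiation returns the same shared object
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.media_pool_items = {}
        return cls._instance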

There seems to be some discouragement of singletons in Python; many call them an anti-pattern. We may be able to substitute a plain module for them: all global state and functions. We'd just have to be careful that we are indeed importing the same module instance whenever we do.
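A module-level version might look like this (hypothetical media_pool_refs.py; Python caches modules in sys.modules, so every import shares the same state):

# media_pool_refs.py
_refs: dict = {}

def add_ref(filename: str, media_pool_item) -> None:
    if filename in _refs:
        raise ValueError(f"{filename} already registered!")
    _refs[filename] = media_pool_item

def get_ref(filename: str):
    try:
        return _refs[filename]
    except KeyError:
        raise ValueError(f"{filename} not registered!") from None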

If that works out well for us, it may be an idea to re-evaluate SettingsManager as a plain module.

Perhaps instead of returning jobs when we call get_proxy_jobs, we can add get_proxy_jobs and other sister functions to the Resolve class. Then we can save any variables we need onto self and call them like so:

# job module

class Job:
    # All that job stuff from the
    # Job class code block before
    ...

# (lives on the Resolve class, per the idea above)
def construct_jobs(
    self,
    settings: SettingsManager,
    source_metadata: SourceMetadata,
    chunky: bool = True,
):

    sm = source_metadata
    if chunky:
        sm = sm.i_like_em_chunky(seconds=30)

    jobs = [Job(x, settings) for x in sm]
    self.jobs = jobs
    return jobs

# queuer module
from resolve import Resolve
from SettingsManager import settings
from job import Job, MediaPoolRefs

r = Resolve(settings)
sm = r.get_source_metadata(make_mpi_refs=True)
jobs = r.construct_jobs(settings, sm, chunky=True)

mpr = MediaPoolRefs()
all_refs = mpr.get_ref(all=True)
in03 commented 1 year ago

/cib

github-actions[bot] commented 1 year ago

Branch feature/issue-202-Refactor-job-dict-object-as-Class created!