nod-ai / shark-ai

SHARK Inference Modeling and Serving
Apache License 2.0

Manage page allocations through a PageAllocation object, rather than directly in InferenceExecRequest #607

Open renxida opened 6 days ago

renxida commented 6 days ago

To manage the lifecycle of page allocations for an inference request, it may be helpful to encapsulate them behind an interface:

from abc import ABC, abstractmethod
from typing import List


class PageAllocation(ABC):
    """
    Abstract base class for page allocations in the cache.
    Subclasses only need to implement the core allocation methods.
    """
    @abstractmethod
    def get_page_list(self) -> List[PageInfo]:
        """Returns the list of pages that were allocated."""
        pass

    @abstractmethod
    def publish_pages(self) -> None:
        """
        Makes pages available to other requests after writing is complete.
        Associates tokens with pages and marks them as ready for reading.
        """
        pass

    @abstractmethod
    def release_pages(self) -> None:
        """
        Releases the allocation's reference to pages.
        Pages become eligible for eviction when their reference count reaches zero.
        """
        pass
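To make the interface concrete, here is a minimal sketch of a subclass implementing the three methods, with reference counting driving eviction eligibility as described in the docstrings. The ABC is repeated for completeness, and `PageInfo` here is a hypothetical stand-in for the cache's real page metadata, not the shortfin implementation:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List


@dataclass
class PageInfo:
    """Hypothetical stand-in for the cache's page metadata."""
    index: int
    ref_count: int = 0


class PageAllocation(ABC):
    @abstractmethod
    def get_page_list(self) -> List[PageInfo]: ...

    @abstractmethod
    def publish_pages(self) -> None: ...

    @abstractmethod
    def release_pages(self) -> None: ...


class SimplePageAllocation(PageAllocation):
    """Illustrative concrete allocation: takes a reference on each page
    at construction and drops it on release."""

    def __init__(self, pages: List[PageInfo]):
        self._pages = pages
        self._published = False
        for page in pages:
            page.ref_count += 1  # allocation holds a reference

    def get_page_list(self) -> List[PageInfo]:
        return list(self._pages)

    def publish_pages(self) -> None:
        # In a real cache this would also associate tokens with pages.
        self._published = True

    def release_pages(self) -> None:
        for page in self._pages:
            page.ref_count -= 1  # page evictable once this hits zero
        self._pages = []
```

A caller would hold the `SimplePageAllocation` for the lifetime of the request, call `publish_pages()` once writing completes, and `release_pages()` when the request finishes.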
renxida commented 6 days ago

We currently have this in InferenceExecRequest:

https://github.com/nod-ai/shark-ai/blob/0e74c394037784e46a1f898078d3142b04d91662/shortfin/python/shortfin_apps/llm/components/messages.py#L55-L81

renxida commented 6 days ago

The new methods would correspond to cache_page_indices and free_cache_pages.

lock_initial_cache_pages should create the cache allocation. lock_additional_cache_pages should acquire a new PageAllocation and then destroy the original one; due to caching, the newly acquired pages would overlap maximally with the existing pages.
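The acquire-then-release ordering matters: the new allocation takes its references before the old one drops them, so overlapping pages never hit a reference count of zero mid-extension. Here is a toy sketch of that flow; `ToyCache`, `ToyAllocation`, and the prefix-keyed block scheme are illustrative inventions, not the shortfin cache:

```python
from typing import Dict, List, Tuple


class ToyAllocation:
    """Holds references on a set of pages (identified by keys here)."""

    def __init__(self, cache: "ToyCache", keys: List[Tuple[int, ...]]):
        self.cache = cache
        self.keys = keys

    def release_pages(self) -> None:
        for key in self.keys:
            self.cache.ref_counts[key] -= 1  # evictable at zero
        self.keys = []


class ToyCache:
    """Toy page cache: one page per BLOCK-sized token prefix, so a longer
    sequence shares (overlaps with) all pages of its shorter prefixes."""

    BLOCK = 2

    def __init__(self):
        self.ref_counts: Dict[Tuple[int, ...], int] = {}

    def acquire(self, tokens: List[int]) -> ToyAllocation:
        keys = [
            tuple(tokens[:end])
            for end in range(self.BLOCK, len(tokens) + 1, self.BLOCK)
        ]
        for key in keys:
            # Existing prefix pages are reused; only new blocks are fresh.
            self.ref_counts[key] = self.ref_counts.get(key, 0) + 1
        return ToyAllocation(self, keys)


def lock_additional_cache_pages(
    cache: ToyCache, old_alloc: ToyAllocation, all_tokens: List[int]
) -> ToyAllocation:
    """Acquire an allocation for the full sequence, then release the old
    one. Shared pages stay referenced the whole time."""
    new_alloc = cache.acquire(all_tokens)  # take new references first...
    old_alloc.release_pages()              # ...then drop the old ones
    return new_alloc
```

Extending a 4-token allocation to 6 tokens reuses the two prefix pages and adds one new page; the old allocation's references are dropped only after the new allocation already holds its own.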

renxida commented 6 days ago

Implementing on #608