princeton-ddss / blackfish

Machine learning as a service (MLaaS) for busy researchers
https://princeton-ddss.github.io/blackfish/
MIT License

Reduce the number of system calls #74

Open cswaney opened 2 weeks ago

cswaney commented 2 weeks ago

Description
Currently, services make a lot of redundant system calls. For example, in the logs below, the job state is updated twice: once during Service.refresh and a second time during Service.stop. The reason this and other unnecessary calls occur is that each of these service methods constructs a new job with only the job_id passed to __init__, so each call has to go back to the cluster for the rest of the job's state.

DEBUG:    Checking status of service b8928edf-74d8-4245-afee-8612a0e04c06. Current status is HEALTHY. [2024-11-10 15:36:24.135]
DEBUG:    Updating job state (job_id=60293666). [2024-11-10 15:36:24.136]
DEBUG:    The current job state is: TIMEOUT (job_id=$60293666) [2024-11-10 15:36:24.872]
DEBUG:    Service b8928edf-74d8-4245-afee-8612a0e04c06 has a timed out job. Setting status to TIMEOUT and stopping the service. [2024-11-10 15:36:24.873]
INFO:     Stopping service b8928edf-74d8-4245-afee-8612a0e04c06 [2024-11-10 15:36:24.873]
DEBUG:    Updating job state (job_id=60293666). [2024-11-10 15:36:24.873]
DEBUG:    The current job state is: TIMEOUT (job_id=$60293666) [2024-11-10 15:36:25.541]
DEBUG:    Canceling job 60293666 [2024-11-10 15:36:25.542]

Fundamentally, the issue is that we want to know the job state (i.e., status, node, port) at the instant we issue commands to it, but we have to ask the cluster for that. It seems pretty safe to assume that node and port are fixed, so we could record those on the service and avoid re-fetching them on subsequent calls. Status is clearly not fixed, but do we really need to know it before trying to cancel a job?
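
For the cancellation case specifically, arguably not: scancel on a job that has already finished just reports an error rather than causing harm. A minimal sketch of cancelling without the extra state query, assuming a Slurm cluster reached through a fabric Connection (stop_without_refresh is a hypothetical helper, not the existing Service.stop):

    from fabric import Connection


    def stop_without_refresh(service, profile) -> None:
        """Cancel the Slurm job directly instead of re-fetching its state first."""
        with Connection(host=profile.host, user=profile.user) as conn:
            # warn=True: don't raise if the job has already exited on its own.
            conn.run(f"scancel {service.job_id}", warn=True)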

Fix
A simple fix would be to add (mapped or non-mapped?) node and remote_port fields to Service.
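
A minimal sketch of what that could look like, assuming Service is a SQLAlchemy declarative model (the column names, types, and defaults here are assumptions, not the existing schema):

    from typing import Optional
    from uuid import UUID, uuid4

    from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column


    class Base(DeclarativeBase):
        pass


    class Service(Base):
        __tablename__ = "service"

        id: Mapped[UUID] = mapped_column(primary_key=True, default=uuid4)
        job_id: Mapped[Optional[str]] = mapped_column(default=None)

        # Recorded once (e.g., on the first successful refresh) and assumed fixed
        # for the lifetime of the job, so stop/ping can reuse them instead of
        # asking the cluster again.
        node: Mapped[Optional[str]] = mapped_column(default=None)
        remote_port: Mapped[Optional[int]] = mapped_column(default=None)

Whether to make them mapped columns or plain attributes mostly comes down to whether the values should survive an API restart; mapped columns would let a restarted server keep issuing commands without re-querying the cluster.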

cswaney commented 1 week ago

As another example, consider the following:

    if isinstance(profile, SlurmRemote):
        logger.debug(f"Connecting to sftp::{profile.user}@{profile.host}")
        with Connection(
            host=profile.host, user=profile.user
        ) as conn, conn.sftp() as sftp:
            cache_info, home_info = remote_model_info(profile, sftp=sftp)
            cache_dir = os.path.join(profile.cache_dir, "models")
            logger.debug(f"Searching cache directory {cache_dir}")
            try:
                model_dirs = sftp.listdir(cache_dir)
                for model_dir in filter(lambda x: x.startswith("models--"), model_dirs):
                    _, namespace, model = model_dir.split("--")
                    ...

Instead of looping over the model directories to list all of their revisions, we could pull down all of the information with one recursive ls, for example, and then do the looping locally.
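
A rough sketch of that approach, assuming the Hugging Face cache layout implied above (cache_dir/models/models--<namespace>--<model>/snapshots/<revision>) and using a single find over SSH; the helper name and the exact find flags are assumptions:

    import os
    from collections import defaultdict

    from fabric import Connection


    def list_remote_revisions(profile) -> dict[tuple[str, str], list[str]]:
        """Gather every cached model revision in one remote call and group locally."""
        cache_dir = os.path.join(profile.cache_dir, "models")
        revisions: dict[tuple[str, str], list[str]] = defaultdict(list)
        with Connection(host=profile.host, user=profile.user) as conn:
            # One round-trip: list every snapshot directory under the cache.
            result = conn.run(
                f"find {cache_dir} -mindepth 3 -maxdepth 3 -type d -path '*/snapshots/*'",
                hide=True,
                warn=True,
            )
        for line in result.stdout.splitlines():
            parts = line.strip().rsplit("/", 3)
            if len(parts) < 4 or parts[-2] != "snapshots":
                continue
            model_dir, revision = parts[-3], parts[-1]
            if not model_dir.startswith("models--"):
                continue
            _, namespace, model = model_dir.split("--")
            revisions[(namespace, model)].append(revision)
        return dict(revisions)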