As another example, consider the following:
```python
if isinstance(profile, SlurmRemote):
    logger.debug(f"Connecting to sftp::{profile.user}@{profile.host}")
    with Connection(
        host=profile.host, user=profile.user
    ) as conn, conn.sftp() as sftp:
        cache_info, home_info = remote_model_info(profile, sftp=sftp)
        cache_dir = os.path.join(profile.cache_dir, "models")
        logger.debug(f"Searching cache directory {cache_dir}")
        try:
            model_dirs = sftp.listdir(cache_dir)
            for model_dir in filter(lambda x: x.startswith("models--"), model_dirs):
                _, namespace, model = model_dir.split("--")
                ...
```
Instead of looping over the model directories to list all revisions, we could pull down all of the information with one recursive `ls`, for example, and then handle the looping locally.
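For instance (a sketch only: the helper name, the `find`-based command, and the Hugging Face `models--<namespace>--<model>/snapshots/<revision>` cache layout are assumptions rather than taken from the code), the per-directory `listdir` calls could collapse into a single remote listing that is parsed locally:

```python
import posixpath

from fabric import Connection  # assumed: the snippet above already uses fabric.Connection


def list_cached_revisions(conn: Connection, cache_dir: str) -> dict:
    """Map (namespace, model) -> [revision, ...] using one remote call.

    Sketch only: assumes the Hugging Face cache layout
    models--<namespace>--<model>/snapshots/<revision> and that `find`
    exists on the remote host.
    """
    # One round trip instead of one listdir() per model directory.
    result = conn.run(
        f"find {cache_dir} -mindepth 3 -maxdepth 3 -type d -path '*/snapshots/*'",
        hide=True,
        warn=True,
    )
    revisions: dict = {}
    for path in result.stdout.splitlines():
        revision = posixpath.basename(path)
        model_dir = posixpath.basename(posixpath.dirname(posixpath.dirname(path)))
        if not model_dir.startswith("models--"):
            continue
        _, namespace, model = model_dir.split("--")
        revisions.setdefault((namespace, model), []).append(revision)
    return revisions
```

This trades several SFTP round trips for one command execution, at the cost of depending on `find` being available on the remote host.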
Description

Currently, services make a lot of redundant system calls. For example, in the following logs we see that the job state is updated twice: once during `Service.refresh` and a second time during `Service.stop`. The reason this and other unnecessary calls occur is that a new job is created in each of these service methods with only the `job_id` provided to `__init__`.

Fundamentally, the issue is that we want to know the job state (i.e., status, node, port) at the moment we issue commands to it, but we have to ask the cluster for that. It seems pretty safe to assume `node` and `port` are fixed, so we could record those on the service and avoid re-fetching them on subsequent calls. Status is clearly not fixed, but do we really need to know it before trying to cancel a job?

Fix

A simple fix would be to add (mapped or non-mapped?) `node` and `remote_port` fields to `Service`.
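A minimal sketch of that shape, assuming plain (non-mapped) attributes and a hypothetical `Job` wrapper, since the real class definitions and scheduler calls aren't shown in this issue:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JobState:
    status: str
    node: str
    port: int


class Job:
    """Hypothetical stand-in for the wrapper that talks to the cluster."""

    def __init__(self, job_id: str, node: Optional[str] = None, port: Optional[int] = None):
        self.job_id, self.node, self.port = job_id, node, port

    def state(self) -> JobState:
        # One system call to the scheduler (e.g. squeue/sacct) in the real code.
        raise NotImplementedError

    def cancel(self) -> None:
        # One system call (e.g. scancel); no status lookup required first.
        raise NotImplementedError


@dataclass
class Service:
    job_id: str
    # Recorded once, on the assumption that a job's node and port are fixed
    # after scheduling; sketched here as plain (non-mapped) fields.
    node: Optional[str] = None
    remote_port: Optional[int] = None

    def refresh(self) -> None:
        state = Job(self.job_id).state()   # one call to the cluster
        self.node = state.node             # cache for later calls
        self.remote_port = state.port

    def stop(self) -> None:
        # No second refresh: cancel using the job_id (and cached node/port).
        Job(self.job_id, node=self.node, port=self.remote_port).cancel()
```

Whether those fields should be mapped (persisted with the service record) or plain attributes probably comes down to whether a later process needs to reuse them without re-querying the cluster.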