roscisz / TensorHive

Tool for managing exclusive GPU access for distributed machine learning workloads
Apache License 2.0
154 stars, 25 forks

Reservation checks look at system users #354

Open tboinski opened 2 years ago

tboinski commented 2 years ago

The system reports reservation violations for system users such as gdm, root, or even a "None" user. This generates unnecessary spam.

roscisz commented 2 years ago

Thanks for your feedback.

A quick fix is to edit tensorhive/core/managers/InfrastructureManager.py: ignored_processes contains hard-coded names of processes that should be ignored. Maybe the Xorg processes have a different name in your setup, in which case you could add them there. Please let me know if it helps.

Anyhow, I will leave the issue open with the following comment:

It should be possible to provide a custom whitelist of system users that would be ignored by the infrastructure manager... or only by the protection service? The list should be configurable via configuration files.
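
A minimal sketch of what such a configurable whitelist could look like. Everything here is an assumption made for illustration, not TensorHive's actual code: the `[protection_service]` section, the `ignored_users` option, the `load_ignored_users` and `filter_violations` helpers, and the shape of the process dictionaries are all hypothetical.

```python
import configparser

# Hypothetical defaults, used when the config does not list any users.
DEFAULT_IGNORED_USERS = {"root", "gdm"}


def load_ignored_users(config_text):
    """Parse a comma-separated list of usernames from INI-style config text."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    # Section and option names are assumptions, not TensorHive's real config keys.
    raw = parser.get("protection_service", "ignored_users", fallback="")
    names = {name.strip() for name in raw.split(",") if name.strip()}
    return names or set(DEFAULT_IGNORED_USERS)


def filter_violations(processes, ignored_users):
    """Keep only processes owned by a known, non-whitelisted user."""
    return [
        proc for proc in processes
        if proc.get("owner") is not None and proc["owner"] not in ignored_users
    ]


config = "[protection_service]\nignored_users = root, gdm, lightdm\n"
ignored = load_ignored_users(config)

processes = [
    {"owner": "root", "command": "/usr/lib/xorg/Xorg"},
    {"owner": None, "command": "unknown"},
    {"owner": "alice", "command": "python train.py"},
]
violations = filter_violations(processes, ignored)
```

With this input only the process owned by alice remains in `violations`; processes owned by whitelisted users or with no resolvable owner are dropped before any violation is reported.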

tboinski commented 2 years ago

The question is whether this should be a process list or a list of usernames, as processes can be system dependent. In my setup the likely culprit was /usr/bin/gnome-shell.

tboinski commented 2 years ago

In the current develop branch the system behaves differently. The protection service now asks the User model for the user's email. If the violator is a system user (e.g. root or gdm), the User model throws an exception and no emails are sent, not even to admins.
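
One possible way to keep admin notifications working in that case is to treat a failed user lookup as "no user email" instead of letting the exception propagate. This is only a sketch: `UserNotFound`, `find_user_email`, `collect_recipients` and the dictionary standing in for the User model are hypothetical names, not the ones used in the develop branch.

```python
class UserNotFound(Exception):
    """Hypothetical error raised when a username has no TensorHive account."""


def find_user_email(username, known_users):
    """Stand-in for the User model lookup; raises for system users."""
    try:
        return known_users[username]
    except KeyError:
        raise UserNotFound(username)


def collect_recipients(violator, known_users, admin_emails):
    """Always include admins; add the violator only if an email is known."""
    recipients = list(admin_emails)
    try:
        recipients.append(find_user_email(violator, known_users))
    except UserNotFound:
        # System users (root, gdm, None) have no account, so only admins are notified.
        pass
    return recipients


known_users = {"alice": "alice@example.com"}
admins = ["admin@example.com"]

system_user_recipients = collect_recipients("gdm", known_users, admins)
regular_user_recipients = collect_recipients("alice", known_users, admins)
```

Here `system_user_recipients` contains only the admin address, while `regular_user_recipients` contains both the admin and alice.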