Automatic `code` object creation can fail if another process writes to the subshell

ConradJohnston commented 4 weeks ago

When aiida-shell runs with some arbitrary command, like cat, an AiiDA code object is created automatically.

Part of this process is calling which to get the path to the arbitrary command: https://github.com/sphuber/aiida-shell/blob/2ab82195b755cfc6d814439a72670632d2e025ff/src/aiida_shell/launch.py#L131

On my machine, I've seen a helpful warning from a sysadmin process being written to the subshell on invocation, and thus this bonus message get prepended to the output of which. This then ends up in the _aiidasubmit.sh script written by _prepareforsubmission.

An example of the created faulty code is like this:

> verdi code show foo
-----------------------  ---------------------------------------------------------------
PK                       42
UUID                     8b7a85b4-54b4-4080-b270-3ecf58163e18
Type                     core.code.installed.shell
Label                    obabel
Description
Default calc job plugin  core.shell
Use double quotes        False
With mpi
Prepend text
Append text
Computer                 localhost (localhost), pk: 2
Filepath executable      WARNING:  helpful sysadmin message
                         /some/correct/path/foo

sphuber commented 4 weeks ago

Thank Mr. Johnston for the report. As a first workaround you could opt to set the resolve_command argument to False to disable the executable expansion using which.

As for a proper fix, I am wondering what the correct behavior should be. First step is to detect the problem. I could check if the captured stdout contains any newlines, which should guarantee there is spurious output. But it may also be on the same line, so I would have to check for spaces perhaps. But then again, on some systems, certain executable paths may contain spaces? Although probably this is very unlikely. Still, detecting spurious output is already not trivial to get right a 100%.

And then there is how to act on it. Should we just raise a warning, or try to fix it by automatically extracting what should be the executable path? Not sure if the latter can be done reliably. Any ideas?

ConradJohnston commented 4 weeks ago

Yes, I had the same struggle when thinking about fixes. I also pondered whether it is necessary to fix it - is it the reasonable to expect that there will be no additional output on SSH'ing to a machine? I think I can argue either way. These informational messages are fairly common on managed trad HPC.

If we can send more than one command over the transport, and guarentee that they are run in the same shell, then perhaps you can specify your own delimiter - like echo DELIMIT ; which foo. Then the admin message is clearly separated from the path, and you can safely include all characters as being part of the path?

sphuber commented 3 weeks ago

is it the reasonable to expect that there will be no additional output on SSH'ing to a machine?

Not really. That is also why verdi computer test checks exactly for this. If the shell contains additional output, it warns the user, becuase this can also interfere with other parts of AiiDA where the transport is used to execute commands.

If we can send more than one command over the transport, and guarentee that they are run in the same shell, then perhaps you can specify your own delimiter - like echo DELIMIT ; which foo. Then the admin message is clearly separated from the path, and you can safely include all characters as being part of the path?

I think this is indeed a viable solution:

In [1]: from aiida.transports.plugins.local import LocalTransport

In [2]: with LocalTransport(use_login_shell=True) as t:
   ...:     exit_code, stdout, stderr = t.exec_command_wait('echo "DELIMITER";which cat; echo "/DELIMITER"')
   ...:     print(stdout)
   ...:
LOAD PROFILE
DELIMITER
/usr/bin/cat
/DELIMITER

The LOAD PROFILE message echoed by using the login shell (an echo statement I quickly added to my ~/.profile) is cleanly isolated and we should be able to parse the output from the which command.

That being said, I am starting to wonder whether we actually want this feature at all. I already had to add the resolve_command argument in order to disable it because there are use cases where the resolution will fail. For example where the binary is only available after a module load. This would be passes through the preprend_text at runtime, but the which command would fail to find it.

The original reasoning was to add some validation when a user ran a shell job in case they made a typo in the command name. But perhaps, instead of anticipating it with all the associated pitfalls, it is best to leave determining a non existing command to the parser that will just parse that from the job's output. Since we are always expecting a bash shell on the other end, I think the message should always be command not found in that case, which the ShellParser could simply check for an return the corresponding exit code.

Do you still think a benefit of having the check in the launch_shell_job?

sphuber / aiida-shell

Automatic `code` object creation can fail if another process writes to the subshell #100