saleyn / erlexec

Execute and control OS processes from Erlang/OTP
https://hexdocs.pm/erlexec/readme.html
Other
527 stars 139 forks source link

Unable to find the exec-port binary #82

Closed whitfin closed 7 years ago

whitfin commented 8 years ago

I'm running builds on Travis CI and erlexec is compiling x86_64-linux-gnu-gnu, but the runtime is trying to read x86_64-unknown-linux-gnu.

Is it possible to make reading of the binary resilient to errors like this? E.g. there should only ever be a single binary, so you could just list the directory and pull the first one.

saleyn commented 8 years ago

What is the result of a call to erlang:system_info(system_architecture). on that system?

The reason the binary port program is put into the architecture-specific directory is that some installations prefer to have a shared repository of libs (e.g. on NFS) that is accessed by multiple machines in local network with different architectures.

Additionally, it seems that this particular problem is caused by the changes introduced by the pull request https://github.com/saleyn/erlexec/pull/63, so maybe @rstreif could make an adjustment based on this, so that this fix doesn't break the cross-environment support?

whitfin commented 8 years ago

@saleyn it is indeed reported as x86_64-unknown-linux-gnu but somehow the compilation didn't use that. Can't the compiler compile to erlang:system_info(system_architecture) as well?

saleyn commented 8 years ago

The run-time uses that call (see https://github.com/saleyn/erlexec/blob/master/src/exec.erl#L534), but the compilation was modified by @rstreif to do this (https://github.com/saleyn/erlexec/blob/master/rebar.config.script#L16 - lines 4 through line 18) for cross-environment support, which is likely the reason you are experiencing this issue. You are welcome to propose a PR change to rebar.config.script that works in your case.

whitfin commented 8 years ago

@saleyn I don't really know how to fix the issue without simply reverting that PR. It seems that the change is assuming you're running on a different system to the one you're building on, and introduces some flaky(?) logic.

It seems to make sense to simply keep using erlang:system_info(system_architecture) with the assumption that you're compiling on the system you're going to run on, because that would guarantee that they're in sync.

@rstreif any ideas on a middle ground between your changes and what was there before?

whitfin commented 8 years ago

@saleyn would you be open for the run-time call to do a lookup to figure out what the parent dir of exec-port is? So at run-time it doesn't care what it was called, it just figures out the name automatically. Not sure if that would cause an issue?

E.g:

Ports = filelib:wildcard("*/exec-port", "priv"), First = hd(Ports), filename:dirname(First).

That way we can safely use erlang:system_info(system_architecture) and the app should still figure out @rstreif's case.

saleyn commented 8 years ago

I am certainly open to idea of modifying the default: https://github.com/saleyn/erlexec/blob/master/src/exec.erl#L533, so that if it doesn't find exec-port at that location, it would search elsewhere. However the preference would be to synchronize the compilation/run-time to some common ground. Perhaps instead of using <arch>-<vendor>-linux-gnu, switch to <arch>-<vendor>?

It's actually not that uncommon that the code doesn't get compiled on production systems, so there is always a chance of a mismatch when a packaged release is deployed on some machine that has a different architecture (i.e. 32bit vs 64bit, etc).

whitfin commented 8 years ago

@saleyn if that's the case then I'd suggest simply doing a search and having that be the only behaviour - I don't think there's a good way to detect all cases otherwise :(

With regards to synchronisation, using erlang:system_info(system_architecture) to define the arch would (I think) be the best option, and then just relying on the lookup to find the port instead of trying to read a specific path. It should suffice for most cases (aside from cross-compiling), and it's a pretty simple solution.

If that sounds good to you, I can work on a PR (although it might take me a while since I come from Elixir-land and haven't written much Erlang :p)

saleyn commented 8 years ago

That was the original behavior prior to @rstreif's PR. Let's wait for him to reply, as reverting that PR would break compatibility with his deployment.

rstreif commented 8 years ago

Sorry for the delayed response. I am traveling.

@zackehh What is the environment you are compiling in?

erlang:system_info(system_architecture) does not work when cross-building as the Erlang runtime is the one of the host and not of the target. In that case the compiler correctly reports the target The old behavior using erlang:system_info(system_architecture) is still there and used if the CXX environment variable is not set. In your case it is set to a compiler but that compiler is not reporting the vendor string. The fix would be to tokenize Target, see if the second element in the list is `linux' in which case the vendor string is missing, and then insert 'unknown'.

whitfin commented 7 years ago

@rstreif it's the default Ubuntu image, I dunno if that's handrolled or something.

In addition, this lib is now broken on the latest macOS with Erlang 19 as x86_64-apple-darwin15.5.0 is generated, but it tries to load x86_64-apple-darwin16.0.0.

I still believe that doing a search is the better option, it's far more resistant to change.

/cc @saleyn

FWIW my proposal is similar to https://github.com/saleyn/erlexec/pull/85 but rather than using erlang:system_info(system_architecture) to generate the dir to look in, simply do a search for a single nested exec-port. This would support @rstreif's changes and avoid the clashing when looking up. I'll have a PR for this shortly (just verifying things like tests first, I come from Elixir-land so I'm not too familiar with working on Erlang directly).

whitfin commented 7 years ago

Here is my proposed fix: https://github.com/saleyn/erlexec/pull/86

Verified that it finds the port correctly, even when the arch lookup is generating the wrong name. If someone could take a look over it, I'd appreciate it.