saleyn / erlexec

Execute and control OS processes from Erlang/OTP
https://hexdocs.pm/erlexec/readme.html
Other
527 stars 139 forks source link

Make runtime lookups more resilient #86

Closed whitfin closed 7 years ago

whitfin commented 7 years ago

This PR makes the lookup of exec-port at runtime more resilient.

This should resolve #82 in a way that doesn't cause an issue with #63. Basically it makes the lookup handle situations where the arch has changed, without just breaking.

I believe this should also work alongside #84, but I'm not sure? Can't see why not.

saleyn commented 7 years ago

This solution might pick the executable compiled for the wrong architecture if there's more than one in the build.

whitfin commented 7 years ago

@saleyn that's true but I'd argue that it'd never happen unless you were doing something weird (why would you ever have two different archs in there?).

This is the safest way, because the logic which names the exec-port directory is proven inconsistent - it is broken on Travis, and now macOS Sierra/Erlang 19. Both of those use cases are resolved with this, and so are all existing use cases aside from the one you mentioned.

It takes the name of the directory out of the equation (as it should be, as there should only be one executable), which naturally makes it more resilient.

If you have any suggestions I'm glad to hear them, but I maintain that the lookup as it currently exists is fundamentally flawed (if only because it relies on the state of CXX which is totally ridiculous given this is an Erlang lib).

saleyn commented 7 years ago

Unfortunately that is a hard requirement. Some cluster installations have libs shared on NFS with ports for different architectures precompiled, with ERL_LIBS pointing to NFS.

whitfin commented 7 years ago

@saleyn but you see the issue with it right? You're dictating what someone can set as $CXX, which can very easily clash because they need it set for another library or something.

I get that it's a useful thing to have, but it is flat out breaking this library in a bunch of cases. Maybe there's some way we can fuzzy match them without having to be exact?

saleyn commented 7 years ago

I agree that the https://github.com/saleyn/erlexec/pull/63 solution that requires to set $CXX could be error prone. However, there are production installations of this project that rely on the fact of being able to find the executable for the right architecture in a pre-packaged shared build. So, dropping that feature by finding first matching instance is not feasible. I would gladly accept another solution that preserves backward compatibility for this and works for #82 and #63.

whitfin commented 7 years ago

@saleyn what do you think about using current behavior for multiple binaries and this PR in the case of a single binary? That seems like it might satisfy both

saleyn commented 7 years ago

Yes, that's acceptable.

rstreif commented 7 years ago

@zackehh Essentially @saleyn is correct. In the case you have multiple exec-port in different architecture-specific directories. I do not know why @saleyn decided to put exec-port into an architecture-specific directory. The only use case I can imagine is that erlexec is on a drive that is shared by multiple machines that use different architectures.

For any standard build of erlexec where you build it for your build host, essentially building it for the same system and architecture, the change I made to the rebar.config.script should not make any difference at all since it maintains the original behavior of using erlang:system_info(system_architecture). Whether you are building in a build host environment or in a cross environment can typically pretty reliably be determined by checking the CXX variable which is commonly used by cross-build environments. If that variable is set then the compiler it contains is used to determine the host triplet.

That's why I asked how you are building erlexec and if the CXX variable is set for whatever reason. A stock Ubuntu does not have it set be default in a user environment. However, I do see where rebar.config.script could have an issue. I need to look into a possible fix.

whitfin commented 7 years ago

@saleyn cool, I've just pushed that change. Want to check it out?

@rstreif I understand the difficulty here, however I disagree that CXX is a good check. There are any number of reasons that flag could be set in an env where it's not cross environment (what if there are other things which need to be compiled?).

To answer your questions, CXX is very likely set on Travis and that use case makes sense for your changes - although it doesn't find correctly.

However, when I'm building on macOS Sierra with Erlang 19, CXX is definitely not set and it's generating the wrong values, e.g. it's looking for 16.0 but 15.5 is available, which is why CXX is not a good way to determine this.

I appreciate both of you looking at this, it's causing an issue with our build setups and I've now hit it in multiple scenarios and environments, so there's clearly an underlying issue. If either of you come up with a way to improve this (such as @rstreif saying there may be an issue), then that's awesome but I think this should suffice in the meantime.

rstreif commented 7 years ago

@zackehh Well, CXX and the C/C++ compiler it eventually points to is actually the correct check for the target triplet. It's very origins come from the GNU build system.

However, when I'm building on macOS Sierra with Erlang 19, CXX is definitely not set and it's generating the wrong values, e.g. it's looking for 16.0 but 15.5 is available, which is why CXX is not a good way to determine this.

I do not have access to MacOS-based systems. However, if CXX is not set on that system then erlang:system_info(system_architecture) is used. By the way, target triplets do not include any version numbers. They are machine-vendor-operatingsystem. I am not sure why erlang:system_info(system_architecture) on MacOS would include a version number. However, the function most likely relies on underlying OS facilities to report it. There may be something incorrect there. But it's easy to test. Just invoke erlang:system_info(system_architecture) in eshell.

To answer your questions, CXX is very likely set on Travis and that use case makes sense for your changes - although it doesn't find correctly.

Yes, it makes sense in Travis. If you could provide more info then we can fix it.

rstreif commented 7 years ago

The problem mostly stems from the ambiguous target triplet of machine-vendor-os. Vendor can be left out as it is typically meaningless for a build environment and os can be two fields in the form of kernel-os.

On Fedora/RedHat:

On Ubuntu:

On Debian:

Would be interesting to know what this is on MacOS. For the purpose of erlexec distinguishing the architecture may be sufficient?

whitfin commented 7 years ago

@rstreif here's the info for my machine - as you can see this is why I keep noticing errors ;)

MacOS:

zackehh:~$ gcc -dumpmachine
x86_64-apple-darwin16.1.0
zackehh:~$ erl -noshell -eval 'erlang:display(erlang:system_info(system_architecture))' -eval 'init:stop()'
"x86_64-apple-darwin15.5.0"

This is why it's building x86_64-apple-darwin16.1.0 and then failing to find the right executable at runtime as it's looking for x86_64-apple-darwin15.5.0.

I can try to get stuff on Travis but it's a pain as I'll have to fake commits to a repo just to get it to debug. That'll probably take a while. Anything else you need on MacOS though, just let me know :)

ccrusius commented 7 years ago

The fundamental problem here seems to be, as @rstreif points out, that the platform name is not set correctly by the build system. I had to produce systems that shipped with binaries for different platforms and know the problem well. It is not a nice problem to have to solve but it looks like @saleyn has no choice at this point.

I think the best solution, given the requirements, is to fix the build system to generate vendor-independent platform names (eg darwin-x86_64, linux-ppc, etc). I'm not sure how easy this is with rebar (which I gave up on some time ago and ended up coming up with an Erlang plugin for Gradle).

whitfin commented 7 years ago

@rstreif incidentally, it was building x86_64-apple-darwin16.1.0 and looking for x86_64-apple-darwin15.5.0. However this may help; if you do os:version(), it returns {16, 1, 0} - so maybe that can help with cross compilation?

rstreif commented 7 years ago

@zackehh Unfortunately, in a cross environment the compiler is really the only way to tell what system the binaries are built for. Maybe a solution would be to use an explicit environment variable to tell rebar that it is building in a cross environment. I can provide that easily through the build system.