Strange error. Never seen it before. What does the output of

```
$ nix-shell -p nix-info --run "nix-info -m"
```

say?
- system: `"x86_64-linux"`
- host os: `Linux 3.10.0-514.26.2.el7.x86_64, Red Hat Enterprise Linux Server, 7.3 (Maipo)`
- multi-user?: `no`
- sandbox: `no`
- version: `nix-env (Nix) 1.11.16`
- channels(user1): `"nixpkgs-18.03pre125021.73a01aa028c"`
- channels(nixadmin): `"nixos-18.03pre124015.f59a0f7f1a6, nixpkgs-18.03pre124071.310ad4345bb"`
- channels(user2): `"nixpkgs-18.03pre117690.ab2cc75f78, nixos-18.03pre117327.3fe7cddc30"`
- nixpkgs: `/u/avorobiev/.nix-defexpr/channels_nixadmin/nixos`
@facundominguez any ideas?
Not really. Googling for the error shows that it is sometimes caused by overly strict permissions. Perhaps `/tmp` is mounted with `noexec`, but that's a shot in the dark.
`/tmp` is mounted with `noexec`!
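A quick way to confirm this (`findmnt` is a standard util-linux tool; the output shown is illustrative):

```
$ findmnt -no OPTIONS /tmp
rw,nosuid,nodev,noexec,relatime
```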
Adding `--driver-java-options="-Djava.io.tmpdir=..."` got me a little further:
```
18/01/11 16:35:59 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" java.io.IOException: failure to login
    at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:824)
...
```
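For reference, the full invocation with that flag looks roughly like this (the jar name and tmpdir path are illustrative, not taken from my actual setup):

```
$ mkdir -p $HOME/tmp
$ spark-submit --driver-java-options="-Djava.io.tmpdir=$HOME/tmp" \
    ./sparkle-example-hello.jar
```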
That's an unrelated problem. If you simplify the hello example to process a local file (rather than one on AWS S3), you should be able to work around it. I wonder why you're getting login failures, though. Maybe Amazon recently invalidated the access key. @facundominguez can you reproduce?
Re the noexec thing - @facundominguez we should add this to the FAQ. Then we can close this ticket.
@mboes I have just changed the code to use a local file and still got this exception. Here is more output:
```
Exception in thread "main" java.io.IOException: failure to login
    at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:824)
    at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:761)
    at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:634)
    at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2430)
    at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2430)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2430)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:295)
    at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2509)
    at org.apache.spark.SparkContext.getOrCreate(SparkContext.scala)
    at io.tweag.inlinejava.Inline__sparkle072126eViymYvMiEGHk2u3Idxs_Control_Distributed_Spark_Context.inline__method_2(Inline__sparkle072126eViymYvMiEGHk2u3Idxs_Control_Distributed_Spark_Context.java:13)
    at io.tweag.sparkle.SparkMain.invokeMain(Native Method)
    at io.tweag.sparkle.SparkMain.main(SparkMain.java:11)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
```
It looks like it happens before it gets to the line where it needs to read the file.
Yes, when logging in to AWS. See the lines
```haskell
confSet conf "spark.hadoop.fs.s3n.awsAccessKeyId" "AKIAIKSKH5DRWT5OPMSA"
confSet conf "spark.hadoop.fs.s3n.awsSecretAccessKey" "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"
```
Remove them.
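With those two lines removed and a local input file, the hello example reduces to something like this (a minimal sketch assuming sparkle's `newSparkConf`, `getOrCreateSparkContext`, `textFile`, and `count` helpers; the file path is illustrative):

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import Control.Distributed.Spark as RDD

main :: IO ()
main = do
    conf <- newSparkConf "Hello sparkle!"
    -- No spark.hadoop.fs.s3n.* credentials needed for a local file.
    sc  <- getOrCreateSparkContext conf
    -- Illustrative local path in place of the s3n:// URL.
    rdd <- textFile sc "/home/user/lorem-ipsum.txt"
    n   <- RDD.count rdd
    putStrLn $ "line count: " ++ show n
```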
I just tried the hello example and it works for me. It doesn't look like a problem with revoked AWS credentials then.
Moreover, if I remount `/tmp` with `noexec`, I can reproduce this error. And if I use @alexvorobiev's tip of passing `--driver-java-options="-Djava.io.tmpdir=..."`, it works again.
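For anyone who wants to reproduce this, the remount is a one-liner (standard mount(8) syntax; run as root and remember to restore the original options afterwards):

```
# Make /tmp noexec to trigger the failure ...
$ sudo mount -o remount,noexec /tmp
# ... and restore it once done.
$ sudo mount -o remount,exec /tmp
```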
I finally figured out what the problem was. It is our use of Active Directory for authentication (no local unix users). RHEL uses sssd to communicate with the LDAP/AD server, but `libnss_sss.so.2` is not packaged by nix, so we have to use the one from the host OS via `LD_LIBRARY_PATH`. It looks like when called from under stack, spark-submit does not get that environment variable, so it is unable to figure out my user name. Calling spark-submit directly (without `stack exec --nix --`) solved the issue, and the hello example now works for me.
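In other words, the workaround amounts to something like the following (assuming the host's `libnss_sss.so.2` lives in `/usr/lib64`, the usual location on RHEL 7; the jar name is illustrative):

```
# Make the host's sssd NSS module visible to the dynamic linker,
# then call spark-submit directly rather than via `stack exec --nix --`.
$ export LD_LIBRARY_PATH=/usr/lib64:$LD_LIBRARY_PATH
$ spark-submit ./sparkle-example-hello.jar
```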
I explained the error in the troubleshooting section of the README on master.
I am using Nix on RHEL7.