hashicorp / nomad-spark

DEPRECATED: Apache Spark with native support for Nomad as a scheduler

Spark on Nomad seems broken when built with nomad-scala-sdk 0.8.6.1 #23

Closed tantra35 closed 5 years ago

tantra35 commented 5 years ago

When we build the latest version of Spark on Nomad and then launch a Spark job, we get the following exception:

Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: com.hashicorp.nomad.javasdk.ResponseParsingException: Unable to parse [simple type, class com.hashicorp.nomad.apimodel.Node] from response body JSON: {"Attributes":{"unique.consul.name":"ip-172-24-0-115","consul.version":"1.3.1","platform.aws.placement.availability-zone":"us-east-1a","unique.platform.aws.local-ipv4":"172.24.0.115","driver.docker.version":"18.09.2","driver.exec":"1","vault.version":"0.10.4","cpu.arch":"amd64","driver.docker":"1","nomad.revision":"fcc4149c55399eb4979cd3fe3fb983cfe6c8449a+CHANGES","consul.server":"false","unique.storage.volume":"/dev/xvda1","kernel.name":"linux","driver.docker.volumes.enabled":"1","unique.cgroup.mountpoint":"/sys/fs/cgroup","platform.aws.instance-type":"t2.xlarge","cpu.totalcompute":"9200","os.version":"16.04","nomad.version":"0.8.6","vault.accessible":"true","unique.platform.aws.local-hostname":"ip-172-24-0-115.ec2.internal","os.signals":"SIGBUS,SIGCONT,SIGALRM,SIGTRAP,SIGTSTP,SIGINT,SIGHUP,SIGSEGV,SIGSYS,SIGTTIN,SIGUSR1,SIGXCPU,SIGXFSZ,SIGFPE,SIGKILL,SIGQUIT,SIGIO,SIGCHLD,SIGILL,SIGSTOP,SIGTTOU,SIGWINCH,SIGABRT,SIGPIPE,SIGUSR2,SIGIOT,SIGPROF,SIGURG,SIGTERM","cpu.numcores":"4","unique.platform.aws.hostname":"ip-172-24-0-115.ec2.internal","kernel.version":"4.4.0-1075-aws","cpu.frequency":"2300","os.name":"ubuntu","vault.cluster_id":"61e38844-c8f3-eb77-1aa2-dc40390a5039","consul.datacenter":"amrv01","unique.storage.bytestotal":"16586264576","consul.revision":"f2b13f302","unique.network.ip-address":"172.24.0.115","vault.cluster_name":"vault-cluster-1546079e","memory.totalbytes":"16825647104","cpu.modelname":"Intel(R) Xeon(R) CPU E5-2686 v4 @ 
2.30GHz","unique.storage.bytesfree":"14180642816","unique.hostname":"ip-172-24-0-115","unique.platform.aws.instance-id":"i-0b3259c60b40c49f2","platform.aws.ami-id":"ami-0eace97d273ebb5c0","driver.docker.bridge_ip":"172.17.0.1","driver.raw_exec":"1"},"ComputedClass":"v1:10821272756959570835","CreateIndex":5370372,"Datacenter":"jupiterhub","Drain":false,"DrainStrategy":null,"Drivers":{"qemu":{"Attributes":null,"Detected":false,"HealthDescription":"Driver qemu is not detected","Healthy":false,"UpdateTime":"2019-03-19T23:19:03.629626062+03:00"},"rkt":{"Attributes":null,"Detected":false,"HealthDescription":"Driver rkt is not detected","Healthy":false,"UpdateTime":"2019-03-19T23:20:03.63414575+03:00"},"docker":{"Attributes":{"driver.docker.volumes.enabled":"1","driver.docker.bridge_ip":"172.17.0.1","driver.docker.version":"18.09.2"},"Detected":true,"HealthDescription":"Driver is available and responsive","Healthy":true,"UpdateTime":"2019-03-19T23:20:03.637791458+03:00"},"exec":{"Attributes":null,"Detected":true,"HealthDescription":"Driver exec is detected","Healthy":true,"UpdateTime":"2019-03-19T23:20:03.637143678+03:00"},"raw_exec":{"Attributes":null,"Detected":true,"HealthDescription":"Driver raw_exec is detected","Healthy":true,"UpdateTime":"2019-03-19T23:19:03.635327084+03:00"},"java":{"Attributes":null,"Detected":false,"HealthDescription":"Driver java is not detected","Healthy":false,"UpdateTime":"2019-03-19T23:19:03.635446043+03:00"}},"Events":[{"CreateIndex":0,"Details":null,"Message":"Node 
registered","Subsystem":"Cluster","Timestamp":"2019-03-19T23:19:03+03:00"}],"HTTPAddr":"172.24.0.115:4646","ID":"87a9491f-14ff-5465-8752-88fea65840ff","Links":{"consul":"amrv01.ip-172-24-0-115","aws.ec2":"us-east-1a.i-0b3259c60b40c49f2"},"Meta":{"CONSUL_ADDR":"172.24.0.115"},"ModifyIndex":5370381,"Name":"ip-172-24-0-115","NodeClass":"jupiterhub-dockerworker","Reserved":{"CPU":100,"DiskMB":0,"IOPS":0,"MemoryMB":256,"Networks":null},"Resources":{"CPU":9200,"DiskMB":13523,"IOPS":0,"MemoryMB":16046,"Networks":[{"CIDR":"172.24.0.115/32","Device":"eth0","DynamicPorts":null,"IP":"172.24.0.115","MBits":750,"ReservedPorts":null}]},"SchedulingEligibility":"eligible","SecretID":"","Status":"ready","StatusDescription":"","StatusUpdatedAt":1553027073,"TLSEnabled":false}
    at com.hashicorp.nomad.javasdk.JsonParser.extractValue(JsonParser.java:26)
    at com.hashicorp.nomad.javasdk.ResponseAdapter.apply(ResponseAdapter.java:32)
    at com.hashicorp.nomad.javasdk.NomadApiClient.execute(NomadApiClient.java:376)
    at com.hashicorp.nomad.javasdk.ApiBase.executeServerQueryRaw(ApiBase.java:203)
    at com.hashicorp.nomad.javasdk.ApiBase.executeServerQuery(ApiBase.java:116)
    at com.hashicorp.nomad.javasdk.ApiBase.executeServerQuery(ApiBase.java:101)
    at com.hashicorp.nomad.javasdk.NodesApi.info(NodesApi.java:102)
    at com.hashicorp.nomad.scalasdk.ScalaNodesApi.info(ScalaNodesApi.scala:43)
    at org.apache.spark.scheduler.cluster.nomad.NomadJobManipulator.fetchLogUrlsForTask(NomadJobManipulator.scala:63)
    at org.apache.spark.scheduler.cluster.nomad.SparkNomadJobController$$anonfun$fetchDriverLogUrls$1$$anonfun$apply$1.apply(SparkNomadJobController.scala:54)
    at org.apache.spark.scheduler.cluster.nomad.SparkNomadJobController$$anonfun$fetchDriverLogUrls$1$$anonfun$apply$1.apply(SparkNomadJobController.scala:53)
    at scala.Option.map(Option.scala:146)
    at org.apache.spark.scheduler.cluster.nomad.SparkNomadJobController$$anonfun$fetchDriverLogUrls$1.apply(SparkNomadJobController.scala:53)
    at org.apache.spark.scheduler.cluster.nomad.SparkNomadJobController$$anonfun$fetchDriverLogUrls$1.apply(SparkNomadJobController.scala:52)
    at scala.Option.flatMap(Option.scala:171)
    at org.apache.spark.scheduler.cluster.nomad.SparkNomadJobController.fetchDriverLogUrls(SparkNomadJobController.scala:52)
    at org.apache.spark.scheduler.cluster.nomad.NomadClusterSchedulerBackend.getDriverLogUrls(NomadClusterSchedulerBackend.scala:98)
    at org.apache.spark.SparkContext.postApplicationStart(SparkContext.scala:2384)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:556)
    at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:238)
    at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
    at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Caused by: com.fasterxml.jackson.databind.exc.InvalidFormatException: Can not construct instance of java.util.Date from String value ("2019-03-19T23:19:03.629626062+03:00"): not a valid representation (error: Failed to parse Date value '2019-03-19T23:19:03.629626062+03:00': Can not parse date "2019-03-19T23:19:03.629626062+0300": while it seems to fit format 'yyyy-MM-dd'T'HH:mm:ss.SSSZ', parsing fails (leniency? null))
 at [Source: {"Attributes":{"unique.consul.name":"ip-172-24-0-115","consul.version":"1.3.1","platform.aws.placement.availability-zone":"us-east-1a","unique.platform.aws.local-ipv4":"172.24.0.115","driver.docker.version":"18.09.2","driver.exec":"1","vault.version":"0.10.4","cpu.arch":"amd64","driver.docker":"1","nomad.revision":"fcc4149c55399eb4979cd3fe3fb983cfe6c8449a+CHANGES","consul.server":"false","unique.storage.volume":"/dev/xvda1","kernel.name":"linux","driver.docker.volumes.enabled":"1","unique.cgroup.mountpoint":"/sys/fs/cgroup","platform.aws.instance-type":"t2.xlarge","cpu.totalcompute":"9200","os.version":"16.04","nomad.version":"0.8.6","vault.accessible":"true","unique.platform.aws.local-hostname":"ip-172-24-0-115.ec2.internal","os.signals":"SIGBUS,SIGCONT,SIGALRM,SIGTRAP,SIGTSTP,SIGINT,SIGHUP,SIGSEGV,SIGSYS,SIGTTIN,SIGUSR1,SIGXCPU,SIGXFSZ,SIGFPE,SIGKILL,SIGQUIT,SIGIO,SIGCHLD,SIGILL,SIGSTOP,SIGTTOU,SIGWINCH,SIGABRT,SIGPIPE,SIGUSR2,SIGIOT,SIGPROF,SIGURG,SIGTERM","cpu.numcores":"4","unique.platform.aws.hostname":"ip-172-24-0-115.ec2.internal","kernel.version":"4.4.0-1075-aws","cpu.frequency":"2300","os.name":"ubuntu","vault.cluster_id":"61e38844-c8f3-eb77-1aa2-dc40390a5039","consul.datacenter":"amrv01","unique.storage.bytestotal":"16586264576","consul.revision":"f2b13f302","unique.network.ip-address":"172.24.0.115","vault.cluster_name":"vault-cluster-1546079e","memory.totalbytes":"16825647104","cpu.modelname":"Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz","unique.storage.bytesfree":"14180642816","unique.hostname":"ip-172-24-0-115","unique.platform.aws.instance-id":"i-0b3259c60b40c49f2","platform.aws.ami-id":"ami-0eace97d273ebb5c0","driver.docker.bridge_ip":"172.17.0.1","driver.raw_exec":"1"},"ComputedClass":"v1:10821272756959570835","CreateIndex":5370372,"Datacenter":"jupiterhub","Drain":false,"DrainStrategy":null,"Drivers":{"qemu":{"Attributes":null,"Detected":false,"HealthDescription":"Driver qemu is not 
detected","Healthy":false,"UpdateTime":"2019-03-19T23:19:03.629626062+03:00"},"rkt":{"Attributes":null,"Detected":false,"HealthDescription":"Driver rkt is not detected","Healthy":false,"UpdateTime":"2019-03-19T23:20:03.63414575+03:00"},"docker":{"Attributes":{"driver.docker.volumes.enabled":"1","driver.docker.bridge_ip":"172.17.0.1","driver.docker.version":"18.09.2"},"Detected":true,"HealthDescription":"Driver is available and responsive","Healthy":true,"UpdateTime":"2019-03-19T23:20:03.637791458+03:00"},"exec":{"Attributes":null,"Detected":true,"HealthDescription":"Driver exec is detected","Healthy":true,"UpdateTime":"2019-03-19T23:20:03.637143678+03:00"},"raw_exec":{"Attributes":null,"Detected":true,"HealthDescription":"Driver raw_exec is detected","Healthy":true,"UpdateTime":"2019-03-19T23:19:03.635327084+03:00"},"java":{"Attributes":null,"Detected":false,"HealthDescription":"Driver java is not detected","Healthy":false,"UpdateTime":"2019-03-19T23:19:03.635446043+03:00"}},"Events":[{"CreateIndex":0,"Details":null,"Message":"Node registered","Subsystem":"Cluster","Timestamp":"2019-03-19T23:19:03+03:00"}],"HTTPAddr":"172.24.0.115:4646","ID":"87a9491f-14ff-5465-8752-88fea65840ff","Links":{"consul":"amrv01.ip-172-24-0-115","aws.ec2":"us-east-1a.i-0b3259c60b40c49f2"},"Meta":{"CONSUL_ADDR":"172.24.0.115"},"ModifyIndex":5370381,"Name":"ip-172-24-0-115","NodeClass":"jupiterhub-dockerworker","Reserved":{"CPU":100,"DiskMB":0,"IOPS":0,"MemoryMB":256,"Networks":null},"Resources":{"CPU":9200,"DiskMB":13523,"IOPS":0,"MemoryMB":16046,"Networks":[{"CIDR":"172.24.0.115/32","Device":"eth0","DynamicPorts":null,"IP":"172.24.0.115","MBits":750,"ReservedPorts":null}]},"SchedulingEligibility":"eligible","SecretID":"","Status":"ready","StatusDescription":"","StatusUpdatedAt":1553027073,"TLSEnabled":false}; line: 1, column: 1973] (through reference chain: com.hashicorp.nomad.apimodel.Node["Drivers"]->java.util.LinkedHashMap["qemu"]->com.hashicorp.nomad.apimodel.DriverInfo["UpdateTime"])
    at com.fasterxml.jackson.databind.exc.InvalidFormatException.from(InvalidFormatException.java:74)
    at com.fasterxml.jackson.databind.DeserializationContext.weirdStringException(DeserializationContext.java:1022)
    at com.fasterxml.jackson.databind.deser.std.StdDeserializer._parseDate(StdDeserializer.java:788)
    at com.fasterxml.jackson.databind.deser.std.DateDeserializers$DateBasedDeserializer._parseDate(DateDeserializers.java:175)
    at com.fasterxml.jackson.databind.deser.std.DateDeserializers$DateDeserializer.deserialize(DateDeserializers.java:261)
    at com.fasterxml.jackson.databind.deser.std.DateDeserializers$DateDeserializer.deserialize(DateDeserializers.java:245)
    at com.fasterxml.jackson.databind.deser.SettableBeanProperty.deserialize(SettableBeanProperty.java:495)
    at com.fasterxml.jackson.databind.deser.impl.MethodProperty.deserializeAndSet(MethodProperty.java:95)
    at com.fasterxml.jackson.databind.deser.BeanDeserializer.vanillaDeserialize(BeanDeserializer.java:260)
    at com.fasterxml.jackson.databind.deser.BeanDeserializer.deserialize(BeanDeserializer.java:125)
    at com.fasterxml.jackson.databind.deser.std.MapDeserializer._readAndBindStringMap(MapDeserializer.java:496)
    at com.fasterxml.jackson.databind.deser.std.MapDeserializer.deserialize(MapDeserializer.java:342)
    at com.fasterxml.jackson.databind.deser.std.MapDeserializer.deserialize(MapDeserializer.java:26)
    at com.fasterxml.jackson.databind.deser.SettableBeanProperty.deserialize(SettableBeanProperty.java:495)
    at com.fasterxml.jackson.databind.deser.impl.MethodProperty.deserializeAndSet(MethodProperty.java:95)
    at com.fasterxml.jackson.databind.deser.BeanDeserializer.vanillaDeserialize(BeanDeserializer.java:260)
    at com.fasterxml.jackson.databind.deser.BeanDeserializer.deserialize(BeanDeserializer.java:125)
    at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:3807)
    at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2835)
    at com.hashicorp.nomad.javasdk.NomadJson.deserialize(NomadJson.java:105)
    at com.hashicorp.nomad.javasdk.JsonParser.extractValue(JsonParser.java:24)
    ... 30 more

As you can see, the SDK can't parse the Date field: it tries to apply the format yyyy-MM-dd'T'HH:mm:ss.SSSZ to the string 2019-03-19T23:19:03.629626062+03:00. But as documented at https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html, the final pattern letter Z only matches RFC 822 offsets like +0300, which differs from what we actually get from Nomad (+03:00), so the format string should use the ISO 8601 pattern letter instead, e.g. yyyy-MM-dd'T'HH:mm:ss.SSSXXX. We therefore concluded that nomad-scala-sdk 0.8.6 is broken. As a workaround, we simply downgraded the Nomad Scala SDK to version 0.7.0.2.
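For illustration, here is a minimal standalone Java snippet (not part of the SDK; the class and helper names are ours) showing that the Z pattern letter rejects a colon-separated offset while XXX accepts it. The timestamp is shortened to millisecond precision so that only the offset handling is exercised:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;

public class OffsetParsingDemo {
    // Returns true if the given SimpleDateFormat pattern can parse the value
    static boolean parses(String pattern, String value) {
        try {
            new SimpleDateFormat(pattern).parse(value);
            return true;
        } catch (ParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Colon-separated offset, as Nomad returns it
        String ts = "2019-03-19T23:19:03.629+03:00";

        // 'Z' expects an RFC 822 offset such as +0300, so this fails
        System.out.println(parses("yyyy-MM-dd'T'HH:mm:ss.SSSZ", ts));

        // 'XXX' expects an ISO 8601 offset such as +03:00, so this succeeds
        System.out.println(parses("yyyy-MM-dd'T'HH:mm:ss.SSSXXX", ts));
    }
}
```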

cgbaker commented 5 years ago

@tantra35 : try 0.8.6.1, per https://github.com/hashicorp/nomad-java-sdk/issues/24

tantra35 commented 5 years ago

@cgbaker Sorry to disappoint you, but we did mean SDK 0.8.6.1. Even on that version (0.8.6.1), Spark on Nomad throws the exception I mentioned earlier. In the script we mention in https://github.com/hashicorp/nomad-spark/issues/24, you can see that we specifically replace 0.8.6.1 with 0.7.0.2 to work around this.

cgbaker commented 5 years ago

The issue likely relates to this: https://github.com/hashicorp/nomad-java-sdk/issues/6#issuecomment-432413422

tantra35 commented 5 years ago

@cgbaker Setting TZ to UTC looks like a workaround, the same as our solution of rebuilding with nomad-scala-sdk 0.7.0.2, but it is only a workaround, not a complete fix. IMHO, a more robust solution would be to consider all the datetime formats Nomad can produce and handle them in a try/fallback manner.
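The try/fallback idea could be sketched roughly like this (a hypothetical illustration, not the SDK's actual code; the class name and candidate pattern list are assumptions):

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

public class FallbackDateParser {
    // Candidate patterns, tried in order; this list is illustrative, not exhaustive
    private static final String[] PATTERNS = {
        "yyyy-MM-dd'T'HH:mm:ss.SSSXXX", // fractional seconds + ISO 8601 offset like +03:00
        "yyyy-MM-dd'T'HH:mm:ssXXX",     // no fractional seconds
        "yyyy-MM-dd'T'HH:mm:ss.SSSZ",   // RFC 822 offset like +0300
    };

    // Try each candidate pattern in turn; rethrow the last failure if none matches
    public static Date parse(String value) throws ParseException {
        ParseException last = null;
        for (String pattern : PATTERNS) {
            try {
                return new SimpleDateFormat(pattern).parse(value);
            } catch (ParseException e) {
                last = e; // fall through to the next candidate
            }
        }
        throw last;
    }
}
```

One caveat with this sketch: SimpleDateFormat's SSS field is milliseconds, so Nomad's nanosecond-precision timestamps would still need truncation (or a java.time-based parser) to round-trip exactly.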