mesosphere / marathon

Deploy and manage containers (including Docker) on top of Apache Mesos at scale.
https://mesosphere.github.io/marathon/
Apache License 2.0

Can't create new apps #199

Closed burke closed 10 years ago

burke commented 10 years ago

When I POST a new app without ports present, I get an error:

$ curl -X POST -H "Accept: application/json" -H "Content-Type: application/json" -d@- $marathon/v2/apps <<EOF
{
  "id": "shopify",
  "cmd": "resque-garbage",
  "instances": 1,
  "mem": 1024,
  "cpus": 1.0,
  "container": {
    "image": "docker:///registry.borg.chi.shopify.com:5000/shopify:f8d8753d95c38a4daf82be8dd1f0daae414c0de4",
    "options": [

    ]
  },
  "env": {
    "RAILS_ENV": "production"
  },
  "executor": "/var/lib/borg/executors/docker"
}
EOF

{"errors":[{"attribute":"ports","error":"Elements must be unique"}]}

When I POST a new app with ports present, Marathon throws a NullPointerException:

$ curl -X POST -H "Accept: application/json" -H "Content-Type: application/json" -d@- $marathon/v2/apps <<EOF
{
  "id": "shopify",
  "cmd": "resque-garbage",
  "instances": 1,
  "mem": 1024,
  "cpus": 1.0,
  "container": {
    "image": "docker:///registry.borg.chi.shopify.com:5000/shopify:f8d8753d95c38a4daf82be8dd1f0daae414c0de4",
    "options": [

    ]
  },
  "ports": [

  ],
  "env": {
    "RAILS_ENV": "production"
  },
  "executor": "/var/lib/borg/executors/docker"
}
EOF

{"message":null} (500)

This happens when I create a new app in the UI as well.

[Screenshot: 2014-03-26 at 2:40:27 pm]

This is occurring on da21cee8bbef0de7533b703642925df36d811e4f, but not ccd648a51b249925e7c5720779169bdb2b46f20a

burke commented 10 years ago

If it's helpful I can get the excerpt from marathon's log. Let me know.

burke commented 10 years ago

For what it's worth:

2014-03-27_17:25:39.19204 java.lang.NullPointerException
2014-03-27_17:25:39.19205       at mesosphere.marathon.state.AppRepository.store(AppRepository.scala:37)
2014-03-27_17:25:39.19206       at mesosphere.marathon.MarathonScheduler$$anonfun$startApp$1.apply(MarathonScheduler.scala:195)
2014-03-27_17:25:39.19207       at mesosphere.marathon.MarathonScheduler$$anonfun$startApp$1.apply(MarathonScheduler.scala:192)
2014-03-27_17:25:39.19207       at scala.concurrent.Future$$anonfun$flatMap$1.apply(Future.scala:251)
2014-03-27_17:25:39.19208       at scala.concurrent.Future$$anonfun$flatMap$1.apply(Future.scala:249)
2014-03-27_17:25:39.19209       at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
2014-03-27_17:25:39.19210       at scala.concurrent.impl.ExecutionContextImpl$$anon$3.exec(ExecutionContextImpl.scala:107)
2014-03-27_17:25:39.19211       at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
2014-03-27_17:25:39.19211       at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253)
2014-03-27_17:25:39.19212       at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346)
2014-03-27_17:25:39.19213       at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
2014-03-27_17:25:39.19214       at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2014-03-27_17:25:39.19216

This happens intermittently. It feels like Marathon decides at boot time whether or not it will exhibit this bug, then sticks to its guns. Sometimes I can start Marathon and it works just fine; other times I restart it and it fails. No recompilation necessary.

We thought this indicated a JIT bug, so we tried running Marathon with the JIT disabled. That made no difference: we observed both the working and the failing behaviour with the JIT disabled.

I'm not really sure how many revisions back this goes.

burke commented 10 years ago

We managed to "fix" the problem by compiling like so:

mvn -DaddJavacArgs=-g:notc -DaddScalacArgs="-g:line" package && ./bin/build-distribution

With these flags, apps can consistently be pushed, unless we include container info. If we do, this error is ALWAYS produced:

2014-03-27_20:37:32.47641 com.fasterxml.jackson.databind.JsonMappingException: No suitable constructor found for type [simple type, class mesosphere.marathon.ContainerInfo]: can not instantiate from JSON object (need to add/enable type information?)
2014-03-27_20:37:32.47641  at [Source: org.eclipse.jetty.server.HttpInput@24ccab01; line: 1, column: 105] (through reference chain: mesosphere.marathon.api.v1.AppDefinition["container"])
2014-03-27_20:37:32.47642       at com.fasterxml.jackson.databind.JsonMappingException.from(JsonMappingException.java:164)
2014-03-27_20:37:32.47642       at com.fasterxml.jackson.databind.deser.BeanDeserializerBase.deserializeFromObjectUsingNonDefault(BeanDeserializerBase.java:1078)
...many more lines

If we change the flags to -DaddJavacArgs=-g -DaddScalacArgs=-g:notailcalls, the error from immediately above no longer happens, and the original error happens about 20% of the time.

Relevant:

java version "1.7.0_51"
OpenJDK Runtime Environment (IcedTea 2.4.4) (7u51-2.4.4-0ubuntu0.12.04.2)
OpenJDK 64-Bit Server VM (build 24.45-b08, mixed mode)

Linux docker-test1.chi.shopify.com 3.8.0-35-generic #50~precise1-Ubuntu SMP Wed Dec 4 17:25:51 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

graemej commented 10 years ago

@burke and I tried Oracle's 7u51 JDK as well. Result: same intermittent behaviour.

As this is running on a fairly large box (32 cores) we tried taskset to pin the application to a single core in an attempt to reduce concurrency but still see the same intermittent behaviour.

guenter commented 10 years ago

Thanks for the report guys. I believe that's the same issue that's been puzzling us for a while now. Jackson has an issue with Scala case classes on JDK7. It uses reflection and JDK7 doesn't guarantee method ordering, so sometimes it doesn't use the default constructor, but one that takes arguments and just passes null for everything.
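
To make that failure mode concrete, here is a minimal, stdlib-only Java sketch. It is not Jackson's actual code path: `App`, `constructors`, and `instantiateWithFirst` are hypothetical names, and the constructor order is forced by sorting so that both outcomes can be shown in one run. On a real JVM the order of `getDeclaredConstructors()` is simply unspecified.

```java
import java.lang.reflect.Constructor;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class ConstructorOrderSketch {

    // Hypothetical stand-in for a case class; not Marathon's real AppDefinition.
    static final class App {
        final String id;
        final List<Integer> ports;

        // "Default" constructor: every field gets a safe value.
        App() {
            this("", Collections.<Integer>emptyList());
        }

        App(String id, List<Integer> ports) {
            this.id = id;
            this.ports = ports;
        }
    }

    // Returns App's constructors in a chosen order, simulating the two
    // orders a JVM might hand back from getDeclaredConstructors().
    static Constructor<?>[] constructors(boolean noArgFirst) {
        Constructor<?>[] ctors = App.class.getDeclaredConstructors();
        Comparator<Constructor<?>> byArity =
            Comparator.comparingInt((Constructor<?> c) -> c.getParameterTypes().length);
        Arrays.sort(ctors, noArgFirst ? byArity : byArity.reversed());
        return ctors;
    }

    // Naive instantiation: use whichever constructor is listed first and
    // pass null for each parameter it declares.
    static App instantiateWithFirst(Constructor<?>[] ctors) {
        Constructor<?> first = ctors[0];
        try {
            first.setAccessible(true);
            Object[] args = new Object[first.getParameterTypes().length];
            return (App) first.newInstance(args);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        App ok = instantiateWithFirst(constructors(true));
        App broken = instantiateWithFirst(constructors(false));
        System.out.println("no-arg constructor first:  ports = " + ok.ports);
        System.out.println("two-arg constructor first: ports = " + broken.ports);
        // Dereferencing broken.ports (for example broken.ports.size()) would
        // throw a NullPointerException, far from where the object was built.
    }
}
```

If the real cause is along these lines, it would also fit the "decided at boot time" observation: the order is fixed once per JVM start and then stays the same until the next restart.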

There is an issue for Jackson that seems related (https://github.com/FasterXML/jackson-module-scala/issues/117). According to it, behavior should be consistent on JDK6 since it has guaranteed ordering, but it's still failing for us sometimes on some JDK6 versions. Not sure what Jackson is really doing that breaks, but a custom deserializer would probably fix it.

WIP branch here: https://github.com/mesosphere/marathon/commits/wip-deserialization-npe
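
As a rough illustration of why a custom deserializer would sidestep the problem, the sketch below builds the object explicitly, field by field, with no reflective constructor lookup. It deliberately avoids the Jackson API and takes an already-parsed `Map` instead, so it shows only the idea, not the code in the WIP branch. `ExplicitAppReader`, `App`, `fromFields`, and the default of one instance are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ExplicitAppReader {

    // Hypothetical, trimmed-down app definition with a single constructor.
    static final class App {
        final String id;
        final int instances;
        final List<Integer> ports;

        App(String id, int instances, List<Integer> ports) {
            this.id = id;
            this.instances = instances;
            this.ports = ports;
        }
    }

    // Builds an App from parsed JSON fields. Missing optional fields get
    // explicit defaults, so no field is ever left null by accident.
    static App fromFields(Map<String, Object> fields) {
        Object id = fields.get("id");
        if (!(id instanceof String)) {
            throw new IllegalArgumentException("id is required and must be a string");
        }

        // Assumed default for this sketch: one instance when none is given.
        Object rawInstances = fields.get("instances");
        int instances = rawInstances instanceof Number
            ? ((Number) rawInstances).intValue()
            : 1;

        // Absent "ports" becomes an empty list rather than null.
        List<Integer> ports = new ArrayList<>();
        Object rawPorts = fields.get("ports");
        if (rawPorts instanceof List) {
            for (Object p : (List<?>) rawPorts) {
                if (!(p instanceof Number)) {
                    throw new IllegalArgumentException("ports must contain only numbers");
                }
                ports.add(((Number) p).intValue());
            }
        }
        return new App((String) id, instances, Collections.unmodifiableList(ports));
    }

    public static void main(String[] args) {
        Map<String, Object> fields = new HashMap<>();
        fields.put("id", "shopify");
        fields.put("ports", Arrays.<Object>asList(8080, 8081));
        App app = fromFields(fields);
        System.out.println(app.id + " instances=" + app.instances + " ports=" + app.ports);
    }
}
```

The trade-off is more hand-written mapping code, in exchange for construction that no longer depends on which constructor reflection happens to list first.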

ConnorDoyle commented 10 years ago

Duplicate of #181, fixed by merge of PR #215. Closing this for now; please comment if this resurfaces.

burke commented 10 years ago

Wonderful, thanks @ConnorDoyle!