mesosphere-backup / dcos-cassandra-service

DEPRECATED—Open source Apache Cassandra running on DC/OS is now replaced by mesosphere/dcos-commons/frameworks/cassandra. This repository will be deleted at the end of 2017.
Apache License 2.0

Bug: Scheduler accepts offer for insufficient disk resources, fails to detect CREATE failure #418

Open dylanwilder opened 7 years ago

dylanwilder commented 7 years ago

There seem to be two separate issues here. The Mesos agent is configured with ~15 GB of ROOT disk, but Cassandra is looking for 20 GB. See the offer here:

INFO  [2017-03-14 20:42:39,688] com.mesosphere.dcos.cassandra.scheduler.CassandraScheduler: Received Offer: id { value: "ec7ab46c-8786-43aa-8bab-148ea8f9a872-O36443" } framework_id { value: "ec7ab46c-8786-43aa-8bab-148ea8f9a872-0002" } slave_id { value: "cf0e92c8-2784-4613-a873-5c936a02eb70-S50" } hostname: "10.X.X.50" resources { name: "cpus" type: SCALAR scalar { value: 32.8 } role: "*" } resources { name: "mem" type: SCALAR scalar { value: 245176.0 } role: "*" } resources { name: "disk" type: SCALAR scalar { value: 15275.0 } role: "*" } resources { name: "disk" type: SCALAR scalar { value: 675867.0 } role: "*" disk { source { type: PATH path { root: "/mnt/data1" } } } } resources { name: "disk" type: SCALAR scalar { value: 675867.0 } role: "*" disk { source { type: MOUNT mount { root: "/mnt/data2" } } } } resources { name: "disk" type: SCALAR scalar { value: 675867.0 } role: "*" disk { source { type: PATH path { root: "/mnt/data3" } } } } resources { name: "disk" type: SCALAR scalar { value: 675867.0 } role: "*" disk { source { type: MOUNT mount { root: "/mnt/data4" } } } } resources { name: "disk" type: SCALAR scalar { value: 675867.0 } role: "*" disk { source { type: PATH path { root: "/mnt/data5" } } } } resources { name: "disk" type: SCALAR scalar { value: 675867.0 } role: "*" disk { source { type: MOUNT mount { root: "/mnt/data6" } } } } resources { name: "ports" type: RANGES ranges { range { begin: 31000 end: 31028 } range { begin: 31030 end: 31295 } range { begin: 31298 end: 32305 } range { begin: 32307 end: 33000 } } role: "*" } attributes { name: "nfs" type: TEXT text { value: "group1" } } attributes { name: "dnsHostname" type: TEXT text { value: "dny1-bvlt-r1n11" } } attributes { name: "rack" type: TEXT text { value: "r1" } } attributes { name: "diskType" type: TEXT text { value: "SSD" } } attributes { name: "ipAddress" type: TEXT text { value: "10.X.X.50" } } url { scheme: "http" address { hostname: "10.X.X.50" ip: "10.X.X.50" port: 5051 } path: "/slave(1)" }
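
For reference, the mismatch in this offer can be checked with a few lines of Python. This is only a sketch: the dictionary layout and the helper names `root_disk_mb` and `satisfies_root_disk` are hypothetical, not part of the scheduler or the Mesos API, and it assumes that only disk resources without a `source` count as ROOT disk.

```python
# Hypothetical, simplified view of the disk resources in the offer above.
# "source" is None for ROOT disk and "PATH" or "MOUNT" otherwise (assumption).
offered_disks = [
    {"value_mb": 15275.0, "role": "*", "source": None},       # ROOT disk
    {"value_mb": 675867.0, "role": "*", "source": "PATH"},    # /mnt/data1
    {"value_mb": 675867.0, "role": "*", "source": "MOUNT"},   # /mnt/data2
    {"value_mb": 675867.0, "role": "*", "source": "PATH"},    # /mnt/data3
    {"value_mb": 675867.0, "role": "*", "source": "MOUNT"},   # /mnt/data4
    {"value_mb": 675867.0, "role": "*", "source": "PATH"},    # /mnt/data5
    {"value_mb": 675867.0, "role": "*", "source": "MOUNT"},   # /mnt/data6
]


def root_disk_mb(disks, role="*"):
    """Sum the ROOT (source-less) disk offered for the given role."""
    return sum(
        d["value_mb"]
        for d in disks
        if d["source"] is None and d["role"] == role
    )


def satisfies_root_disk(disks, required_mb):
    """True only if the offered ROOT disk covers the requirement."""
    return root_disk_mb(disks) >= required_mb


required_mb = 20480.0  # the disk requirement Cassandra is asking for
available_mb = root_disk_mb(offered_disks)
shortfall_mb = required_mb - available_mb

print(f"ROOT disk offered: {available_mb} MB, required: {required_mb} MB")
print(f"Offer is sufficient: {satisfies_root_disk(offered_disks, required_mb)}")
```

Under that assumption the offer is about 5 GB short on ROOT disk, even though the PATH and MOUNT disks are each far larger than the requirement.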

Cassandra decides to accept this insufficient offer:

INFO  [2017-03-14 20:42:39,776] org.apache.mesos.offer.MesosResourcePool: Retrieving resource for reservation
INFO  [2017-03-14 20:42:39,776] org.apache.mesos.offer.OfferEvaluator: Satisfying resource requirement: name: "disk" type: SCALAR scalar { value: 20480.0 } role: "cassandra.storage" disk { persistence { id: "" principal: "cassandra.storage" } volume { container_path: "volume" mode: RW } } reservation { principal: "cassandra.storage" labels { labels { key: "resource_id" value: "" } } }
with resource: name: "disk" type: SCALAR scalar { value: 20480.0 } role: "*"
INFO  [2017-03-14 20:42:39,777] org.apache.mesos.offer.OfferEvaluator: Reserves Resource
INFO  [2017-03-14 20:42:39,777] org.apache.mesos.offer.OfferEvaluator: Creates Volume
INFO  [2017-03-14 20:42:39,778] org.apache.mesos.offer.OfferEvaluator: Fulfilled resource: name: "disk" type: SCALAR scalar { value: 20480.0 } role: "cassandra.storage" disk { persistence { id: "ec53c7c6-fe3d-4b16-8b14-cf98b5fa03e4" principal: "cassandra.storage" } volume { container_path: "volume" mode: RW } } reservation { principal: "cassandra.storage" labels { labels { key: "resource_id" value: "f686ff09-c058-4a0a-9d69-a9bf04111c7c" } } }
...
INFO  [2017-03-14 20:42:39,790] org.apache.mesos.offer.OfferAccepter: Performing Operation: type: RESERVE reserve { resources { name: "disk" type: SCALAR scalar { value: 20480.0 } role: "cassandra.storage" reservation { principal: "cassandra.storage" labels { labels { key: "resource_id" value: "f686ff09-c058-4a0a-9d69-a9bf04111c7c" } } } } }
...
INFO  [2017-03-14 20:42:39,793] org.apache.mesos.offer.OfferAccepter: Performing Operation: type: CREATE create { volumes { name: "disk" type: SCALAR scalar { value: 20480.0 } role: "cassandra.storage" disk { persistence { id: "ec53c7c6-fe3d-4b16-8b14-cf98b5fa03e4" principal: "cassandra.storage" } volume { container_path: "volume" mode: RW } } reservation { principal: "cassandra.storage" labels { labels { key: "resource_id" value: "f686ff09-c058-4a0a-9d69-a9bf04111c7c" } } } } }

But after launching, it receives a failure notification from the master:

INFO  [2017-03-14 20:42:39,869] INFO  [2017-03-14 20:42:39,869] com.mesosphere.dcos.cassandra.scheduler.CassandraScheduler: Received status update for taskId=node-0__dd62e043-d6a3-4fbb-8400-07c7af0da107 state=TASK_ERROR source=SOURCE_MASTER reason=REASON_TASK_INVALID message='Task uses more resources cpus(cassandra.storage, cassandra.storage, {resource_id: c9e73072-7792-4d9e-a5b1-6bfb269ec12c}):4; mem(cassandra.storage, cassandra.storage, {resource_id: fb6cdba9-246c-449f-a84a-2943e676ff08}):10240; disk(cassandra.storage, cassandra.storage, {resource_id: f686ff09-c058-4a0a-9d69-a9bf04111c7c})[ec53c7c6-fe3d-4b16-8b14-cf98b5fa03e4:volume]:20480; ports(cassandra.storage, cassandra.storage, {resource_id: 8ce72836-4b31-46ab-8c0a-cfbe92f17f31}):[31990-31994]; cpus(cassandra.storage, cassandra.storage, {resource_id: b5fc6e9b-b90b-40f9-a8a3-d20568d27b12}):0.1; mem(cassandra.storage, cassandra.storage, {resource_id: 16fce9a1-6652-4035-837f-d13acf3ee453}):768; ports(cassandra.storage, cassandra.storage, {resource_id: 2919035c-32e2-4233-9e3b-b8373b711d12}):[31995-31995] than available cpus(*):27.7; mem(*):233912; disk(*):15275; disk(*)[]:675867; disk(*)[]:675867; disk(*)[]:675867; disk(*)[]:675867; disk(*)[]:675867; disk(*)[]:675867; ports(*):[31000-31028, 31030-31295, 31298-31989, 31996-32305, 32307-33000]; cpus(cassandra.storage, cassandra.storage, {resource_id: b5fc6e9b-b90b-40f9-a8a3-d20568d27b12}):0.1; mem(cassandra.storage, cassandra.storage, {resource_id: 16fce9a1-6652-4035-837f-d13acf3ee453}):768; ports(cassandra.storage, cassandra.storage, {resource_id: 2919035c-32e2-4233-9e3b-b8373b711d12}):[31995-31995]; cpus(cassandra.storage, cassandra.storage, {resource_id: c9e73072-7792-4d9e-a5b1-6bfb269ec12c}):4; mem(cassandra.storage, cassandra.storage, {resource_id: fb6cdba9-246c-449f-a84a-2943e676ff08}):10240; ports(cassandra.storage, cassandra.storage, {resource_id: 8ce72836-4b31-46ab-8c0a-cfbe92f17f31}):[31990-31994]; cpus(cassandra.storage, cassandra.storage, 
{resource_id: 0cf86d71-ac9f-4242-bd3b-f4862bf91a12}):1; mem(cassandra.storage, cassandra.storage, {resource_id: 77c294be-0129-4c4b-bd9f-1269697f2c7b}):256'

And finally, on attempting to relaunch, it is unable to, because it cannot find the nonexistent persistence ID:

INFO  [2017-03-14 20:42:41,746] org.apache.mesos.offer.MesosResourcePool: Retrieving reserved resource
WARN  [2017-03-14 20:42:41,746] org.apache.mesos.offer.MesosResourcePool: Failed to find reserved resource: f686ff09-c058-4a0a-9d69-a9bf04111c7c, in available resources: [0cf86d71-ac9f-4242-bd3b-f4862bf91a12, 77c294be-0129-4c4b-bd9f-1269697f2c7b, 8ce72836-4b31-46ab-8c0a-cfbe92f17f31]
WARN  [2017-03-14 20:42:41,747] org.apache.mesos.offer.OfferEvaluator: Failed to satisfy resource requirement: name: "disk" type: SCALAR scalar { value: 20480.0 } role: "cassandra.storage" disk { persistence { id: "ec53c7c6-fe3d-4b16-8b14-cf98b5fa03e4" principal: "cassandra.storage" } volume { container_path: "volume" mode: RW } } reservation { principal: "cassandra.storage" labels { labels { key: "resource_id" value: "f686ff09-c058-4a0a-9d69-a9bf04111c7c" } } }
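
The lookup that fails in these two warnings amounts to a set difference between the reservations the scheduler expects and the ones actually present in the offer. A minimal sketch, using the IDs from the log above; the function name `missing_reservations` is hypothetical and this is not how `MesosResourcePool` is implemented:

```python
def missing_reservations(expected_ids, offered_ids):
    """Return the expected resource_ids that the offer does not contain."""
    offered = set(offered_ids)
    return [rid for rid in expected_ids if rid not in offered]


# resource_id the scheduler recorded for the disk reservation
expected = ["f686ff09-c058-4a0a-9d69-a9bf04111c7c"]

# resource_ids listed as available in the WARN line above
offered = [
    "0cf86d71-ac9f-4242-bd3b-f4862bf91a12",
    "77c294be-0129-4c4b-bd9f-1269697f2c7b",
    "8ce72836-4b31-46ab-8c0a-cfbe92f17f31",
]

lost = missing_reservations(expected, offered)
print("Reservations that never materialized:", lost)
```

A non-empty result here would be one signal that the RESERVE/CREATE pair was dropped and the recorded IDs are stale, although whether the scheduler should react by discarding them is a design question this sketch does not answer.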

From the master logs:

Mar 14 20:42:39 dny1-bvlt-r1n16 mesos-master[7950]: E0314 20:42:39.860085  7998 master.cpp:1955] Dropping CREATE offer operation from framework ec7ab46c-8786-43aa-8bab-148ea8f9a872-0002 (cassandra-s4) at scheduler-85f415e9-2f99-463f-aa3d-1defa9f88acd@10.X.X.47:46523: Invalid CREATE Operation: Insufficient disk resources

mrbrowning commented 7 years ago

Hi Dylan, I tried to reproduce this with an analogous setup: agents offering 36GB of ROOT disk space, 40GB of MOUNT disk space, and with the Cassandra scheduler set up to expect 38GB of ROOT disk space (running scheduler version 1.0.25-3.0.10). I saw the expected behavior, which is that all incoming offers were rejected and no task launch or volume creation was attempted. Can you give some more details about your setup? DC/OS version, Cassandra version, scheduler configuration on launch?

triclambert commented 6 years ago

This repo is deprecated and will be archived in one week. Please see the latest version of Cassandra or DSE for DC/OS:

https://docs.mesosphere.com/service-docs/cassandra/
https://docs.mesosphere.com/service-docs/dse/ (enterprise-only)