mesosphere / spark-build

Used to build the mesosphere/spark docker image and the DC/OS Spark package

dcos Spark doesn’t run jobs #70

Open janpavtel opened 7 years ago

janpavtel commented 7 years ago

Please answer the following questions before submitting your issue. Thanks!

What version of DC/OS + DC/OS CLI are you using (dcos --version)?

dcoscli.version=0.4.13
dcos.version=1.7.0
dcos.commit=92d61c576b3fe0dd1b8b15e7695b55ff7ce254fd
dcos.bootstrap-id=0ab2e04446f34465aed3b1ffb4f56836d681d6c7

What operating system and version are you using?

Ubuntu 16.04 LTS

What did you do?

./dcos package install spark

Installing Marathon app for package [spark] version [1.0.2-2.0.0]
Installing CLI subcommand for package [spark] version [1.0.2-2.0.0]
New command available: dcos spark
DC/OS Spark is being installed!

    Documentation: https://docs.mesosphere.com/current/usage/service-guides/spark/
    Issues: https://docs.mesosphere.com/support/
./dcos spark run --submit-args='-Dspark.mesos.coarse=true --driver-cores 1 --driver-memory 1024M --class org.apache.spark.examples.SparkPi https://downloads.mesosphere.com/spark/assets/spark-examples_2.10-1.4.0-SNAPSHOT.jar 30'

Run job succeeded. Submission id: driver-20161013074900-0001
./dcos spark status driver-20161013074900-0001
Submission ID: driver-20161013074900-0001
Driver state: QUEUED

What did you expect to see?

Spark should run jobs.

What did you see instead?

The job stays in the QUEUED state indefinitely.

Spark is listed in packages but not as a service:

./dcos package list
NAME         VERSION      APP                   COMMAND  DESCRIPTION                                                                                                                                         
chronos      2.4.0        /chronos-default      ---      A fault tolerant job scheduler for Mesos which handles dependencies and ISO8601 based schedules.                                                    
marathon-lb  1.2.2        /marathon-lb-default  ---      HAProxy configured using Marathon state                                                                                                             
spark        1.0.2-2.0.0  /spark                spark    Spark is a fast and general cluster computing system for Big Data.  Documentation: https://docs.mesosphere.com/current/usage/service-guides/spark/  
./dcos service
NAME         HOST     ACTIVE  TASKS  CPU   MEM     DISK  ID                                         
chronos   10.32.0.4    True     0    0.0   0.0     0.0   b7a4a4e3-c62f-4175-bc54-c7305a411174-0002  
marathon  172.16.0.5   True     7    6.0  7168.0  128.0  0f2fb632-3277-41d4-be0a-ed90d5c7c27d-0000  

System stats

{
    "allocator/event_queue_dispatches": 0,
    "frameworks/marathon/messages_processed": 117795,
    "frameworks/marathon/messages_received": 117795,
    "master/cpus_percent": 0.6,
    "master/cpus_revocable_percent": 0,
    "master/cpus_revocable_total": 0,
    "master/cpus_revocable_used": 0,
    "master/cpus_total": 10,
    "master/cpus_used": 6,
    "master/disk_percent": 0.000538493899873791,
    "master/disk_revocable_percent": 0,
    "master/disk_revocable_total": 0,
    "master/disk_revocable_used": 0,
    "master/disk_total": 237700,
    "master/disk_used": 128,
    "master/dropped_messages": 0,
    "master/elected": 1,
    "master/event_queue_dispatches": 25,
    "master/event_queue_http_requests": 0,
    "master/event_queue_messages": 0,
    "master/frameworks_active": 2,
    "master/frameworks_connected": 2,
    "master/frameworks_disconnected": 0,
    "master/frameworks_inactive": 0,
    "master/invalid_executor_to_framework_messages": 0,
    "master/invalid_framework_to_executor_messages": 0,
    "master/invalid_status_update_acknowledgements": 0,
    "master/invalid_status_updates": 0,
    "master/mem_percent": 0.241224970553592,
    "master/mem_revocable_percent": 0,
    "master/mem_revocable_total": 0,
    "master/mem_revocable_used": 0,
    "master/mem_total": 29715,
    "master/mem_used": 7168,
    "master/messages_authenticate": 0,
    "master/messages_deactivate_framework": 0,
    "master/messages_decline_offers": 2125953,
    "master/messages_executor_to_framework": 0,
    "master/messages_exited_executor": 0,
    "master/messages_framework_to_executor": 0,
    "master/messages_kill_task": 609,
    "master/messages_launch_tasks": 30777,
    "master/messages_reconcile_tasks": 38819,
    "master/messages_register_framework": 2,
    "master/messages_register_slave": 1,
    "master/messages_reregister_framework": 909,
    "master/messages_reregister_slave": 13,
    "master/messages_resource_request": 0,
    "master/messages_revive_offers": 3893,
    "master/messages_status_update": 43410,
    "master/messages_status_update_acknowledgement": 43402,
    "master/messages_suppress_offers": 0,
    "master/messages_unregister_framework": 0,
    "master/messages_unregister_slave": 0,
    "master/messages_update_slave": 14,
    "master/outstanding_offers": 0,
    "master/recovery_slave_removals": 0,
    "master/slave_registrations": 1,
    "master/slave_removals": 0,
    "master/slave_removals/reason_registered": 0,
    "master/slave_removals/reason_unhealthy": 0,
    "master/slave_removals/reason_unregistered": 0,
    "master/slave_reregistrations": 4,
    "master/slave_shutdowns_canceled": 0,
    "master/slave_shutdowns_completed": 0,
    "master/slave_shutdowns_scheduled": 0,
    "master/slaves_active": 5,
    "master/slaves_connected": 5,
    "master/slaves_disconnected": 0,
    "master/slaves_inactive": 0,
    "master/task_failed/source_slave/reason_container_launch_failed": 18737,
    "master/task_killed/source_master/reason_framework_removed": 1,
    "master/task_killed/source_slave/reason_executor_unregistered": 4,
    "master/task_lost/source_slave/reason_executor_terminated": 2,
    "master/tasks_error": 0,
    "master/tasks_failed": 19649,
    "master/tasks_finished": 8623,
    "master/tasks_killed": 605,
    "master/tasks_killing": 0,
    "master/tasks_lost": 2,
    "master/tasks_running": 7,
    "master/tasks_staging": 0,
    "master/tasks_starting": 0,
    "master/uptime_secs": 10799156.4709491,
    "master/valid_executor_to_framework_messages": 0,
    "master/valid_framework_to_executor_messages": 0,
    "master/valid_status_update_acknowledgements": 43402,
    "master/valid_status_updates": 43410,
    "registrar/queued_operations": 0,
    "registrar/registry_size_bytes": 1159,
    "registrar/state_fetch_ms": 4.617984,
    "registrar/state_store_ms": 6.88896,
    "system/cpus_total": 2,
    "system/load_15min": 0.15,
    "system/load_1min": 0.16,
    "system/load_5min": 0.17,
    "system/mem_free_bytes": 338624512,
    "system/mem_total_bytes": 7305834496
}
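One way to read the snapshot above is sketched here, using only keys that appear in it. The function name `summarize` is hypothetical, and the assumption that `master/mem_total` is reported in MB follows the Mesos metrics documentation rather than anything stated in this issue. If resources are free but the active framework count matches only the services already listed, the Spark dispatcher is probably not registered.

```python
def summarize(metrics):
    """Pull the few numbers relevant to a job stuck in QUEUED (hypothetical helper)."""
    return {
        # Unallocated resources across the cluster.
        "cpus_free": metrics["master/cpus_total"] - metrics["master/cpus_used"],
        "mem_free_mb": metrics["master/mem_total"] - metrics["master/mem_used"],
        # Frameworks currently registered and active with the master.
        "frameworks_active": metrics["master/frameworks_active"],
    }


# Values copied from the snapshot in this issue.
snapshot = {
    "master/cpus_total": 10,
    "master/cpus_used": 6,
    "master/mem_total": 29715,
    "master/mem_used": 7168,
    "master/frameworks_active": 2,
}

print(summarize(snapshot))
# {'cpus_free': 4, 'mem_free_mb': 22547, 'frameworks_active': 2}
```

Here 4 CPUs and about 22 GB of memory are unallocated, and the two active frameworks correspond to chronos and marathon in the `dcos service` output, so a lack of resources is unlikely to be what keeps the driver queued.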

Originally reported as a dcos-cli issue.

debasishg commented 7 years ago

Also facing the same issue with the same example.

ubuntu@ip-10-10-1-77:~/dcos$ dcos --version
dcoscli.version=0.4.14
dcos.version=1.8.6
dcos.commit=cfccfbf84bbba30e695ae4887b65db44ff216b1d
dcos.bootstrap-id=405172d16eaff8798d6b090dac99b51a8a9004d7

debasishg commented 7 years ago

It looks like I was able to fix this issue. In my case, I noticed that Spark was listed under Completed Frameworks instead of Active Frameworks at http://dcos_url/mesos.

I uninstalled Spark as follows:

  1. dcos package uninstall spark
  2. Remove the znode from ZK for Spark. This is the vital step which I was missing earlier (https://docs.mesosphere.com/1.8/usage/service-guides/spark/uninstall/). ZK maintains state that is not cleaned up by the uninstall and has to be removed manually.

After reinstalling Spark, the submit works and the job finishes.

ignacio-dc commented 7 years ago

I am having this exact same issue, but in AWS with a fresh install, and Spark is the only service installed.

mgummelt commented 7 years ago

Hi @ignacio-dc. Please ensure that your Spark Dispatcher is properly registered by verifying that it appears in the active frameworks listed in /mesos/state.json. If it doesn't, it's likely that you failed to fully uninstall Spark from a previous install, and must do that: https://docs.mesosphere.com/1.8/usage/service-guides/spark/uninstall/
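A minimal sketch of that check, assuming the contents of /mesos/state.json have already been fetched and parsed (fetching and authentication are not shown). It relies on the `frameworks` and `completed_frameworks` lists and the per-framework `name` and `active` fields of the Mesos state output. The helper name `framework_status` and the sample data are made up for illustration, and matching on the substring "spark" is an assumption about how the dispatcher names itself.

```python
import json


def framework_status(state, name="spark"):
    """Classify a framework as 'active', 'inactive', 'completed', or 'missing'.

    Hypothetical helper: matches any framework whose name contains `name`.
    """
    wanted = name.lower()
    # Registered frameworks, each carrying an "active" flag.
    for fw in state.get("frameworks", []):
        if wanted in fw.get("name", "").lower():
            return "active" if fw.get("active", True) else "inactive"
    # Frameworks that were torn down or never re-registered.
    for fw in state.get("completed_frameworks", []):
        if wanted in fw.get("name", "").lower():
            return "completed"
    return "missing"


# Made-up sample resembling the situation described in this thread:
# Spark shows up only under completed frameworks.
sample = json.loads("""
{
  "frameworks": [
    {"name": "marathon", "active": true},
    {"name": "chronos", "active": true}
  ],
  "completed_frameworks": [
    {"name": "spark", "active": false}
  ]
}
""")

print(framework_status(sample))  # completed
```

A result of "completed" or "missing" points to the leftover-state case described above, where a full uninstall is needed before reinstalling.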

If you continue to have problems, please open a new issue. This issue has been closed.

skonto commented 6 years ago

@ArtRand @susanxhuynh let's close this.

hantuzun commented 6 years ago

I'm experiencing the same error now but it's about the installation of Spark, not about running jobs. We may close this issue.

Edit: My new issue is "Spark package fails to install with permission errors" #208.