snowplow / dataflow-runner

Run templatable playbooks of Hadoop/Spark/et al jobs on Amazon EMR
http://snowplowanalytics.com

Application & Configuration not sent with cluster.json #65

Closed danrods closed 3 years ago

danrods commented 3 years ago

Hi there, I'm new to Dataflow-runner and having a lot of trouble getting started. I'm using the Mac version of dataflow-runner to start my EMR cluster in AWS, using the simple cluster.json at the bottom of this post.

I run the command

./dataflow-runner run-transient --emr-config=cluster.json --emr-playbook=playbook.json

and the playbook then tries to submit a Spark job using spark-submit, but the step keeps failing with the error

Cannot run program "spark-submit" (in directory "."): error=2, No such file or directory

I would normally ask this somewhere like Stack Overflow, but looking in CloudTrail I see that Dataflow-runner is sending the request below:

"requestParameters": {
    "name": "dataflow-runner - snowflake transformer",
    "logUri": "s3://logs/data-snowplow-emr-etl-runner/",
    "releaseLabel": "emr-6.1.0",
    "instances": {
        "instanceGroups": [
            {
                "instanceRole": "MASTER",
                "instanceType": "m4.large",
                "instanceCount": 1
            },
            {
                "instanceRole": "CORE",
                "instanceType": "r4.xlarge",
                "instanceCount": 1,
                "ebsConfiguration": {
                    "ebsOptimized": false
                }
            }
        ],
        "ec2KeyName": "test",
        "placement": {
            "availabilityZone": ""
        },
        "keepJobFlowAliveWhenNoSteps": true,
        "terminationProtected": false,
        "ec2SubnetId": "test"
    },
    "visibleToAllUsers": true,
    "jobFlowRole": "x",
    "serviceRole": "x"
},

I can verify in EMR that none of the configurations from cluster.json are applied and that the Spark application is not installed. The file seems to be a valid cluster configuration, but the configurations and applications sections are not being sent to EMR. Did I set this up improperly, or is this a bug?

Thanks in advance

cluster.json
{
  "schema":"iglu:com.snowplowanalytics.dataflowrunner/ClusterConfig/avro/1-1-0",
  "data":{
    "name":"dataflow-runner - snowflake transformer",
    "logUri":"s3://logs/data-snowplow-emr-etl-runner/",
    "region":"us-west-2",
    "credentials":{
      "accessKeyId":"xxxxxxxx",
      "secretAccessKey":"xxxxxx"
    },
    "roles":{
      "jobflow":"x",
      "service":"x"
    },
    "ec2":{
      "amiVersion":"6.1.0",
      "keyName":"test",
      "location":{
        "vpc":{
          "subnetId": "test"
        }
      },
      "instances":{
        "master":{
          "type":"m4.large"
        },
        "core":{
          "type":"r4.xlarge",
          "count":1,
          "ebsConfiguration":{
            "ebs_optimized": false,
            "ebsBlockDeviceConfigs": [
              {
                "volumesPerInstance": 1
              }
            ]
          }
        },
        "task":{
          "type":"m4.large",
          "count":0,
          "bid":"0.015"
        }
      }
    },
    "tags":[ ],
    "bootstrapActionConfigs":[ ],
    "configurations":[
      {
        "classification":"core-site",
        "properties":{
          "Io.file.buffer.size":"65536"
        }
      },
      {
        "classification":"mapred-site",
        "properties":{
          "Mapreduce.user.classpath.first":"true"
        }
      },
      {
        "classification":"yarn-site",
        "properties":{
          "yarn.resourcemanager.am.max-attempts":"1"
        }
      },
      {
        "classification":"spark",
        "properties":{
          "maximizeResourceAllocation":"true"
        }
      }
    ],
    "applications":[ "Hadoop", "Spark" ]
  }
}
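
The observation above (sections present in cluster.json but absent from the request seen in CloudTrail) can be checked mechanically. The following is only a minimal sketch in Python: the helper `missing_from_request` and the key mapping `CONFIG_TO_REQUEST_KEY` are hypothetical, and the request-side key names are an assumption about how CloudTrail renders the EMR RunJobFlow parameters. The two JSON documents are trimmed copies of the ones quoted in this post.

```python
import json

# Assumed mapping from cluster.json keys to the camelCase names CloudTrail
# would show for the EMR RunJobFlow request. These names are an assumption.
CONFIG_TO_REQUEST_KEY = {
    "applications": "applications",
    "configurations": "configurations",
    "tags": "tags",
    "bootstrapActionConfigs": "bootstrapActions",
}


def missing_from_request(cluster_config, request_parameters):
    """List cluster.json sections that are non-empty but absent from the request."""
    data = cluster_config["data"]
    return sorted(
        config_key
        for config_key, request_key in CONFIG_TO_REQUEST_KEY.items()
        if data.get(config_key) and request_key not in request_parameters
    )


# Trimmed copies of the two documents quoted in this post.
cluster_config = json.loads("""
{
  "data": {
    "name": "dataflow-runner - snowflake transformer",
    "tags": [],
    "bootstrapActionConfigs": [],
    "configurations": [
      {"classification": "spark", "properties": {"maximizeResourceAllocation": "true"}}
    ],
    "applications": ["Hadoop", "Spark"]
  }
}
""")

request_parameters = json.loads("""
{
  "name": "dataflow-runner - snowflake transformer",
  "releaseLabel": "emr-6.1.0",
  "visibleToAllUsers": true,
  "jobFlowRole": "x",
  "serviceRole": "x"
}
""")

print(missing_from_request(cluster_config, request_parameters))
# prints ['applications', 'configurations']
```

Empty sections such as "tags" and "bootstrapActionConfigs" are skipped on purpose, since an empty list would not be expected to show up in the request either.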

chuwy commented 3 years ago

Hi @danrods. Sorry this went unanswered for so long, but we use GitHub for bug reports and feature requests only. Please consider posting this on our support forums.