Open priyamgupta01 opened 4 years ago
Have you solved the problem?
No.
On Mon, Mar 9, 2020, 4:54 PM Kriszhou1 notifications@github.com wrote:
Have you solved the problem?
I think that I can answer this:
Skein uses the environment variables from Hadoop. So you need to set this:
export HADOOP_CONF_DIR=/my/hadoop_conf
I haven't tried running it like this myself, but depending on the error you get afterwards, you might also need to add the Hadoop jar libraries. For now, just post the error you get afterwards here.
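Before rerunning, it can help to confirm what the directory named in HADOOP_CONF_DIR actually contains. The following is a minimal sketch, assuming the directory holds a standard yarn-site.xml; the helper name `resourcemanager_settings` is hypothetical and not part of skein or Hadoop.

```python
import os
import xml.etree.ElementTree as ET


def resourcemanager_settings(conf_dir):
    """Return the yarn.resourcemanager.* properties found in yarn-site.xml.

    Hypothetical helper for illustration; not part of skein or Hadoop.
    """
    path = os.path.join(conf_dir, "yarn-site.xml")
    root = ET.parse(path).getroot()
    settings = {}
    # Each setting is stored as <property><name>..</name><value>..</value></property>.
    for prop in root.findall("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        if name.startswith("yarn.resourcemanager."):
            settings[name] = value
    return settings
```

Calling `resourcemanager_settings(os.environ["HADOOP_CONF_DIR"])` lists the ResourceManager settings the client will pick up. If the result is empty, the Hadoop client falls back to its defaults, which point at 0.0.0.0.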
We had the same error, while normal MapReduce jobs work. Each time we submit a skein job, it gets stuck in the ACCEPTED state, and after a while the job fails. We do have HADOOP_HOME and HADOOP_CONF_DIR set. The logs show:
licy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
20/05/01 13:26:07 INFO retry.RetryInvocationHandler: java.net.ConnectException: Your endpoint configuration is wrong; For more details see: http://wiki.apache.org/hadoop/UnsetHostnameOrPort, while invoking ApplicationMasterProtocolPBClientImpl.registerApplicationMaster over null after 16 failover attempts. Trying to failover after sleeping for 20759ms.
20/05/01 13:26:29 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8030. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)"
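The 0.0.0.0:8030 endpoint in the log above is what Hadoop's default configuration yields when no ResourceManager host is configured: the default ResourceManager hostname is 0.0.0.0 and 8030 is the default scheduler port. As a rough sketch, a log can be scanned for such wildcard endpoints to see which setting is probably missing. The names `WILDCARD`, `DEFAULT_RM_PORTS` and `unconfigured_endpoints` are hypothetical; the port-to-property mapping follows the Hadoop defaults.

```python
import re

# Matches "0.0.0.0:<port>" but not longer addresses such as "10.0.0.0:8032".
WILDCARD = re.compile(r"(?<![\d.])0\.0\.0\.0:(\d+)")

# Default ResourceManager ports and the properties that set them.
DEFAULT_RM_PORTS = {
    8030: "yarn.resourcemanager.scheduler.address",
    8031: "yarn.resourcemanager.resource-tracker.address",
    8032: "yarn.resourcemanager.address",
}


def unconfigured_endpoints(log_text):
    """Map each port contacted on 0.0.0.0 to the property that likely needs setting."""
    found = {}
    for port in WILDCARD.findall(log_text):
        port = int(port)
        found[port] = DEFAULT_RM_PORTS.get(port, "unknown")
    return found
```

A hit on port 8030 suggests that the process talking to the scheduler, here the application master running inside the container, did not see a yarn-site.xml with the ResourceManager address.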
I tried using IPython to find out why:
In [20]: app = client.submit_and_connect(spec)
20/05/02 08:42:58 INFO skein.Driver: Uploading application resources to hdfs://node14:9000/user/root/.skein/application_1588338572036_0007
20/05/02 08:42:58 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
20/05/02 08:42:58 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
20/05/02 08:42:58 DEBUG skein.Driver: Writing script for service 'hello' to hdfs://node14:9000/user/root/.skein/application_1588338572036_0007/hello.sh
20/05/02 08:42:58 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
20/05/02 08:42:58 DEBUG skein.Driver: Uploading file:/opt/anaconda3/lib/python3.7/site-packages/skein/java/skein.jar to hdfs://node14:9000/user/root/.skein/application_1588338572036_0007/skein.jar
20/05/02 08:42:58 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
20/05/02 08:42:59 DEBUG skein.Driver: Writing application specification to hdfs://node14:9000/user/root/.skein/application_1588338572036_0007/.skein.proto
20/05/02 08:42:59 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
20/05/02 08:42:59 INFO skein.Driver: Submitting application...
20/05/02 08:42:59 INFO impl.YarnClientImpl: Submitted application application_1588338572036_0007
20/05/02 08:42:59 DEBUG skein.Driver: New watcher callback requested for application application_1588338572036_0007
20/05/02 08:42:59 DEBUG skein.Driver: No watcher exists for application_1588338572036_0007, creating one
Then it stops here.
After compiling skein from source and providing a minimal YARN configuration (the 8031, 8032, ... addresses), everything works with Hadoop 3.2.1.
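The exact configuration used there is not shown. As an illustration only, a minimal yarn-site.xml with explicit ResourceManager addresses could be generated like this. The function name `minimal_yarn_site` and the host `rm.example.com` are placeholders, and the ports are the Hadoop defaults, which a given cluster may override.

```python
import xml.etree.ElementTree as ET

# ResourceManager address properties with their default ports.
RM_PORTS = {
    "yarn.resourcemanager.scheduler.address": 8030,
    "yarn.resourcemanager.resource-tracker.address": 8031,
    "yarn.resourcemanager.address": 8032,
}


def minimal_yarn_site(rm_host):
    """Build yarn-site.xml text that points the RM addresses at rm_host."""
    root = ET.Element("configuration")
    for name, port in RM_PORTS.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = f"{rm_host}:{port}"
    if hasattr(ET, "indent"):  # pretty-printing is available on Python 3.9+
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")
```

For example, `print(minimal_yarn_site("rm.example.com"))` produces text that can be saved as yarn-site.xml in the directory that HADOOP_CONF_DIR points to.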
Where do we need to provide the YARN configuration in skein? By default, when I submit a skein application using "skein application submit sample.yaml", it tries to connect to 0.0.0.0:8032.
log: INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
As my YARN cluster is remote, how can I provide its details to skein?
sample.yaml:
name: hello_world
queue: default

master:
  resources:
    vcores: 1
    memory: 512 MiB
  script: |
    sleep 60
    echo "Hello World!"