Open kalvinnchau opened 5 years ago
Thanks for bringing it up! This is a great topic to discuss.
Option 2 definitely works: mount a pre-created secret containing the keytab and principal into the operator pod, then let the operator add the Spark config options specifying the keytab and principal when running spark-submit. This means the same keytab and principal will be used for all SparkApplications launched by the operator that need Kerberos support.
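A minimal sketch of what the operator side of Option 2 could look like (the helper name and paths are hypothetical; the property names are the spark.kubernetes.kerberos.* ones discussed later in this thread):

```go
package main

import "fmt"

// buildKerberosArgs is a hypothetical helper: given the keytab path mounted
// from the pre-created secret and the principal, it returns the extra
// arguments the operator would append to the spark-submit invocation.
func buildKerberosArgs(keytabPath, principal string) []string {
	return []string{
		"--conf", "spark.kubernetes.kerberos.keytab=" + keytabPath,
		"--conf", "spark.kubernetes.kerberos.principal=" + principal,
	}
}

func main() {
	// Example: keytab mounted at a fixed path inside the operator pod.
	fmt.Println(buildKerberosArgs("/mnt/secrets/kerberos/spark.keytab", "spark/ops@EXAMPLE.COM"))
}
```

Since the secret is mounted once into the operator pod, every application submitted this way shares the same identity, which is exactly the limitation noted above.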
With sparkctl, there's another option: sparkctl can perform the authentication locally, create a secret storing the DT, and then use Option 3 to get the DT secret mounted into the driver and executor pods. This, however, requires some change to that PR to allow specifying an existing secret or ConfigMap (as currently implemented in the PR) that stores krb5.conf. With that change, sparkctl also creates the secret or ConfigMap carrying the krb5.conf after doing the Kerberos login.
Do you think the best approach would be to implement both options? That way users could submit jobs with keytabs for long-running (streaming-type) jobs, and use the sparkctl version where a user could use their own credentials?
Also, to your point on Option 2: would we be able to add some sort of configuration option that lets users specify which secret item/key to use for the keytab, so that different credentials could be used for different SparkApplications launched by the operator? This would allow users in different namespaces to use different accounts for authenticating.
I'm not sure if the spark.kubernetes.driver.secrets.[SecretName] secrets are mounted before start-up; if they are, we could potentially use those and have the operator check whether one named keytab (or some special keyword) exists.
It seems like it might be better to have a new, specific configuration option for the spark-operator that we look for, and mount the keytab that way.
For the sparkctl option, there's a section where they say we mount in the krb5 file:
If a user wishes to use a remote HADOOP_CONF directory, that contains the Hadoop configuration files, or a remote krb5 file, this could be achieved by mounting a pre-defined ConfigMap and mounting the volume in the desired location that you can point to via the appropriate configs.
It seems like we could mount it directly onto the /etc/krb5.conf location, or use --conf spark.kubernetes.kerberos.krb5location=/etc/krb5.conf to mount it to our specified location without changing the PR.
I think we can implement both options. For option 2, the secret storing the keytab and principal gets mounted into the operator pod. Additionally, we need a new optional field named UseKerberosAuthentication or something similar in the SparkApplicationSpec to indicate a need for Kerberos authentication when the field is set to true. This option handles cases where the same keytab/principal are used for all applications launched by the operator, e.g., applications belonging to the same authentication group.
Then there's the second option that supports per-application Kerberos authentication. We can add new command-line flags to sparkctl create to specify the keytab and principal used for authentication. Then sparkctl performs the authentication and gets a DT on behalf of the user. It also creates a secret storing the DT and adds Spark configuration properties, as in the PR, to specify the DT secret.
Seems like we could just mount it directly onto the /etc/krb5.conf location or use --conf spark.kubernetes.kerberos.krb5location=/etc/krb5.conf to mount to our specified location without changing the PR.
I misread that. You are correct. Then sparkctl can create a ConfigMap storing the krb5.conf and mount the ConfigMap onto the default location at /etc/krb5.conf.
I'm not sure if the spark.kubernetes.driver.secrets.[SecretName] are mounted before start-up; we could potentially use those and check in the operator if one named keytab exists or some special keyword. Seems like it might be better to have a new specific configuration option for the spark-operator that we look for and mount them in that way.
I think it's better to avoid relying on implicit naming scheme, which is hard to validate.
For option 2, the secret storing the keytab and principal gets mounted into the operator pod
Did you mean to say the driver pod? Otherwise I'm a bit confused how this case is handled for passing in the keytab during the spark-submit call.
Did you mean to say the driver pod? Otherwise I'm a bit confused how this case is handled for passing in the keytab during the spark-submit call.
No, I meant the operator pod. The operator, which runs spark-submit, needs to have the keytab locally so spark-submit can access it and use it for login. The k8s submission client run by spark-submit does the Kerberos login using the given keytab file and principal specified through the Spark config properties documented in that PR.
Awesome! Thanks for all the input. I'm going to take a stab at this (while following the other PR to make sure things don't diverge too wildly before it's merged). I'll start going through the code to learn how things work.
Any pointers on what to look at to wrap my head around things? Also, would you prefer one PR for both features or separate PRs for each?
No, I meant the operator pod. The operator who runs spark-submit needs to have the keytab local so spark-submit can access it and use it for login. The k8s submission client run by spark-submit does the Kerberos login using the given keytab file and principal specified through the Spark config properties documented in that PR.
Ahh, yeah that makes sense, confused myself a bit there.
Thanks for being willing to take this on! I think we can use separate PRs for the two options. For option 1, i.e., mounting a secret storing the keytab into the operator pod, we probably need to add two more command-line flags to the operator for specifying the path to the mounted keytab file and the principal, in addition to mounting the secret into the operator container. The command-line flags can be added in main.go. The values of the flags then need to be passed into the controller for SparkApplication so it can add the Spark config properties when it sees a SparkApplication with the field UseKerberosAuthentication set to true. The field needs to be added to types.go in both versions v1alpha1 and v1beta1 under pkg/api.
For option 2, i.e., using sparkctl to do the Kerberos login, the flags for specifying the keytab file and principal need to be added to sparkctl/cmd/create.go. There's one issue with this option, however: I don't know if it's possible to do the login and get the DT in Go. This needs some investigation. One solution is to write a simple Java or Scala utility that does the login using the Hadoop FS API, and then have sparkctl create call that utility.
So given that option 2 may be blocked by that issue, we probably can go for option 1 first.
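If the Java/Scala-utility workaround were pursued, the sparkctl side might just shell out to it; a rough sketch (the utility jar name and its flags are entirely hypothetical):

```go
package main

import (
	"fmt"
	"os/exec"
)

// loginCmd builds the command sparkctl could run to perform the Kerberos
// login out of process. "kerberos-login-util.jar" is a placeholder for the
// hypothetical helper that would use the Hadoop FS API to obtain the DT
// and write it to tokenOut.
func loginCmd(keytab, principal, tokenOut string) *exec.Cmd {
	return exec.Command("java", "-jar", "kerberos-login-util.jar",
		"--keytab", keytab, "--principal", principal, "--out", tokenOut)
}

func main() {
	cmd := loginCmd("user.keytab", "user@EXAMPLE.COM", "/tmp/dt.bin")
	fmt.Println(cmd.Args)
}
```

sparkctl would then read the DT file, store it in a secret, and add the Spark config properties from the PR, as described above.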
@kalvinnchau are you still interested in working on this?
We could also support a proxy user. @kalvinnchau @liyinan926 are you still working on this? I could try to help with the implementation.
Any news on this?
Hi,
@kalvinnchau @liyinan926 We are using spark-operator (v1beta1-0.9.0-2.4.0) and we have tried to use Kerberos authentication for SparkApplications launched by the operator. A high-level approach is as follows: the Kerberos principal and keytab, in the form of a Kubernetes generic secret, and krb5.conf, in the form of a Kubernetes ConfigMap, are passed through the SparkApplication YAML configuration. .spec.hadoopConfigMap is used for the Hadoop site XMLs.
On submission of a Spark job, the keytab, krb5.conf and Hadoop site XMLs are copied to the operator pod, and spark-submit is triggered by exporting HADOOP_CONF_DIR (inside the operator pod) and appending the Kerberos-related Spark config properties (spark.kubernetes.kerberos.principal, spark.kubernetes.kerberos.keytab, spark.kubernetes.kerberos.krb5location) to the spark-submit command. Also, upon completion or failure of the SparkApplication, the keytab, krb5.conf and Hadoop site XMLs are deleted from the operator pod.
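The submission step described above could be sketched as follows (paths and principal are illustrative; the property names are the ones quoted in the comment):

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// submitCmd builds a spark-submit invocation run inside the operator pod:
// HADOOP_CONF_DIR is exported into the child environment, and the three
// Kerberos-related Spark config properties are appended.
func submitCmd(hadoopConfDir, principal, keytab, krb5 string) *exec.Cmd {
	cmd := exec.Command("spark-submit",
		"--conf", "spark.kubernetes.kerberos.principal="+principal,
		"--conf", "spark.kubernetes.kerberos.keytab="+keytab,
		"--conf", "spark.kubernetes.kerberos.krb5location="+krb5,
	)
	cmd.Env = append(os.Environ(), "HADOOP_CONF_DIR="+hadoopConfDir)
	return cmd
}

func main() {
	c := submitCmd("/opt/hadoop/conf", "etl@EXAMPLE.COM", "/tmp/etl.keytab", "/tmp/krb5.conf")
	fmt.Println(c.Args)
}
```

After the application completes or fails, the copied keytab, krb5.conf and site XMLs would be removed from the operator pod, as the comment notes.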
Thanks, Breeta
Hi @liyinan926 @kalvinnchau @skonto
This is the PR for Kerberos with spark-operator, where we have added Kerberos support for SparkApplications based on the approach shared in the above comment, where different Spark jobs can interact with the same or different Hadoop clusters and KDCs.
The design follows Apache Spark 3.0, where either HADOOP_CONF_DIR must be set (see the PR to export HADOOP_CONF_DIR in the operator pod) or spark.kubernetes.hadoop.configMapName (newly introduced in Apache Spark) must be specified.
spark.kerberos.keytab, spark.kerberos.principal, and spark.kubernetes.kerberos.krb5.path are the Kerberos parameters used for spark-submit.
spark.kubernetes.kerberos.krb5.configMapName is another new parameter introduced by Apache Spark to specify a pre-created krb5 ConfigMap as an alternative to spark.kubernetes.kerberos.krb5.path.
The above PR supports spark-submit where:
1) A local keytab, principal and krb5.conf are used, and HADOOP_CONF_DIR is set.
2) A local keytab and principal are used together with spark.kubernetes.hadoop.configMapName and spark.kubernetes.kerberos.krb5.configMapName.
Please refer to examples 2 and 3(b) at https://github.com/apache/spark/blob/master/docs/security.md under the "Secure Interaction with Kubernetes" section.
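The two modes can be contrasted in a small sketch (property names are the Spark 3.0 ones listed above; the values and ConfigMap names are illustrative):

```go
package main

import "fmt"

// kerberosConf returns the Spark properties for the two submission modes
// supported by the PR: useConfigMaps=false corresponds to mode 1 (local
// krb5.conf path, HADOOP_CONF_DIR exported), useConfigMaps=true to mode 2
// (pre-created ConfigMaps for the Hadoop conf and krb5.conf).
func kerberosConf(useConfigMaps bool) map[string]string {
	conf := map[string]string{
		"spark.kerberos.principal": "etl@EXAMPLE.COM",
		"spark.kerberos.keytab":    "/tmp/etl.keytab",
	}
	if useConfigMaps {
		conf["spark.kubernetes.hadoop.configMapName"] = "hadoop-conf"
		conf["spark.kubernetes.kerberos.krb5.configMapName"] = "krb5-conf"
	} else {
		conf["spark.kubernetes.kerberos.krb5.path"] = "/etc/krb5.conf"
	}
	return conf
}

func main() {
	fmt.Println(kerberosConf(false))
	fmt.Println(kerberosConf(true))
}
```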
Please have a look at both the PRs. Any inputs or suggestions are highly welcome :)
Kind Regards, Breeta
Hi @liyinan926 @kalvinnchau @skonto
Please let us know if there are any suggestions.
Kind Regards, Breeta
@liyinan926 @breetasinha1109 I can see this PR is merged, so I hope Kerberos support is enabled in the latest version. I'm a little confused by the Open state of this ticket.
Hi @mirajgodha
The PR is merged on a fork of GoogleCloudPlatform/spark-on-k8s-operator repo.
PR Links below:- https://github.com/nokia/spark-on-k8s-operator/pull/7 https://github.com/nokia/spark-on-k8s-operator/pull/4
Thanks @breetasinha1109, @liyinan926. Any reason we are not merging this PR? Is there a way we can support Kerberos right now, or do we have to use the Nokia fork for Kerberos?
@mirajgodha we cannot contribute the changes implemented in our fork due to requirements in the Google CLA. The complete implementation of Kerberos support is open sourced on a separate branch, so anyone who thinks it is important enough can contribute it to the upstream repo.
@CsatariGergely @breetasinha1109 I have merged the kerberos support branch with latest beta2 changes and using it for production in Comcast. Is it fine with you guys if we submit a pull request to get it merged upstream?
@chrevanthreddy yes it is fine for us. Please go ahead.
@liyinan926 @breetasinha1109 I want to access an HBase table with Kerberos authentication; can you show me an example? Thanks a lot!
Any updates for this, we'd love to use it as well :)
is kerberos authentication planned to be added any time soon?
waiting for the kerberos support with spark 3.2+
Hello @liyinan926,
Do you have any plans regarding this PR?
Any updates about this PR?
Any updates about this PR?
Any updates?
Is there currently a plan to submit this branch to the community?
Once https://github.com/apache/spark/pull/21669/files is pulled into mainline Spark, there will be Kerberos support for HDFS clusters that use Kerberos for authentication.
I wasn't able to find anything about the approach for Kerberos with the spark-operator, so I'm hoping to get a discussion started and maybe take a stab at implementation :)
I was thinking of using the approach where a local keytab/principal is passed in (Option 2) to do the authentication. The operator could mount the secret that contains the keytab and principal, and pass them into the command-line start-up command.