snowflakedb / spark-snowflake

Snowflake Data Source for Apache Spark.
http://www.snowflake.net
Apache License 2.0

[Security Issue] pem_private_key not redacted in Spark Logical Plan UI #525

Open Loudegaste opened 12 months ago

Loudegaste commented 12 months ago

Hi, we are using the Snowflake Spark connector to push data from Foundry to Snowflake. We noticed that the pem_private_key is not redacted in the query plan and is therefore leaking.

We expect the pem_private_key to be redacted, just like the 'sfURL' is in the screenshot.

We first raised the issue with the Foundry team. After review, they concluded that the issue comes from the Spark connector itself and should therefore be handled here.

Python version: 3.8.*
PySpark version: 3.2.1

Here is the code used with the Spark connector:

connection_parameters = {
    "sfURL": config["snowflake_account"],
    "sfUser": "...",
    "pem_private_key": key,
    "role": "...",
    "sfWarehouse": config["warehouse"],
    "sfDatabase": config["database"],
    "sfSchema": config["schema"],
}

inp.dataframe().write.format(SNOWFLAKE_SOURCE_NAME).options(
    **connection_parameters
).option("dbtable", f'"{raw_table_name}"').mode("overwrite").save()

[screenshot: printed_key]

Loudegaste commented 11 months ago

Hi, following up on this: further experimentation on our side has revealed that this seems to be a non-deterministic issue. Across multiple runs of exactly the same pipeline, the pem_private_key is sometimes redacted and sometimes not. So far I haven't found any factor that predicts the behaviour.

rshkv commented 11 months ago

@Loudegaste, neither Snowflake's connector nor Foundry seems to do anything additional to redact the pem_private_key. They both just rely on Spark's built-in redaction mechanism.

Spark, when rendering the query plan, just goes through SQLConf.redact which redacts based on the config values for spark.sql.redaction.options.regex and spark.redaction.regex. The former defaults to (?i)url and the latter is overridden in Foundry to include additional keywords.
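Roughly, the option redaction behaves like the sketch below. This is not Spark's actual code; the helper name, the account URL, and the key snippet are made up for illustration. The idea is that an option's value is hidden whenever the regex matches either the option key or the option value:

import re

# Default value of spark.sql.redaction.options.regex
OPTIONS_REDACTION_REGEX = re.compile(r"(?i)url")

def redact_options(options):
    # Hide the value whenever the regex matches either the option key
    # or the option value (roughly mirroring Spark's behaviour).
    return {
        key: "*********(redacted)"
        if OPTIONS_REDACTION_REGEX.search(key) or OPTIONS_REDACTION_REGEX.search(value)
        else value
        for key, value in options.items()
    }

print(redact_options({
    "sfURL": "myaccount.snowflakecomputing.com",  # key contains "url" -> redacted
    "pem_private_key": "MIIEvgIBADANBg...",       # no match -> printed in clear
}))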

I wonder if the non-determinism you see is explained by the fact that Spark, when redacting, looks for sensitive keywords not just in the config key but also in the config value. If the pem_private_key differs between runs, you may sometimes see it redacted because it happens to contain the string url in that run.
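As a rough illustration of that hypothesis (nothing Spark-specific here, just a probability check on made-up key material): a base64 blob of roughly private-key size occasionally happens to contain the substring "url", in which case the value-side match would kick in for that run.

import base64
import os
import re

# Purely illustrative: how often does a random base64 blob of roughly
# private-key size happen to contain "url" (case-insensitively)?
hits = sum(
    1
    for _ in range(1000)
    if re.search(r"(?i)url", base64.b64encode(os.urandom(1600)).decode())
)
print(f"{hits} out of 1000 random blobs contained 'url'")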

Loudegaste commented 11 months ago

Hi @rshkv, thanks for the reply. That would mean the issue needs to be raised with Spark directly? By the way, do you think spark.redaction.string.regex could provide a workaround in the meantime? We actually tried changing spark.redaction.regex to include 'pem' and that didn't solve the issue.
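Concretely, something along these lines is what we had in mind, untested on our side and assuming spark.sql.redaction.options.regex (the conf mentioned above, which governs data source option redaction) can be set on the active session, here called spark:

# Untested idea: widen the regex Spark uses to redact data source options
# (default "(?i)url") so it also matches the Snowflake private key option.
spark.conf.set("spark.sql.redaction.options.regex", "(?i)url|pem_private_key")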

Loudegaste commented 11 months ago

As @rshkv suggested, the keys do indeed get redacted when they contain "url" as a substring. This gives us an ugly workaround: adding "url" at the end of the key being used.