nightscape / spark-excel

A Spark plugin for reading and writing Excel files
Apache License 2.0

How to use spark-excel library in Python #134

Closed Zulpitch closed 5 years ago

Zulpitch commented 5 years ago

Hello Everybody,

I'm using Azure Databricks with Python, and I'm trying to use your library "Spark-Excel".

I have added your library to my cluster: [screenshot of the installed library]

Problem: I don't know what import I have to write to use it. I tried several things:

import spark-excel
import spark.excel
import spark_excel
import spark_excel_2_11_0_12_0
import com.crealytics.spark.excel

Nothing worked, and I can't use the methods linked to this library.

I haven't found a Python example that would give me the line of code I need.

Hope you can help me with this question.

Thank you a lot!

nightscape commented 5 years ago

I guess one would have to write Python code that wraps the Scala code. Here, for example, is MLlib's Python code: https://github.com/apache/spark/tree/master/python/pyspark/mllib I don't have any time to invest in this myself, but I'd review a PR if you or somebody else writes one.
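
For illustration, here is a rough, untested sketch of the wrapper pattern pyspark.mllib uses: a thin Python class forwards calls to its Scala counterpart through the Py4J gateway that PySpark exposes. The class path in the final comment is purely hypothetical.

# Sketch of the pyspark.mllib wrapper pattern (untested).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# PySpark exposes the JVM through a Py4J gateway; MLlib's Python wrappers
# hold a reference to a JVM object and translate arguments back and forth.
jvm = spark.sparkContext._jvm

# A spark-excel wrapper would instantiate the Scala class and delegate to it:
# relation = jvm.com.example.ExcelRelation(...)  # hypothetical class path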

Zulpitch commented 5 years ago

First, thank you for your quick answer.

So I guess we can't use Python with your library.

I'm going to test it in a Scala cell in Databricks.

I'd love to have the skills needed to write a PR for it on your library :(

Do you think I can find any other solution to import an Excel file, without using your library or the pandas.read_excel() method?

nightscape commented 5 years ago

You could have a look at https://github.com/ZuInnoTe/hadoopoffice. It's more general than spark-excel, but I have no idea how to use it with Python.
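
For anyone landing here later: hadoopoffice's companion project spark-hadoopoffice-ds registers a Spark data source, so a Python call analogous to the spark-excel one shown further down might look like the untested sketch below. The format string comes from that project's documentation; everything else is an assumption to verify there.

# Untested sketch, assuming the spark-hadoopoffice-ds package is installed
# on the cluster; the format string follows that project's documentation.
df = spark.read.format("org.zuinnote.spark.office.excel") \
    .load("/path/to/file.xlsx")  # placeholder path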

Zulpitch commented 5 years ago

OK, I'm going to check it out! I think we can mark this thread as resolved, since that option is the only one I have left. Thank you again for your quick answers.

mcgoddard commented 5 years ago

I'm aware this has already been closed, but for future reference: you can at least read from Python (I haven't tried the writer yet) with something like:

path = "a path"
df = spark.read.format("com.crealytics.spark.excel").option("useHeader", True).load(path)

No import is required; simply installing the library in Databricks so that Spark can see it is enough!
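
To flesh that out a little, here is a hedged sketch of a read and a write with a few of the options the spark-excel README documented around that time (useHeader, inferSchema, dataAddress); option names vary between versions, so check the README for the release you have installed. The paths are placeholders.

# Reading with a few common options (names per the spark-excel README;
# availability depends on the installed version).
df = spark.read.format("com.crealytics.spark.excel") \
    .option("useHeader", True) \
    .option("inferSchema", True) \
    .option("dataAddress", "'Sheet1'!A1") \
    .load("/mnt/data/input.xlsx")  # placeholder path

# Writing goes through the same data source name.
df.write.format("com.crealytics.spark.excel") \
    .option("useHeader", True) \
    .option("dataAddress", "'Sheet1'!A1") \
    .mode("overwrite") \
    .save("/mnt/data/output.xlsx")  # placeholder path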

Zulpitch commented 5 years ago

Really? Going to try that this INSTANT!

Zulpitch commented 5 years ago

Just tried.

testDf = spark.read.format("com.crealytics.spark.excel").option("useHeader", True).load(fileTest)
display(testDf)

gives me the following exception:

No value for dfs.adls.oauth2.access.token.provider found in conf file.

So I guess I don't have the rights to access the file, but if I manage to create the config to access it, I guess it should work?

By the way, if I do:

testDf = spark.read.csv(fileTest)
display(testDf)

It can read my file from my Azure Data Lake (though not very well, since it's not actually a CSV), and I don't need any additional rights, because I've already done the configuration (which I can't show you, for obvious security reasons).

Strange, isn't it?

mcgoddard commented 5 years ago

Strange that it works for one and not the other; I can't comment beyond that, though, as I'm on AWS rather than Azure.

Zulpitch commented 5 years ago

Azure Databricks probably doesn't have the same engine / way of working, so it doesn't get the same information from the library, I don't know... Or maybe it works and I just need additional configuration. Thanks a lot for your answer though! It's always good to be helped :)

Zulpitch commented 5 years ago

Problem resolved!

The problem was with the credentials. I was doing it like this:

# Configuration to read from Azure Data Lake Storage
spark.conf.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential")
spark.conf.set("dfs.adls.oauth2.client.id", "****")
spark.conf.set("dfs.adls.oauth2.credential", "****")
spark.conf.set("dfs.adls.oauth2.refresh.url", "https://login.microsoftonline.com/****/oauth2/token")

# Build the path to the folder in Azure Data Lake Storage
adlsAccountName = "****"
projectPath = "****"
projectFullPath = "adl://" + adlsAccountName + ".azuredatalakestore.net/" + projectPath

It worked for spark.read.csv() and calls like that.

How I made it work:

# Mount point
udbRoot = "****"

configs = {
   "dfs.adls.oauth2.access.token.provider.type": "ClientCredential",
   "dfs.adls.oauth2.client.id": "****",
   "dfs.adls.oauth2.credential": "****",
   "dfs.adls.oauth2.refresh.url": "https://login.microsoftonline.com/****/oauth2/token"
}

# unmount 
#dbutils.fs.unmount(udbRoot)

# Mounting
dbutils.fs.mount(
  source = "adl://****",
  mount_point = udbRoot,
  extra_configs = configs
)
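
(For completeness: after mounting, files are addressed through the mount point rather than the adl:// URL, so fileTest is built along these lines; the exact path below is a hypothetical placeholder.)

# Hypothetical placeholder; the real path points at the Excel file under the mount
fileTest = udbRoot + "/path/to/file.xlsx"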

This way it worked, and the line of code you gave me

testDf = spark.read.format("com.crealytics.spark.excel").option("useHeader", True).load(fileTest)
display(testDf)

worked successfully.

Thank you again. We can close the thread for good this time :)

jonnio commented 2 years ago

Did you try from xyz import abc? I've seen that work in cases where the names have weird characters, etc.

shyammk commented 1 year ago

I see that a feature to get the sheet names dynamically from an Excel file was added to this library. Is there a way to use that feature in Python (on Azure Databricks)?
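
No answer was recorded in this thread, but community snippets in other spark-excel issues list sheet names by calling the Scala WorkbookReader through PySpark's JVM gateway. The sketch below is untested; WorkbookReader.apply, PythonUtils.toScalaMap, and sheetNames are assumptions taken from those snippets, so verify them against the installed version.

# Untested sketch: list sheet names via the JVM gateway (all names here are
# assumptions from community snippets and may differ between versions).
jvm = spark.sparkContext._jvm
reader = jvm.com.crealytics.spark.excel.WorkbookReader.apply(
    jvm.PythonUtils.toScalaMap({"path": "/mnt/data/input.xlsx"}),  # placeholder
    spark.sparkContext._jsc.hadoopConfiguration()
)
sheet_names = reader.sheetNames()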