globalmentor / hadoop-bare-naked-local-fs

GlobalMentor Hadoop local FileSystem implementation directly accessing the Java API without Winutils.

Add documentation for PySpark #2

Open paulbauriegel opened 1 year ago

paulbauriegel commented 1 year ago

Thank you for writing this. Loading the library in PySpark actually took some time to figure out. Maybe it makes sense to add an example to the README? This is the way I'm using it now:

# Use Py4J reflection to obtain java.lang.Class objects, then register
# BareLocalFileSystem as the implementation for the file: scheme.
ReflectionUtil = sc._gateway.jvm.py4j.reflection.ReflectionUtil
sc._jsc.hadoopConfiguration().setClass("fs.file.impl",
    ReflectionUtil.classForName("com.globalmentor.apache.hadoop.fs.BareLocalFileSystem"),
    ReflectionUtil.classForName("org.apache.hadoop.fs.FileSystem"))
wobu commented 1 year ago

I couldn't get it running and had to make some minor changes; I guess it could be due to the different Spark version I am using. I am using Spark 3.2.2.

I tried with:

spark = builder.getOrCreate()

ReflectionUtil = spark._sc._jvm.py4j.reflection.ReflectionUtil
spark._sc._jsc.hadoopConfiguration().setClass("fs.file.impl",
                                               ReflectionUtil.classForName("com.globalmentor.apache.hadoop.fs.BareLocalFileSystem"),
                                               ReflectionUtil.classForName("org.apache.hadoop.fs.FileSystem"))

but the code was still failing due to native code access :(

garretwilson commented 1 year ago

but the code was still failing due to native code access :(

I'm not familiar at all with PySpark, but I'll see what I can do to help. This could be the version of Spark, or the way you're using Spark could be causing a different code path to be invoked.

Please provide a stack trace that shows which native code is being invoked. It may be easy to add the necessary changes to provide native Java access for those other methods. It was always expected that this initial version might not cover all the methods for all code paths, but I'll need to know which code paths are involved to know which parts need to be added. Thanks.
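
If it helps, one way to surface the Java-side stack trace from PySpark is to catch the Py4J error around whatever call fails; the path below is just a placeholder:

from py4j.protocol import Py4JJavaError

try:
    spark.read.csv("C:/tmp/example.csv").show()  # placeholder for the failing operation
except Py4JJavaError as e:
    print(e)  # the Py4J error message includes the full Java stack trace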

wobu commented 1 year ago
import pyspark
import sys

def get_or_create_test_spark_session():
    """ Get or create a spark session
    """
    builder = pyspark.sql.SparkSession.builder \
        .appName("Tests") \
        .master("local[*]") \
        .config("spark.sql.execution.arrow.pyspark.enabled", "true") \
        .config("spark.ui.enabled", "false") \
        .config("spark.driver.host", "127.0.0.1")

    if sys.platform.startswith('win'):
        # On Windows, pull in the Bare Naked Local FileSystem package and register
        # it as the implementation for the file: scheme, avoiding the Winutils-backed default.
        builder = builder \
            .config("spark.jars.packages", "com.globalmentor:hadoop-bare-naked-local-fs:0.1.0")

        spark = builder.getOrCreate()

        ReflectionUtil = spark._sc._jvm.py4j.reflection.ReflectionUtil
        spark._sc._jsc.hadoopConfiguration().setClass("fs.file.impl",
                                                      ReflectionUtil.classForName("com.globalmentor.apache.hadoop.fs.BareLocalFileSystem"),
                                                      ReflectionUtil.classForName("org.apache.hadoop.fs.FileSystem"))
    else:
        spark = builder.getOrCreate()
    return spark

pyspark_stacktrace.txt

garretwilson commented 1 year ago

Thanks for the stack trace. I'll take a look, but it may be mid next week before I can get to it. Feel free to ping me in a few days if it slips my mind.

paulbauriegel commented 1 year ago

I would like to help, but I cannot reproduce the issue yet. Just for reference, what I did on Windows 10 with OpenJDK 11 is:

  1. downloading Spark 3.2.3 pre-built with Hadoop 3.2
  2. then creating an empty winutils.txt in the bin folder and renaming it to winutils.exe, because otherwise Hadoop complains about the missing file
  3. setting HADOOP_HOME and SPARK_HOME to the Spark root folder I downloaded, and adding spark-3.2.3-bin-hadoop3.2/bin to PATH
  4. running pyspark --packages com.globalmentor:hadoop-bare-naked-local-fs:0.1.0
  5. then just reading some sample files via the code I shared
  6. just using @wobu's code as a script also worked

It also worked on my Intel Mac.
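
For anyone scripting steps 2 and 3, this is roughly what they look like in Python (the install path is just an example):

import os
from pathlib import Path

# Example location of the unpacked Spark distribution; adjust as needed.
spark_home = Path(r"C:\spark\spark-3.2.3-bin-hadoop3.2")

# Step 2: create an empty winutils.exe so Hadoop stops complaining about the missing file.
(spark_home / "bin" / "winutils.exe").touch()

# Step 3: point HADOOP_HOME and SPARK_HOME at the Spark root and put bin on PATH.
os.environ["SPARK_HOME"] = str(spark_home)
os.environ["HADOOP_HOME"] = str(spark_home)
os.environ["PATH"] = str(spark_home / "bin") + os.pathsep + os.environ["PATH"]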

wobu commented 1 year ago

I am trying to run within a unittest/pytest setup; I don't have Spark or the Spark CLI manually installed. I am only using pyspark 3.2.3, installed within a venv.

I am using an AMD machine. The Scala variant is working fine.
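
For reference, the helper above is wired into the tests with a fixture roughly like this (names are illustrative, not the exact test code):

import pytest

# assumes get_or_create_test_spark_session from the snippet above is importable
@pytest.fixture(scope="session")
def spark():
    session = get_or_create_test_spark_session()
    yield session
    session.stop()

def test_read_csv(spark, tmp_path):
    path = tmp_path / "example.csv"
    path.write_text("a,b\n1,2\n")
    assert spark.read.option("header", True).csv(str(path)).count() == 1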

Admolly commented 1 year ago

Agreed. Some documentation for how to use this with PySpark would indeed be helpful for those getting started on Windows. Thank you @paulbauriegel for opening the issue.

garretwilson commented 1 year ago

Hi, everyone. I'd like to make sure this issue is addressed. Help me get caught up on the status. Bear with me a bit as I haven't touched Python in a while and I've certainly never used PySpark.

This ticket seems to be primarily about updating the documentation for use with PySpark, but I also see some notes about someone not being able to get it to work on PySpark at all. The stack trace in this comment didn't show any references to the Bare Naked Local FileSystem at all, so I'm not sure PySpark is even using the correct FileSystem implementation in that case.

Could someone verify whether they are or are not getting this to work with PySpark, and explain how they did it? Thanks.

paulbauriegel commented 1 year ago

@garretwilson It's working fine with PySpark for me. How to do it is described in my opening comment; if something is unclear, I expand on it a bit in the follow-up comment. I primarily opened this issue so that others can find out how to use your library with PySpark without much research. You can add it to the README as a comment or just close the issue. Either way, I can confirm that it works on Mac and Windows with PySpark without any issue (I only tested local mode, not a cluster setup).

snoe925 commented 11 months ago

I have managed to get this configuration to work for me on Windows for PySpark. By using Hadoop configuration files the syntax is a bit easier; that way the PySpark setup looks like the normal examples on the internet.

There are problems: you will be limited to CSV formats. Parquet, for example, will not work, nor will anything else that uses Hadoop classes for I/O.

No Hadoop is installed on Windows and HADOOP_HOME is not set, so there are all the documented warnings.

TL;DR: in $SPARK_HOME/jars create two Hadoop configuration files: core-site.xml and hdfs-site.xml.

core-site.xml contains just an empty configuration element (the XML flavor of empty). hdfs-site.xml sets fs.default.name to the bare local FileSystem class:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>com.globalmentor.apache.hadoop.fs.BareLocalFileSystem</value>
  </property>
</configuration>

I am running a Spark install pre-built for Apache Hadoop 3.3. I have no local Hadoop install, and HADOOP_HOME is not set.

I'll look more into Parquet support.
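
With those two files in place the PySpark side needs no special wiring; something along these lines works, where the file path is only an example:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("local-test")
    .master("local[*]")
    .getOrCreate()
)

# No setClass/ReflectionUtil calls needed; the FileSystem comes from the XML config files.
df = spark.read.option("header", True).csv("C:/data/example.csv")
df.show()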

kdionyso commented 10 months ago

Hi all, I know I am a bit late to the party. Have people managed to create tables using CSV? I can read files fine. I can also write empty tables, but the moment I try to populate them with data I get the dreaded

java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z

error. Any ideas anyone?

As an example I can do the following:

    spark.sql(
        """
            CREATE TABLE IF NOT EXISTS TEST200 (
                MODEL_NAME STRING,
                MODEL_STAGE STRING
            ) USING CSV
        """
    )

but not

    spark.sql(
        f"""
        CREATE TABLE TEST201 USING CSV AS  (SELECT 'test' MODEL_NAME,
                'Production' MODEL_STAGE) 
        """
    )

Below is the beginning of the error:

py4j.protocol.Py4JJavaError: An error occurred while calling o40.sql.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1) (127.0.0.1 executor driver): org.apache.spark.SparkException: [TASK_WRITE_FAILED] Task failed while writing rows to file:/<REDACTED>/spark-warehouse/test201.

The weird thing is that while I am writing the table with data, some part files get generated, but they are subsequently deleted.

garretwilson commented 10 months ago

@kdionyso perhaps this might be better placed in a separate ticket? I think this ticket is more about adding documentation. (And I want to get to that eventually! 😅 )

And if there is some way to get a stack trace, I could better understand the code path that leads to this problem.

kdionyso commented 10 months ago

@garretwilson Yes, sure. I was just wondering whether people with the setup described above have managed to write tables in pyspark/Windows.