pohutukawa opened this issue 4 years ago
Digging deeper around here, it seems like there's an issue with PySpark 2.4 (issue #63, addressed in the unmerged pull request #64), even though the README.md clearly states that Flint is already compatible with PySpark 2.4 with Python >= 3.5.
Besides that issue, looking more closely at the README.md, I'm of the opinion that the install instructions are lacking. I only ran `pip install ts-flint` and nothing else. The instructions on ts-flint.readthedocs.io do not mention anything further, leading me to assume that a pip install is sufficient. The only thing the README mentions is that some Scala artifact available from Maven Central is required, but I'm not interested in building ts-flint, just in using it.
Anyway, following the Maven Central link takes me somewhere, but I have no idea what to download or where to place it. I presume some JAR file(s) will be required, but how? Any help would be appreciated for someone quite well informed on Python, OK on Java, but not at all at home in the Scala ecosystem. It would also be awesome to find this kind of information in both the README and the ReadTheDocs documentation.
Update: I got part way further. I tracked down the flint-0.6.0.jar file to download from Maven Central and placed it in my virtual environment's lib/python3.7/site-packages/pyspark/jars/ directory. With that, the TypeError mentioned in this issue's title disappears, and the code sample given above works.
Moving forward still gave problems, though. Next I was getting a py4j.protocol.Py4JJavaError: An error occurred while calling o97.showString. The solution to this problem was to grab the correct version of grizzled-slf4j (matching the Scala version Spark uses) from Maven Central (e.g. here https://mvnrepository.com/artifact/org.clapper/grizzled-slf4j_2.11/1.3.4) and place it in the same jars/ directory mentioned above.
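For anyone following along, the two downloads can be scripted roughly like this (a sketch only; the Maven Central URLs and the site-packages path are assumptions that depend on your Python version and virtualenv layout):

```shell
# PySpark's bundled jars directory inside the active virtualenv
# (adjust the Python version in the path to match your environment).
JARS_DIR="$VIRTUAL_ENV/lib/python3.7/site-packages/pyspark/jars"

# Flint itself (the Scala 2.11 build, matching Spark 2.4's default Scala).
curl -L -o "$JARS_DIR/flint-0.6.0.jar" \
  "https://repo1.maven.org/maven2/com/twosigma/flint/0.6.0/flint-0.6.0.jar"

# The missing transitive dependency mentioned above.
curl -L -o "$JARS_DIR/grizzled-slf4j_2.11-1.3.4.jar" \
  "https://repo1.maven.org/maven2/org/clapper/grizzled-slf4j_2.11/1.3.4/grizzled-slf4j_2.11-1.3.4.jar"
```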
With that done (and after the adaptation below for getting spark into the notebook), I got almost everything in the Flint Example.ipynb to work:
```python
import pyspark

sc = pyspark.SparkContext('local', 'Flint Example')
spark = pyspark.sql.SparkSession(sc)
```
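For completeness: before Flint's readers work, the notebook also needs a FlintContext wrapping the SQL context (a sketch based on the ts-flint documentation; it assumes the Flint jar is already on the classpath as described above, so it will only run in such an environment):

```python
import pyspark
from ts.flint import FlintContext

sc = pyspark.SparkContext('local', 'Flint Example')
sqlContext = pyspark.sql.SQLContext(sc)

# FlintContext wraps the SQLContext and exposes flintContext.read
# for building TimeSeriesDataFrames from parquet, dataframes, etc.
flintContext = FlintContext(sqlContext)
```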
The only thing still coughing up an error is the computation of sp500_decayed_return. I'll see if I can get that to work somehow as well.
It may be useful to mention some of these things in the install instructions. Most Python/data science practitioners are not developers who build such artifacts from Scala sources; they would happily run `pip install ts-flint` and think they're done.
Actually, the README mentions that you can use ts-flint with PySpark by:

```shell
pyspark --jars /path/to/flint-assembly-{VERSION}-SNAPSHOT.jar --py-files /path/to/flint-assembly-{VERSION}-SNAPSHOT.jar
```
For me, adding --jars and --py-files fixed the problem.
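Concretely, the invocation looks something like this (the jar path and version are placeholders for whatever assembly jar you actually built or downloaded):

```shell
# Pass the Flint assembly jar both to the JVM (--jars) and to the
# Python side (--py-files), as the README suggests.
FLINT_JAR=/path/to/flint-assembly-0.6.0-SNAPSHOT.jar
pyspark --jars "$FLINT_JAR" --py-files "$FLINT_JAR"
```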
I'm using macOS and installed Spark 2.4.4 with Homebrew, but I used the approach of adding the Flint JAR to Spark's lib folder.
Now I'm stuck at the Py4JJavaError as well:

```
Py4JJavaError: An error occurred while calling o409.concatArrowAndExplode.
: java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.expressions.codegen.ExprCode.code()Ljava/lang/String;
```
vkrot's approach did not work for me either
When I switched back to Spark 2.3.4, I get this error:

```
Py4JJavaError: An error occurred while calling o420.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 8.0 failed 1 times, most recent failure: Lost task 0.0 in stage 8.0 (TID 8, localhost, executor driver): org.apache.spark.util.TaskCompletionListenerException: Memory was leaked by query. Memory leaked: (4096)
Allocator(ROOT) 0/4096/4096/9223372036854775807 (res/actual/peak/limit)
```
To use Flint, I first downloaded flint-0.6.0.jar directly from the Maven repository for Flint into the libexec/jars directory of Spark:

```
/usr/local/Cellar/apache-spark/2.4.4/libexec/jars
```
Note: I also had to download the missing dependency grizzled-slf4j_2.11; check the Compile Dependencies section in the Maven repository entry for Flint for other missing dependencies.
Then I installed the Python artifact with pip:

```shell
pip install ts-flint
```
Now I can simply invoke pyspark or spark-submit without the --jars and --py-files parameters.
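The steps above can be summarized as follows (the paths are specific to Spark 2.4.4 installed via Homebrew and to the jar versions mentioned; adjust them to your setup):

```shell
# Spark's own jars directory for a Homebrew install of Spark 2.4.4.
SPARK_JARS=/usr/local/Cellar/apache-spark/2.4.4/libexec/jars

# Copy the previously downloaded jars into Spark's jars directory so that
# pyspark/spark-submit pick them up without extra flags.
cp flint-0.6.0.jar grizzled-slf4j_2.11-1.3.4.jar "$SPARK_JARS/"

pip install ts-flint
pyspark   # no --jars / --py-files needed now
```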
@pohutukawa I have the same problem, did you find a solution?
```
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~/.conda/envs/py3/lib/python3.7/site-packages/ts/flint/java.py in new_reader(self)
     37         try:
---> 38             return utils.jvm(self.sc).com.twosigma.flint.timeseries.io.read.TSReadBuilder()
     39         except TypeError:
TypeError: 'JavaPackage' object is not callable

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
<ipython-input-12-87fea234d827> in <module>
----> 1 df = fc.read.options(timeColumn="ds").parquet(TB.SALECOUNT_OUT, is_sorted=False)
~/.conda/envs/py3/lib/python3.7/site-packages/ts/flint/context.py in read(self)
     82         '''
     83
---> 84         return readwriter.TSDataFrameReader(self)
~/.conda/envs/py3/lib/python3.7/site-packages/ts/flint/readwriter.py in __init__(self, flintContext)
     49         self._sqlContext = self._flintContext._sqlContext
     50         self._jpkg = java.Packages(self._sc)
---> 51         self._reader = self._jpkg.new_reader()
     52         self._parameters = self._reader.parameters()
     53
~/.conda/envs/py3/lib/python3.7/site-packages/ts/flint/java.py in new_reader(self)
     38             return utils.jvm(self.sc).com.twosigma.flint.timeseries.io.read.TSReadBuilder()
     39         except TypeError:
---> 40             return utils.jvm(self.sc).com.twosigma.flint.timeseries.io.read.ReadBuilder()
     41
     42     @property
TypeError: 'JavaPackage' object is not callable
```
OK, I figured it out. My problem seems to be mainly caused by the Jupyter/Spark session cache. Restarting the Jupyter notebook and creating a fresh Spark session with {"spark.jars": "hdfs://maskxdc/test/flint-0.6.0.jar"} solved the problem. It actually only needs the spark.jars config.
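In a notebook, that fresh session can be created along these lines (a sketch; the HDFS path to the jar is the one from the comment above and is specific to that cluster):

```python
from pyspark.sql import SparkSession

# Point spark.jars at the Flint jar so both driver and executors load it;
# in this setup that is the only Flint-specific configuration needed.
spark = (
    SparkSession.builder
    .appName("flint-example")
    .config("spark.jars", "hdfs://maskxdc/test/flint-0.6.0.jar")
    .getOrCreate()
)
```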
I was facing the same problem with Pyarrow.
When I try to enable PyArrow optimization like this:

```python
spark.conf.set('spark.sql.execution.arrow.enabled', 'true')
```
I get the following warning:

```
createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to true; however failed by the reason below: TypeError: 'JavaPackage' object is not callable
```
I solved this problem by:
1. First, I printed the key-value pairs of the Spark configuration:

```python
import os
from pyspark import SparkConf

spark_config = SparkConf().getAll()
for conf in spark_config:
    print(conf)
```
2. I found the path to my jar files in this key-value pair:
`('spark.yarn.jars', 'path\to\jar\files')`
3. After I found the path where my jar files are located, I printed the names of the jars for PyArrow, like this:

```python
jar_names = os.listdir('path\to\jar\files')
for jar_name in jar_names:
    if 'arrow' in jar_name:
        print(jar_name)
```
I found the following jars:

```
arrow-format-0.10.0.jar
arrow-memory-0.10.0.jar
arrow-vector-0.10.0.jar
```
4. Then I added the paths of the Arrow jars to the Spark session config, using `:` as the delimiter for multiple jar file paths:

```python
spark.conf.set('spark.driver.extraClassPath', 'path\to\jar\files\arrow-format-0.10.0.jar:path\to\jar\files\arrow-memory-0.10.0.jar:path\to\jar\files\arrow-vector-0.10.0.jar')
```
5. Then I restarted the kernel, and the PyArrow optimization worked.
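Steps 2–4 can be folded into a small helper that builds the extraClassPath string from a directory listing (a sketch; the jar directory path and the `:` delimiter assume a macOS/Linux setup):

```python
import os

def arrow_extra_classpath(jars_dir, jar_names=None):
    """Build a ':'-joined classpath string of the Arrow jars in jars_dir.

    jar_names can be passed explicitly (e.g. for testing); by default
    the directory is listed with os.listdir.
    """
    if jar_names is None:
        jar_names = os.listdir(jars_dir)
    arrow_jars = sorted(
        os.path.join(jars_dir, name)
        for name in jar_names
        if 'arrow' in name and name.endswith('.jar')
    )
    return ':'.join(arrow_jars)

# Usage (hypothetical path):
# spark.conf.set('spark.driver.extraClassPath',
#                arrow_extra_classpath('/path/to/jar/files'))
```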
Whenever I try to use Flint locally (no Hadoop/EMR involved), it keeps barfing at me with the error message in the subject. It's a setup on top of Python 3.7 with PySpark 2.4.4 and OpenJDK 8, on an Ubuntu 19.04 install.
Note: As I'm running locally only, I'm getting this log message from Spark, but everything runs perfectly using vanilla PySpark:
It happens when I try to read a PySpark dataframe into a ts.flint.TimeSeriesDataFrame. This example is adapted from the Flint Example.ipynb: The last line causes the "boom", with this (first part of the) stack trace:
Any ideas what may be going wrong and how the problem could be solved?