Open LeoDashTM opened 5 years ago
Update: so dir does show that leftJoin is present for both l and r.
I can even print the 2 methods, but invoking it on one with the other as argument does not work for some reason... Any ideas?
Some googling suggests there may be an incompatibility between the framework and the library versions, but your front page does say that the library works with Spark 2.3, and the lib's version (per Python) is 0.6.0. Is there a more thorough way to figure out if the framework and the library are compatible with one another, @icexelloss? Thanks.
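As a first step toward the compatibility question, it can help to record exactly which versions the Python environment sees. The sketch below is only one way to do that, under stated assumptions: installed_version is a hypothetical helper, and the distribution names pyspark and ts-flint are assumed to be how the packages are registered in the environment. It reports versions; it does not prove that they are compatible.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Distribution names are assumptions; adjust them to match the environment.
    for name in ("pyspark", "ts-flint"):
        print(name, installed_version(name))
```

In a notebook with an active session, spark.version reports the Spark version of the cluster itself, which is worth comparing against the values printed here.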
The error trace is calling this line surely:
https://github.com/apache/spark/blob/master/python/pyspark/sql/utils.py#L60
Are you using the databricks jar? The apache jar doesn't work on databricks runtime for some reason.
Li Jin, I appreciate your reply. I'm not sure what you mean by the "databricks jar".
I downloaded the flint jar file that is currently loaded onto DataBricks, its size is 2171677 and here is its manifest file:
Manifest-Version: 1.0
Implementation-Title: flint
Implementation-Version: 0.6.0
Specification-Vendor: com.twosigma
Specification-Title: flint
Implementation-Vendor-Id: com.twosigma
Specification-Version: 0.6.0
Implementation-URL: https://github.com/twosigma/flint
Implementation-Vendor: com.twosigma
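For comparing several jars (the one loaded on the cluster, the one built from source, the Databricks build), the manifest can be read without unpacking anything. This is a minimal sketch using only the Python standard library; read_manifest is a hypothetical helper, and it only parses the main section of the manifest.

```python
import zipfile


def read_manifest(jar_path):
    """Return the main-section attributes of META-INF/MANIFEST.MF as a dict."""
    with zipfile.ZipFile(jar_path) as jar:
        text = jar.read("META-INF/MANIFEST.MF").decode("utf-8")
    attrs = {}
    key = None
    for line in text.splitlines():
        if not line.strip():
            break  # a blank line ends the main section
        if line.startswith(" ") and key is not None:
            attrs[key] += line[1:]  # continuation of a wrapped value
        elif ":" in line:
            key, _, value = line.partition(":")
            attrs[key] = value.lstrip(" ")
    return attrs
```

Calling read_manifest on each jar and comparing Implementation-Version and Implementation-Vendor shows whether two files with different sizes at least claim to be the same release.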
As I stated previously: The flint libraries (the Scala and the Python ones) I installed on DataBricks via its UI (from the respective online repos, which might be dated) - I can try and install the latest builds from the freshest source code, if you think that will help. Or if you could provide me with a jar that you think should work I could try installing and using it instead.
Please, assist further! Thanks.
Built from source (master branch) and its size is 8075230 (flint-assembly-0.6.0-SNAPSHOT.jar).
There are issues with using the standard Flint jar on the Databricks platform, so they build a specific jar for Flint 0.6 for their platform here:
https://github.com/databricks/databricks-accelerators/tree/master/projects/databricks-flint
I downloaded the one from DataBricks; its size is 2252415 and its name is flint_0_6_0_databricks.jar. Still the same error... What gives?
@LeoDashTM Could you please try this in the local pyspark notebook with --packages?
I'm not too familiar yet with local pyspark, but see the 2nd comment of the issue I opened with DataBricks: https://github.com/databricks/databricks-accelerators/issues/1 I used the --jars argument there.
Ok, actually just tried:
spark-submit --packages com.twosigma:flint:0.6.0 test.py
and that did seem to produce the results I want!
So is it because locally I'm running 2.3.2 and it's 2.3.1 on DataBricks?
I'm not sure I can upgrade DataBricks instances from 2.3.1 to 2.3.2; the former is the highest stable version of Spark available on DB at the moment...
The data file is attached to the first comment and both have the reproducible code (the first one for DB and the second one for a local installation). Are you able to run either one of them against Spark 2.3.1 and see the issue? I really need it to work against that version, you see.
Thanks.
Update.
Just installed Spark 2.3.1 locally, and running
pyspark --packages com.twosigma:flint:0.6.0
then executing each line by hand seems to work correctly.
So the deal is indeed with the flavor of Spark installed on DataBricks then? Is that the right conclusion?
It sounds like it’s a Databricks specific issue. I think the way they handle third party packages is different from the open source one. That’s why the Databricks jar exists in the first place.
With the Databricks' latest commit: https://github.com/databricks/databricks-accelerators/commit/d23782aa5036b04be0102f6fbe778690eae54d3a the printing of the clocks and the joining both seem to be working.
The results of a clock creation and printing are a bit unexpected for me:
>>> l = clocks.uniform(fc, '30s', begin_date_time='2018-8-1 5:55:35', end_date_time='2018-08-01 05:56:05')
>>> l.show( truncate = False )
+-----------------------+
|time |
+-----------------------+
|50552-02-05 14:23:200.0|
|50552-02-05 22:43:200.0|
+-----------------------+
Are these results as expected, @icexelloss? Are you getting the exact same thing on your setup? And if this is as expected, then how do I create a clock with the range I specified via begin_date_time and end_date_time?
If not, what is the deal? Any idea?
Thanks.
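One possible reading of the output above, offered as an assumption and not a confirmed diagnosis: the year 50552 and the 8 h 20 min gap between the two rows are what one would get if a nanosecond epoch value were read as microseconds, which inflates every interval by a factor of 1000. The sketch below only checks that arithmetic; misread_ns_as_us is a hypothetical helper, and UTC is assumed for the begin time.

```python
from datetime import datetime, timedelta, timezone

SECONDS_PER_GREGORIAN_YEAR = 31_556_952  # 365.2425 days


def misread_ns_as_us(dt):
    """Seconds since the epoch obtained when a nanosecond count is read as microseconds."""
    epoch_seconds = int(dt.timestamp())
    nanos = epoch_seconds * 1_000_000_000
    return nanos // 1_000_000  # the same integer, interpreted as microseconds


# Begin time from the clocks.uniform call, assumed to be UTC.
begin = datetime(2018, 8, 1, 5, 55, 35, tzinfo=timezone.utc)
seconds = misread_ns_as_us(begin)

approx_year = 1970 + seconds // SECONDS_PER_GREGORIAN_YEAR
step = misread_ns_as_us(begin + timedelta(seconds=30)) - seconds

print(approx_year)              # 50552
print(timedelta(seconds=step))  # 8:20:00
```

Both numbers line up with the displayed rows (50552-02-05, and 14:23:20 to 22:43:20), which would point at a unit mismatch in how the time column is converted, not at the begin_date_time and end_date_time arguments themselves.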
Ok, I tried with TwoSigma's official latest build and the data frame is displayed as expected, so it's a DataBricks issue again...
Same issue: https://github.com/twosigma/flint/issues/58 ? @icexelloss ?
@icexelloss
(even though the "timeColumn" argument error can be bypassed by renaming the column in question to time) the leftJoin is not working for me:
With the output being:
By the way, what is the way to even display the contents of the clocks DataFrame? The second to last commented out line (with the .show command) errors out, so I don't understand how TimeSeriesDataFrame is inheriting from a regular DataFrame, for which that method is available... The display method also fails...
Anyways, what is wrong with the leftJoin here? The clock is on the left like you indicated it should be. Swapping left and right data frames also does not help.
Is this reproducible for you?
Please, advise, if I'm not using/calling it correctly or if it's a bug.
The flint libraries (the Scala and the Python ones) I installed on DataBricks via its UI (from the respective online repos, which might be dated) - I can try and install the latest builds from the freshest source code, if you think that will help.
Thanks.