ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org
Apache License 2.0

Making Ibis work with SQLAlchemy 1.4 #2690

Closed datapythonista closed 3 years ago

datapythonista commented 3 years ago

Closes #2689

datapythonista commented 3 years ago

> Can you check the version of SQLAlchemy and use the original if < 1.4, for compatibility? (Also, we likely have a fixed version of SQLAlchemy in the minimum versions.)

> It might now work with LargeBinary in the older version (not sure).

sqlalchemy.Binary was renamed to sqlalchemy.LargeBinary in 0.6, 11 years ago, so any supported version will work with this change.

It's a good idea to have SQLAlchemy pinned somewhere anyway. In any case, it looks like more things are broken, and they are not as straightforward. I will need to have a look.
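
For context, a minimal sketch of what a version gate would look like, assuming `packaging` is available (illustrative, not the actual Ibis code; as noted above, no gate is actually needed for this particular rename):

```python
# Illustrative sketch, not the actual Ibis code. sqlalchemy.Binary was a
# long-standing alias of LargeBinary that 1.4 removed, so referencing
# LargeBinary directly works on every SQLAlchemy release since 0.6.
import sqlalchemy
from packaging.version import Version

SQLA_1_4_PLUS = Version(sqlalchemy.__version__) >= Version("1.4")

Binary = sqlalchemy.LargeBinary  # safe on both sides of the 1.4 boundary
```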

datapythonista commented 3 years ago

I opened an issue in SQLAlchemy. Depending on the outcome, I will look for a workaround: https://github.com/sqlalchemy/sqlalchemy/issues/6068

datapythonista commented 3 years ago

Fixed an SQLAlchemy 1.4 regression in sqlalchemy.case and reported it here: https://github.com/sqlalchemy/sqlalchemy/issues/6081

There are still more problems pending with SQLAlchemy 1.4.
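
For reference, the 1.4 release also changed the calling convention of sqlalchemy.case, which illustrates the kind of breakage involved (whether this is the exact regression reported above isn't stated in the thread):

```python
# Illustrative only: SQLAlchemy 1.4 deprecated the 1.3-era list argument
# to case() in favor of positional "when" tuples.
from sqlalchemy import case, column

x = column("x")
old_style = case([(x > 0, "pos")], else_="neg")  # 1.3 style, deprecated in 1.4
new_style = case((x > 0, "pos"), else_="neg")    # 1.4 style
```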

emilyreff7 commented 3 years ago

Hi @datapythonista, do you know if there is an ETA for this? Do we need to wait on changes from the SQLAlchemy side, or could we just pin our CI to the previous version in the meantime? Thanks!

datapythonista commented 3 years ago

I'll finish this early next week; I don't have time before then. I tried pinning, but it wasn't as straightforward as I thought, and I closed that PR.

SQLAlchemy will probably fix the regressions, but I'll make Ibis work with the broken version as well.
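
For reference, the pin itself would look something like this in a conda environment file (illustrative; the actual file names, and the reasons pinning wasn't straightforward, aren't shown in the thread):

```yaml
# Illustrative pin; the actual ibis CI environment files may differ.
dependencies:
  - sqlalchemy>=1.3,<1.4
```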

datapythonista commented 3 years ago

We also have the Spark failure, which is preventing the CI from being green, and I have no idea what the issue is there. Maybe you can have a look at that, as it will be the blocker soon.

icexelloss commented 3 years ago
```
a = ('xro12970', <py4j.java_gateway.GatewayClient object at 0x7f93ea2b3dd0>, 'o12963', 'collectToPython')
kw = {}
s = 'java.lang.IllegalArgumentException: Unsupported class file major version 55'
stackTrace = 'org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:166)\n\t at org.apache.xbean.asm6.ClassReader.<init>(ClassR...)\n\t at py4j.GatewayConnection.run(GatewayConnection.java:238)\n\t at java.base/java.lang.Thread.run(Thread.java:834)'

    def deco(*a, **kw):
        try:
            return f(*a, **kw)
        except py4j.protocol.Py4JJavaError as e:
            s = e.java_exception.toString()
            stackTrace = '\n\t at '.join(map(lambda x: x.toString(),
                                             e.java_exception.getStackTrace()))
            if s.startswith('org.apache.spark.sql.AnalysisException: '):
                raise AnalysisException(s.split(': ', 1)[1], stackTrace)
            if s.startswith('org.apache.spark.sql.catalyst.analysis'):
                raise AnalysisException(s.split(': ', 1)[1], stackTrace)
            if s.startswith('org.apache.spark.sql.catalyst.parser.ParseException: '):
                raise ParseException(s.split(': ', 1)[1], stackTrace)
            if s.startswith('org.apache.spark.sql.streaming.StreamingQueryException: '):
                raise StreamingQueryException(s.split(': ', 1)[1], stackTrace)
            if s.startswith('org.apache.spark.sql.execution.QueryExecutionException: '):
                raise QueryExecutionException(s.split(': ', 1)[1], stackTrace)
            if s.startswith('java.lang.IllegalArgumentException: '):
>               raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
E               pyspark.sql.utils.IllegalArgumentException: 'Unsupported class file major version 55'
```

Was any Java dependency updated here? I thought SQLAlchemy was Python-only?

icexelloss commented 3 years ago

I will do a bit of debugging.

icexelloss commented 3 years ago

https://github.com/ibis-project/ibis/pull/2694/checks?check_run_id=2150005088

It looks like the Java version is indeed 11 in the new CI, which doesn't work with the Spark 3.7 pipeline (it runs Spark 2.4.3, and class file major version 55 is the Java 11 class-file format). Although I am not sure whether this is a change introduced by this PR or some other change.

icexelloss commented 3 years ago

Let me try setting JAVA_HOME to JAVA_HOME_8_X64 (/usr/lib/jvm/adoptopenjdk-8-hotspot-amd64) in the Spark 3.7 pipeline.
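
In workflow terms, presumably a fragment along these lines (illustrative; the actual ibis workflow file may differ):

```yaml
# Illustrative fragment: point the PySpark job at the JDK 8 installation
# that ships with the runner image.
env:
  JAVA_HOME: /usr/lib/jvm/adoptopenjdk-8-hotspot-amd64
```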

datapythonista commented 3 years ago

Thanks for helping with this @icexelloss. There are two different and unrelated things broken in the CI. One is the release of SQLAlchemy 1.4, which introduced some regressions. The second, which was failing before, is the Java error in the PySpark build. I'll be looking at the pending SQLAlchemy regressions on Monday. The PySpark issue is unrelated, and I don't have any idea why it's failing, but it's not caused by the new SQLAlchemy release.

icexelloss commented 3 years ago

@datapythonista Thanks for the info. I think the PySpark failure is perhaps due to an orthogonal change in the base image (still no idea why JAVA_HOME points to Java 11 now). I am working on setting that back for the Spark/PySpark 3.7 pipeline.

datapythonista commented 3 years ago

> @datapythonista Thanks for the info. I think the PySpark failure is perhaps due to an orthogonal change in the base image (still no idea why JAVA_HOME points to Java 11 now). I am working on setting that back for the Spark/PySpark 3.7 pipeline.

In case it's useful, I opened PR #2688 reverting the changes made since PySpark started failing. I was trying to check whether it was those changes (which seem unrelated) or, more likely, something external, like a change in a dependency. The problem is that SQLAlchemy then started to fail, and I couldn't tell whether the error was still happening or not.

gerrymanoim commented 3 years ago

@icexelloss - The Java change is because of https://github.com/actions/virtual-environments/issues/1816. We use ubuntu-latest as our OS test image, and that recently changed to 20.04 (the Actions CI had warned for a long time that this was coming). With that change, the default Java changed: JAVA_HOME now points to Java 11.

datapythonista commented 3 years ago

Thanks @gerrymanoim, that's very useful to know. I think we can then fix the CI by using #2695 (it will need the SQLAlchemy problem fixed first), and then create an issue to make Ibis compatible with the latest version. Does that make sense?

icexelloss commented 3 years ago

@datapythonista I fixed the Spark CI and here is the PR to your branch: https://github.com/datapythonista/ibis/pull/3

icexelloss commented 3 years ago

Thanks @gerrymanoim. Super helpful!

datapythonista commented 3 years ago

> @datapythonista I fixed the Spark CI and here is the PR to your branch: datapythonista#3

I think the best way to fix the problem would be:

  • Make sure in the code that we raise an exception with a meaningful message if an incompatible version of PySpark and Java is found
  • In the CI, install Java with conda (add java=XXX to the PySpark minimum dependencies)
  • I don't think we can do much for pip, but I guess if we require Java from conda-forge, the PySpark and Java versions should then be compatible, and we shouldn't have this problem anymore

Or if we really want to use the environment variable approach, you can just set them in the build YAML; you can add an include option in the matrix to specify environment variables depending on which build of the matrix is being executed.
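
For reference, the matrix-include approach might look like this (illustrative sketch; the real ibis workflow file and its matrix keys may differ):

```yaml
# Illustrative sketch of per-build environment variables via a matrix
# include; not the actual ibis workflow.
jobs:
  tests:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        pipeline: [pyspark, other]
        include:
          - pipeline: pyspark
            java_home: /usr/lib/jvm/adoptopenjdk-8-hotspot-amd64
    env:
      JAVA_HOME: ${{ matrix.java_home }}  # empty unless set via include
```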

icexelloss commented 3 years ago

> @datapythonista I fixed the Spark CI and here is the PR to your branch: datapythonista#3
>
> I think the best way to fix the problem would be:
>
>   • Make sure in the code that we raise an exception with a meaningful message if an incompatible version of PySpark and Java is found
>   • In the CI, install Java with conda (add java=XXX to the PySpark minimum dependencies)
>   • I don't think we can do much for pip, but I guess if we require Java from conda-forge, the PySpark and Java versions should then be compatible, and we shouldn't have this problem anymore
>
> Or if we really want to use the environment variable approach, you can just set them in the build YAML; you can add an include option in the matrix to specify environment variables depending on which build of the matrix is being executed.

Yeah, this sounds good. I definitely don't have much experience building CI with Docker, so I trust your decision here.

datapythonista commented 3 years ago

> Yeah, this sounds good. I definitely don't have much experience building CI with Docker, so I trust your decision here.

Happy to take care of it if that sounds good; it shouldn't take me long to implement. Or let me know if you give it a try and have any questions or problems.

icexelloss commented 3 years ago

> Yeah, this sounds good. I definitely don't have much experience building CI with Docker, so I trust your decision here.
>
> Happy to take care of it if that sounds good; it shouldn't take me long to implement. Or let me know if you give it a try and have any questions or problems.

Trying now. I added Java to pyspark-min.yml and pyspark.yml; the CI is running now, let's see how that goes.
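
For reference, the conda environment addition presumably looks something like this (illustrative; the exact package spec used in the PR isn't shown in the thread):

```yaml
# Illustrative addition to pyspark-min.yml / pyspark.yml; the exact pin
# used in the actual PR isn't shown in the thread.
dependencies:
  - openjdk=8
```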

> Make sure in the code that we raise an exception with a meaningful message if an incompatible version of PySpark and Java is found

I didn't add checks for this: since we set a specific Java version now, it should be less of an issue, and I also couldn't figure out a very clean way to do it.

datapythonista commented 3 years ago

> I didn't add checks for this: since we set a specific Java version now, it should be less of an issue, and I also couldn't figure out a very clean way to do it.

It's not an issue for us, but users can run Ibis in their own environments, with a new Java but PySpark 2. I think they'd appreciate it if, instead of the error we're getting now, we showed them a nice message like "Ibis with PySpark 2 works with JDK up to version 8. Use PySpark 3 to use a higher version of the JDK."
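
For what it's worth, a check along these lines might look like the following (a minimal sketch, assuming parsing `java -version` output is acceptable; `check_java_compat` is a hypothetical helper, not part of Ibis):

```python
# Illustrative sketch only; not code from this PR. It shows the kind of
# friendly check being discussed: fail fast with a clear message when
# PySpark 2 is paired with a JDK newer than 8.
import subprocess

def check_java_compat(pyspark_major: int) -> None:
    # `java -version` prints to stderr, e.g. 'openjdk version "11.0.2" ...'
    # or 'java version "1.8.0_292" ...'.
    out = subprocess.run(["java", "-version"], capture_output=True, text=True).stderr
    version = out.split('"')[1]  # '11.0.2' or '1.8.0_292'
    parts = version.split(".")
    # Pre-9 JDKs report '1.<major>.<patch>'; later JDKs report '<major>.<minor>.<patch>'.
    major = int(parts[1]) if parts[0] == "1" else int(parts[0])
    if pyspark_major < 3 and major > 8:
        raise RuntimeError(
            "Ibis with PySpark 2 works with JDK up to version 8. "
            "Use PySpark 3 to use a higher version of the JDK."
        )
```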

icexelloss commented 3 years ago

@datapythonista I tried adding openjdk in conda and it did work locally (JAVA_HOME is set properly after installing openjdk); however, in the CI, JAVA_HOME is unchanged after I install openjdk (e.g.: https://github.com/ibis-project/ibis/pull/2697/checks?check_run_id=2151864647).

I wonder if you have any idea what might be happening?

gerrymanoim commented 3 years ago

I'm a bit surprised you have to install openjdk via conda instead of setting the appropriate env variable in the Spark pipeline. Is conda's openjdk=8 different from JAVA_HOME_8_X64?

icexelloss commented 3 years ago

@datapythonista We are kind of blocked by this now. In the interest of time, I suggest that we use one of the following changes to unblock Spark (either one should work):

I tried the conda idea, but it didn't quite work and I am not sure why, so I suggest we leave it as a follow-up:

datapythonista commented 3 years ago

@icexelloss I'm on this now. I fixed one of the errors this morning, and I'm working on what I think is the last one, which is a bit trickier (but I expect to have it fixed today).

Once SQLAlchemy 1.4 is working, I'd merge #2695, and I think that should leave the CI green. Then I think it makes sense to make things more robust around Spark/Java version compatibility, but we can do that without anything being blocked.

Does this make sense to you?

icexelloss commented 3 years ago

> @icexelloss I'm on this now. I fixed one of the errors this morning, and I'm working on what I think is the last one, which is a bit trickier (but I expect to have it fixed today).
>
> Once SQLAlchemy 1.4 is working, I'd merge #2695, and I think that should leave the CI green. Then I think it makes sense to make things more robust around Spark/Java version compatibility, but we can do that without anything being blocked.
>
> Does this make sense to you?

Sounds good to me! Thank you very much!

datapythonista commented 3 years ago

@jreback this is green now (the overall check shows red only because the Spark error is independent; we'll fix it in #2695). Do you mind having a look, please?

tl;dr:

The test that I had to change is not a material change. The SQL query is the same, and in 1.3 there is no difference between the old test and the new one, but in 1.4 the query misses the table prefix in the column names (e.g. `SELECT name FROM whatever AS t0` instead of `SELECT t0.name FROM whatever AS t0`). I couldn't find a simple way to make the tests pass without this change.

jreback commented 3 years ago

thanks @datapythonista