salesforce / TransmogrifAI

TransmogrifAI (pronounced trăns-mŏgˈrə-fī) is an AutoML library for building modular, reusable, strongly typed machine learning workflows on Apache Spark with minimal hand-tuning
https://transmogrif.ai
BSD 3-Clause "New" or "Revised" License

[WIP] Scala 2.12 / Spark 3 upgrade #550

Open nicodv opened 3 years ago

nicodv commented 3 years ago

Related issues: https://github.com/salesforce/TransmogrifAI/issues/336 https://github.com/salesforce/TransmogrifAI/issues/332

Describe the proposed solution: Upgrade to Scala 2.12 and Spark 3.

Describe alternatives you've considered: Living in the past, suffering from security issues and missing out on feature and speed improvements.


salesforce-cla[bot] commented 3 years ago

Thanks for the contribution! Before we can merge this, we need @wsuchy @koertkuipers to sign the Salesforce.com Contributor License Agreement.

tovbinm commented 3 years ago

@leahmcguire I think it is just not being used - https://github.com/salesforce/TransmogrifAI/pull/550#discussion_r597091426

leahmcguire commented 3 years ago

We should be careful about how we define "unused" in a public project. Also, that functionality would be needed to migrate projects on TransmogrifAI V0...

emitc2h commented 3 years ago

Hey @tovbinm, there's a unit test failure I've been investigating that's the result of a bug in Spark: https://issues.apache.org/jira/browse/SPARK-34805?focusedCommentId=17337491&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17337491.

I'm wondering why testData in SanityCheckerTest.scala (L137) is constructed the way it is, with the metadata for the features column added manually. The fact that the metadata isn't passed along in DataFrame.select anymore is discovered by this assertion.

I'm assuming Spark won't fix this any time soon, and I'm having trouble finding an alternative way of putting in the metadata in the schema of testData. I've tried .withColumn, but it still relies on .select under the hood. What's your take on this?

nicodv commented 3 years ago

Also pinging @Jauntbox (we know you're out there!) for the question above.

tovbinm commented 3 years ago

This is indeed a known issue. We have been copying the metadata over between fields each time we apply our transformers, e.g. OpTransformer1.transform.

emitc2h commented 3 years ago

> This is a known issue indeed. We have been copying over the metadata between fields each time we apply our transformers, e.g OpTransformer1.transform

I mean that there is a new problem with Spark 3.1. Even OpTransformer1.transform is broken now since it relies on .select to pass back the metadata into the output dataframe. SelectedModelCombinerTest tests the .transform function directly and also fails for the same reason.

tovbinm commented 3 years ago

StructField still has the metadata in it; it's just that ExpressionEncoder in Spark 3.x does not allow passing it anymore. Oh, it's a true bummer. We rely heavily on this feature.
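
For anyone following along, here is a minimal sketch of one possible way to put metadata on a field without going through .select or .withColumn: copy the StructField with the new metadata and rebuild the DataFrame from its RDD with an explicit schema. This is an assumption-laden illustration, not what TransmogrifAI does and not verified against the Spark 3.1 behaviour discussed above. The object name MetadataSketch, the helper withFieldMetadata, the column names and the "origin" metadata key are all hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{Metadata, MetadataBuilder, StructType}

object MetadataSketch {

  // Hypothetical helper: returns a copy of `df` whose field `fieldName`
  // carries `metadata`. It rebuilds the DataFrame from the underlying RDD
  // with an explicit schema instead of using select/withColumn.
  def withFieldMetadata(df: DataFrame, fieldName: String, metadata: Metadata): DataFrame = {
    val newSchema = StructType(df.schema.fields.map { f =>
      if (f.name == fieldName) f.copy(metadata = metadata) else f
    })
    df.sparkSession.createDataFrame(df.rdd, newSchema)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("metadata-sketch")
      .getOrCreate()
    import spark.implicits._

    // Toy data; column names are made up for the example
    val df = Seq((1.0, "a"), (2.0, "b")).toDF("features", "label")
    val meta = new MetadataBuilder().putString("origin", "sketch").build()

    val tagged = withFieldMetadata(df, "features", meta)

    // Print what the schema reports directly and after a select, so the two
    // can be compared on whichever Spark version is in use
    println(tagged.schema("features").metadata)
    println(tagged.select("features").schema("features").metadata)

    spark.stop()
  }
}
```

Going through df.rdd has a cost (the rows are deserialized and re-encoded), so this is probably only reasonable for small test fixtures such as testData, not for the transformers themselves.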

hedibejaoui commented 3 years ago

Hello, any estimate of when this PR will be ready? Thank you!

nicodv commented 3 years ago

@hedibejaoui, we are running internal forks of TransmogrifAI and MLeap on Spark 3.1.1, so the bulk of the work has been done.

For public release, the MLeap dependency needs to be upgraded now that they're on Spark 3 too: https://github.com/combust/mleap/pull/765

But since they've upgraded to Spark 3.0.2 and TransmogrifAI is on 3.1.1, we have some testing left to do.

hedibejaoui commented 3 years ago

@nicodv Thanks for the information. We are actually using Spark 3.0.x because of some internal dependencies. Is there any chance of a public release of TransmogrifAI for that version?

Fatma-abdel commented 3 years ago

Hello, when do you think this PR will be merged for public use? Thank you!

EhsanSadr commented 3 years ago

Hi, this PR adds important functionality that I need for my project. When will this PR be merged?

Thank you

MeriamAffes commented 3 years ago

Hi, we are waiting for what this PR adds. When will it be available? Thanks