salesforce / TransmogrifAI

TransmogrifAI (pronounced trăns-mŏgˈrə-fī) is an AutoML library for building modular, reusable, strongly typed machine learning workflows on Apache Spark with minimal hand-tuning
https://transmogrif.ai
BSD 3-Clause "New" or "Revised" License

Custom Transformer Pipeline #444

Closed qingyuanxingsi closed 4 years ago

qingyuanxingsi commented 4 years ago

Thanks for your great work! In our testing, we found that automatic feature engineering cannot fully address our problems, so we would like the ability to customize the transformer pipeline.

For example, we would like to configure the feature transformations in a JSON file, load it, and apply the transformations accordingly. Are there any guidelines for accomplishing this?

Much thanks!

tovbinm commented 4 years ago

Thank you!

You can write some custom logic that dynamically creates features based on some configuration or input arguments. One such example already present in the codebase is FeatureBuilder.fromSchema, which dynamically builds feature extractors for all Spark dataframe types. One can then similarly operate on the features using their types and apply further transformations, e.g.:

// Let's assume we materialized these features dynamically, in this example from the struct type of a Spark dataframe
val df: DataFrame = ???
val (response: FeatureLike[_ <: FeatureType], features: Array[FeatureLike[_ <: FeatureType]]) =
   FeatureBuilder.fromSchema(df.schema, response = "label")

// Apply type-specific transformations for particular feature types (can be conditioned on your config)
val texts: Array[FeatureLike[Text]] = features.collect { case f if f.isSubtypeOf[Text] => f.asInstanceOf[FeatureLike[Text]] }
val tokenized: Array[FeatureLike[TextList]] = texts.map(_.tokenize())
val integrals: Array[FeatureLike[Integral]] = features.collect { case f if f.isSubtypeOf[Integral] => f.asInstanceOf[FeatureLike[Integral]] }
val abs: Array[FeatureLike[Integral]] = integrals.map(_.abs())

// Vectorize all the desired features
val vectorized: FeatureLike[OPVector] = (tokenized ++ abs).transmogrify(label = Some(response.asInstanceOf[FeatureLike[RealNN]]))
qingyuanxingsi commented 4 years ago

@tovbinm I'm aware of your point. What we would like is flexible control over the transformations applied to one or several columns, to support hand-crafted features (different datasets need different treatment). We want to load a JSON file and construct the corresponding transformers automatically (possibly without explicit mapping). We do not want to modify the code; the JSON file alone should determine the transformation pipeline. Fixed transformations per type cannot meet all our requirements.

In other words, a common way to dynamically build a transformer/estimator from a JSON file with all params set: no code modification, just config.

tovbinm commented 4 years ago

I see. You would need to develop some custom code to interpret a JSON config file into a sequence of custom transformations in TransmogrifAI.

We did something similar in the past. Perhaps @tillbe would be willing to reveal some ideas on how to implement it?

qingyuanxingsi commented 4 years ago

@tovbinm @tillbe Any updates?

tillbe commented 4 years ago

The implementation will depend on your exact needs, but ultimately you will have to write a DSL/parser for your custom features, e.g. using ANTLR; at least that's how we solved it when we had a similar use case. An alternative is FastParse if you want to stay in native Scala. The general approach:

  1. Define a schema for your JSON (or whatever other format you want).
  2. Parse that schema and match it to the dataframe fields.
  3. Apply the custom transformers from the schema to those fields.
  4. Optionally apply automatic transformers to the rest.
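Steps 2-3 above can be sketched in plain Scala. This is a minimal, hypothetical illustration: `TransformSpec`, the transform keywords, and the string results are all made up for the example (a real pipeline would resolve each spec to a TransmogrifAI feature transformation rather than a string), and matching a spec to a field is reduced to carrying the field name along.

```scala
// A parsed config entry: which field to transform, and how.
// Both names are illustrative, not TransmogrifAI API.
final case class TransformSpec(field: String, transform: String)

// Match each spec against a whitelist of supported transformations.
// Here we return a description string; a real implementation would
// return the transformed FeatureLike instead.
def resolve(spec: TransformSpec): Either[String, String] = spec.transform match {
  case "tokenize" => Right(s"tokenize(${spec.field})")
  case "abs"      => Right(s"abs(${spec.field})")
  case other      => Left(s"unsupported transform: $other")
}

val specs = Seq(TransformSpec("title", "tokenize"), TransformSpec("age", "abs"))
val resolved = specs.map(resolve)
```

Unsupported transform names surface as `Left` values, so a bad config fails visibly instead of being silently skipped.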

I hope this helps, happy to elaborate further.

qingyuanxingsi commented 4 years ago

@tillbe My previous idea is like this:

{
    "className":"com.salesforce.op.stages.impl.feature.AliasTransformer",
    "params":{
        "outputFeatureName":"test"
    }
}

We could define a JSON file like this, load it, and parse it into an AliasTransformer to perform the transformation; then we could create more transformers along the lines of AliasTransformer. I'm wondering why this would not work?
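The className-driven idea is plain JVM reflection: load a class by the name given in config and construct it with the configured params. A self-contained sketch, using java.lang.StringBuilder as a stand-in for a transformer class (constructing a real TransmogrifAI stage would additionally require setting its input features and typed params, which reflection alone does not give you):

```scala
// Load a class by its configured name and construct it with a single
// String argument, mirroring the "className" + "params" JSON above.
def instantiate(className: String, param: String): AnyRef =
  Class.forName(className)
    .getConstructor(classOf[String])
    .newInstance(param)
    .asInstanceOf[AnyRef]

// Stand-in for "com.salesforce.op.stages.impl.feature.AliasTransformer"
val obj = instantiate("java.lang.StringBuilder", "test")
```

The catch is that each transformer's constructor signature and parameter types must be known (or discovered reflectively) for this to work generically, which is why an explicit whitelist of supported transformers is often simpler.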

If it won't work, can you give a more detailed example illustrating your idea? I'm new to ANTLR.

tillbe commented 4 years ago

In general this approach should work. You probably have to parse the JSON into a case class first and then pattern match on it. In that case you won't need ANTLR or FastParse; a JSON decoder like circe is enough.

import io.circe.generic.auto._
import io.circe.parser.decode // you can use any JSON library here

case class CustomFeature(
    transformer: CustomTransformer, // an enum listing all custom transformers you want to support
    params: Map[String, String]
    // ... anything else you need to store
)

val customFeaturesString: String = ??? // load the json file here

val decoded = decode[Seq[CustomFeature]](customFeaturesString) // Either[io.circe.Error, Seq[CustomFeature]]
val features = decoded.getOrElse(Seq.empty).map { customFeature =>
  customFeature.transformer match {
    case AliasTransformer => // do something
    case OtherTransformer => // do something else
    // ... and so on
  }
}

And the json file can look like this:

[
  {
    "transformer": "AliasTransformer",
    "params": {
      "outputFeatureName": "test"
    }
  },
  {
    "transformer": "OtherTransformer",
    "params": {
      "otherParam": "2"
    }
  }
]

You will probably need to add some information to your JSON file about which fields each transformer operates on, plus any other parameters, but that depends on your use case.
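For instance, extending the JSON above with a hypothetical "fields" key (the key name and column names are illustrative), each entry could name the input columns it applies to:

```json
[
  {
    "transformer": "AliasTransformer",
    "fields": ["description"],
    "params": {
      "outputFeatureName": "test"
    }
  }
]
```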

tillbe commented 4 years ago

You only need ANTLR/FastParse if you want more complicated syntax in your custom feature schema - JSON will simplify this a lot!