ipums / hlink

Hierarchical record linkage at scale
Mozilla Public License 2.0

Add the LightGBM ML library #162

Open riley-harper opened 3 hours ago

riley-harper commented 3 hours ago

In addition to XGBoost (#161), we would also like to add support for LightGBM. This should work similarly to XGBoost, since we'd also like to make LightGBM opt-in. From the documentation, it sounds like we'll need the SynapseML package to be able to run LightGBM on Spark.

To Do List

riley-harper commented 3 hours ago

We'll need to install the synapseml Python package, which you can import as synapse.ml. synapse.ml.lightgbm.LightGBMClassifier seems to be the class that we need for Spark integration. Part of the setup for SynapseML includes downloading additional Spark jars. I added a few lines to hlink.spark.session in set_conf():

         if os.path.isfile(jar_path):
             conf = conf.set("spark.jars", jar_path)
+
+        conf.set("spark.jars.packages", "com.microsoft.azure:synapseml_2.12:1.0.8")
+        conf.set("spark.jars.repositories", "https://mmlspark.azureedge.net/maven")
+
         return conf

     def local(self, cores=1, executor_memory="10G"):

At first this caused an error when I tried to create the Spark context. After searching around for a solution, I cleared out the .ivy2 and .m2 directories in my home directory, and then it ran without issues. These additional settings should probably be applied only when synapse.ml is installed, so that users who aren't using LightGBM don't have to download the extra jars.

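# Check once, at import time, whether the optional synapseml package is installed.
try:
    import synapse.ml  # noqa: F401 (imported only to test availability)
except ModuleNotFoundError:
    _synapse_ml_available = False
else:
    _synapse_ml_available = True

...

# Only request the SynapseML jars when the Python package is present.
if _synapse_ml_available:
    conf.set("spark.jars.packages", "com.microsoft.azure:synapseml_2.12:1.0.8")
    conf.set("spark.jars.repositories", "https://mmlspark.azureedge.net/maven")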
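
One possible variant of the check above, as a rough sketch rather than what hlink does: detect the package with importlib.util.find_spec instead of importing it, and collect the extra settings in a plain dict so the logic can be tested without starting Spark. The names SYNAPSEML_CONF, module_available, and extra_spark_conf are hypothetical and used only for illustration. The package coordinates and repository URL are the ones from the snippet above.

```python
import importlib.util

# Settings copied from the snippet above; the dict name is hypothetical.
SYNAPSEML_CONF = {
    "spark.jars.packages": "com.microsoft.azure:synapseml_2.12:1.0.8",
    "spark.jars.repositories": "https://mmlspark.azureedge.net/maven",
}


def module_available(name):
    """Return True if `name` can be located, without importing the module itself."""
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        # find_spec raises when a parent package (here, "synapse") is missing.
        return False


def extra_spark_conf(synapse_ml_available=None):
    """Return the extra Spark settings, or an empty dict if SynapseML is absent."""
    if synapse_ml_available is None:
        synapse_ml_available = module_available("synapse.ml")
    return dict(SYNAPSEML_CONF) if synapse_ml_available else {}
```

Inside set_conf(), the result could then be applied with a loop such as `for key, value in extra_spark_conf().items(): conf.set(key, value)`, assuming conf is the SparkConf object from the diff above.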

riley-harper commented 1 hour ago

To get feature importances in training: https://mmlspark.blob.core.windows.net/docs/1.0.8/pyspark/synapse.ml.lightgbm.html#synapse.ml.lightgbm.mixin.LightGBMModelMixin.getFeatureImportances