emnajaoua closed this issue 5 years ago
Couple questions, are you running this on Databricks or a managed cluster service? The reason I ask is you will need to make sure the auto-scaling options are disabled for this algorithm to work properly.
Also, have you tried MMLSpark version 0.17? There were improvements to the efficiency of LightGBM that might help.
@imatiach-msft any other ideas for debugging?
Hi, thank you for your reply. I am running this on Databricks. Concerning the auto-scaling option, I didn't find such an option in the LightGBM classifier here: https://mmlspark.azureedge.net/docs/pyspark/LightGBMClassifier.html
I will try again with MMLSpark version 0.17
hi @emnajaoua, sorry about the trouble you are having.
from 2 to 10 worker nodes
please make sure to disable dynamic allocation, as that is not supported yet
Sorry for the confusion, I meant auto-scaling as an option on the Databricks cluster. See info here: https://docs.databricks.com/user-guide/clusters/sizing.html
Make sure Enable Auto-Scaling is disabled.
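If it helps, here is a minimal check (a sketch, assuming an active SparkSession named spark in the notebook) to confirm that dynamic allocation is not enabled on the cluster:

# Returns "false" (the supplied default) when dynamic allocation is off
print(spark.conf.get("spark.dynamicAllocation.enabled", "false"))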
I did exactly as you recommended. Auto-scaling is disabled. Now I am working with 10 worker nodes and I have also updated MMLSpark to version 0.17. The model has been training for 45 minutes and I am still waiting for the result; it is stuck at this step. My question: is my repartitioning correct? The train data previously had 200 partitions and I decreased that to 40 partitions.
Also, are 10 worker nodes with 28 GB of memory each enough in my case?
Thank you in advance :)
@emnajaoua you have only 10 executors or tasks? It looks like you have 40 tasks. How many rows/columns do you have? What is the current lightgbm debug output in the log4j logs - has it gotten past the network init stage? If it hasn't gotten past network init then it may be stuck and time out (the driver might be waiting to get all of the workers and there may be fewer workers than it is expecting).
Lightgbm repartitions the data to the number of possible workers/tasks on the cluster, so if you have 200 partitions it will repartition the data before doing training.
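As a quick sanity check before fitting (a sketch, assuming the prepared training DataFrame is called train), you can compare the DataFrame's partition count with the cluster's default parallelism:

# Partitions of the training data vs. the cluster's default parallelism
print(train.rdd.getNumPartitions())
print(spark.sparkContext.defaultParallelism)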
@imatiach-msft since I have 10 worker nodes I assume that I have 10 executors, or is that not the case? I have 10 columns (9 features plus the features column produced by the feature_processor) and 1,502,306 rows in my train data.
Concerning the driver logs, you will find the log file attached; these are the last log lines I am getting:
19/07/08 15:32:29 INFO ContextCleaner: Cleaned accumulator 1096 (name: number of output rows)
19/07/08 15:32:29 INFO ContextCleaner: Cleaned accumulator 1061 (name: number of files read)
19/07/08 15:32:29 INFO ContextCleaner: Cleaned accumulator 1370 (name: internal.metrics.output.bytesWritten)
19/07/08 15:32:29 INFO ContextCleaner: Cleaned accumulator 1100 (name: collision rate (min, med, max))
19/07/08 15:32:29 INFO ContextCleaner: Cleaned accumulator 1308 (name: internal.metrics.input.sampledBytesRead)
19/07/08 15:32:29 INFO ContextCleaner: Cleaned accumulator 1063 (name: dynamic partition pruning time total (min, med, max))
19/07/08 15:32:29 INFO ContextCleaner: Cleaned accumulator 1032 (name: spill write time total (min, med, max))
19/07/08 15:32:29 INFO ContextCleaner: Cleaned accumulator 1079 (name: number of output rows)
19/07/08 15:32:29 INFO ContextCleaner: Cleaned accumulator 1271 (name: internal.metrics.shuffle.read.recordsRead)
19/07/08 15:32:29 INFO BlockManagerInfo: Removed broadcast_33_piece0 on 10.139.64.4:36295 in memory (size: 46.3 KB, free: 9.4 GB)
19/07/08 15:32:29 INFO ContextCleaner: Cleaned shuffle 38
19/07/08 15:32:29 INFO ContextCleaner: Cleaned accumulator 1003 (name: dynamic partition pruning time total (min, med, max))
19/07/08 15:32:29 INFO ContextCleaner: Cleaned accumulator 1176 (name: internal.metrics.shuffle.read.localBlocksFetched)
19/07/08 15:32:29 INFO ContextCleaner: Cleaned accumulator 1141 (name: internal.metrics.peakExecutionMemory)
19/07/08 15:32:29 INFO BlockManagerInfo: Removed broadcast_33_piece0 on 10.139.64.5:46621 in memory (size: 46.3 KB, free: 9.4 GB)
19/07/08 15:32:29 INFO BlockManagerInfo: Removed broadcast_33_piece0 on 10.139.64.7:34257 in memory (size: 46.3 KB, free: 9.4 GB)
19/07/08 15:32:29 INFO BlockManagerInfo: Removed broadcast_33_piece0 on 10.139.64.15:36649 in memory (size: 46.3 KB, free: 9.4 GB)
19/07/08 15:32:29 INFO BlockManagerInfo: Removed broadcast_33_piece0 on 10.139.64.9:38729 in memory (size: 46.3 KB, free: 9.4 GB)
19/07/08 15:37:26 INFO HiveMetaStore: 2: get_database: default
19/07/08 15:37:26 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_database: default
19/07/08 15:37:26 INFO DriverCorral: Metastore health check ok
19/07/08 15:37:26 INFO DriverCorral: DBFS health check ok
19/07/08 15:42:26 INFO HiveMetaStore: 2: get_database: default
19/07/08 15:42:26 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_database: default
19/07/08 15:42:26 INFO DriverCorral: Metastore health check ok
19/07/08 15:42:26 INFO DriverCorral: DBFS health check ok
19/07/08 15:47:26 INFO HiveMetaStore: 2: get_database: default
19/07/08 15:47:26 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_database: default
19/07/08 15:47:26 INFO DriverCorral: Metastore health check ok
19/07/08 15:47:26 INFO DriverCorral: DBFS health check ok
19/07/08 15:52:26 INFO HiveMetaStore: 2: get_database: default
19/07/08 15:52:26 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_database: default
19/07/08 15:52:26 INFO DriverCorral: Metastore health check ok
19/07/08 15:52:26 INFO DriverCorral: DBFS health check ok
I apologize, I don't know exactly how to attach the log file that I got from the driver logs, but I think it is not stuck at the init process. Concerning the repartitioning: do you think it is useless to repartition the train data before training the model?
@emnajaoua how many CPUs do you have per worker? Can you make sure it is set to 1? That was a bug fixed on latest master (to allow multiple CPUs per task). Are you running binary or multiclass classification and how many classes? Does each partition have at least one of each label?
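One rough way to check the last question (a sketch, assuming the label column is kategorie1 and train is the training DataFrame) is to list the distinct labels seen on each partition:

# Collect the set of distinct labels per partition to spot partitions
# that are missing one or more classes
def labels_per_partition(index, rows):
    labels = {row["kategorie1"] for row in rows}
    yield (index, sorted(labels))

print(train.rdd.mapPartitionsWithIndex(labels_per_partition).collect())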
@emnajaoua you can also try running the build from latest master which has several fixes: --packages com.microsoft.ml.spark:mmlspark_2.11:0.17.dev27 and --repositories https://mmlspark.azureedge.net/maven
You can also try using barrier execution mode as I recommended here: https://github.com/Azure/mmlspark/issues/600
you can set .setUseBarrierExecutionMode(true) in scala or useBarrierExecutionMode=True in python.
see the doc for more info on barrier execution mode, which was introduced in Spark 2.4: https://github.com/Azure/mmlspark/blob/master/docs/lightgbm.md#barrier-execution-mode
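For reference, a sketch of what that looks like in PySpark (the other constructor arguments here are only placeholders, and the exact import path can differ between MMLSpark versions):

from mmlspark import LightGBMClassifier  # import path may differ by version

lgbm = LightGBMClassifier(
    labelCol="kategorie1",
    featuresCol="features",
    objective="multiclass",
    useBarrierExecutionMode=True,  # requires Spark 2.4+
)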
@imatiach-msft Thank you for your help. I don't know exactly how many CPUs per worker I have, but I would like to know. Here is an image of my cluster configuration. I am not sure how to set 1 CPU per worker; is it a spark.conf.set property?
I am running a multiclass classification with 13 classes. To be honest, I am not sure that each partition has at least one instance of each label/class, but I would like to know how to check that, or if you have any example for it.
@emnajaoua it looks like you are using Databricks, which by default has 1 CPU per task, so I think you are good unless you explicitly set spark.task.cpus to 2 in the cluster's Spark config, e.g.:
spark.task.cpus 2
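You can verify the effective setting at runtime (a sketch, assuming an active SparkSession named spark):

# Defaults to 1 task CPU on Databricks unless overridden in the cluster config
print(spark.conf.get("spark.task.cpus", "1"))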
I will try again with useBarrierExecutionMode=True and the build from the latest master. I will let you know tomorrow how it goes :)
in my Spark config there is no specification for spark.task.cpus so I guess it is one CPU per task.
can you also do a count of the distinct classes:
you can use countDistinct as one way to find out how many distinct classes you have, but that won't tell you the distribution per partition
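For example, a simple per-class row count across the whole dataset (not per partition), assuming the label column is kategorie1:

# Number of rows per class across the whole training DataFrame
train.groupBy("kategorie1").count().orderBy("count").show(20, truncate=False)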
@imatiach-msft I ran countDistinct on the raw data (before the feature_processor) to get the number of classes, as you recommended. As I mentioned before, here are the 13 classes
@imatiach-msft as you recommended, I installed the latest mmlspark build and set useBarrierExecutionMode=True. The model is still training (it has been 3 hours at this step):
according to the logs (stdout) for one worker, it is still training
@emnajaoua if you have gotten to that point then you must be training and you have gotten past the network initialization stage, which is where you were getting the connection refused error. When you look at the workers, did any of them have status failure or success? Sometimes a worker may exit earlier, I've seen that happen when a partition only has one label, but I think that is unlikely in your scenario. Also, with 13 classes training might take longer than usual, but based on your dataset size I don't think it should take 3 hours. Are you seeing iterations progress in the debug? I think it might be easier to just discuss over skype or teams, I was able to resolve an issue like that yesterday. I can send you a teams link.
I wonder if the multiclass classifier is having issues with different numbers of labels on different partitions, since some of the labels have very few instances (eg 1 or 2). Maybe I can add a test and see if I can reproduce this issue.
@imatiach-msft concerning whether the workers have failure or success status, I stopped the training now, but if it is okay with you we can schedule a Skype meeting today. Before that I will run my notebook again so we can check the worker statuses and, depending on that, we will adjust :)
@emnajaoua here is a teams link if that works for you: https://teams.microsoft.com/l/meetup-join/19%3ameeting_NmU1NjgyODEtYjBkNS00YTg2LTk2MDYtNzdkNTc2NWUxZDU2%40thread.v2/0?context=%7b%22Tid%22%3a%2272f988bf-86f1-41af-91ab-2d7cd011db47%22%2c%22Oid%22%3a%227ac33778-88d2-407b-bfe1-d64366fff0e4%22%7d
when would you prefer to discuss?
@imatiach-msft I trained the algorithm again without repartitioning the training data. This is the error I got: org.apache.spark.scheduler.BarrierJobUnsupportedRDDChainException: [SPARK-24820][SPARK-24821]: Barrier execution mode does not allow the following pattern of RDD chain within a barrier stage:
Py4JJavaError Traceback (most recent call last)
I am available now if you like :)
@emnajaoua sure we can meet right now
debugged on call, this issue looks similar to https://github.com/Azure/mmlspark/issues/569 , the workers seem to be getting out of sync. I wonder if it is due to unbalanced classes, since for some classes there are only 1 or 2 instances.
As discussed yesterday, I will try the first solution, which is removing the classes that have fewer instances. This is the list of classes that I have:
They have more or fewer instances compared to the previous dataframe.
So the classes now have more instances than before. I used the same configuration as discussed in yesterday's meeting, but I received the connection refused error again. Here are screenshots from the worker stderr:
Could this be due to another notebook in Databricks using the same resources, or can that not be the reason? I will test again tomorrow and update you here.
@emnajaoua "due to another notebook in databricks using the same resources or this cannot be the reason" I don't think that would be the reason. The iterations are in order here though, which looks much better than when we looked at it. How many iterations did you set, and how long did you run it for? Also, did you shuffle the dataset beforehand (I believe the classes should be distributed across all partitions such that on each partition at least one instance of each class should appear, but I could be wrong)?
also if you remove classes 12, 3, 5, 7 - do you still see the connection refused error? What if you just have classes 2 and 4 and use binary classification? Also, are you using string indexer on the label column - the labels need to start from 0 to n. I'm also wondering if you could try using multiclassova objective, although it would be much slower, it's worth trying it out: https://github.com/microsoft/LightGBM/blob/master/docs/Parameters.rst#objective
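A sketch of those two suggestions combined in PySpark (column names follow the ones used earlier in this thread; the mmlspark import path may differ by version):

from pyspark.ml.feature import StringIndexer
from mmlspark import LightGBMClassifier  # import path may differ by version

# Re-index the labels so they are contiguous integers starting at 0
indexer = StringIndexer(inputCol="kategorie1", outputCol="label_indexed").fit(train)
train_indexed = indexer.transform(train)

# Try the slower one-vs-all objective instead of "multiclass"
lgbm = LightGBMClassifier(
    labelCol="label_indexed",
    featuresCol="features",
    objective="multiclassova",
)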
hi @emnajaoua I was able to reproduce the issue for multiclass classifier, I've put the test case in a branch:
https://github.com/imatiach-msft/mmlspark/commit/c2568b11ab6e4f74a7f349d84cb8358437eed2b2
I've confirmed the following:
1.) If the labels on a partition skip a value, e.g. [0 to j] inclusive and then [j+2 to k], the lightgbm multiclass classifier gets stuck.
2.) If partitions have different labels but some have fewer than others, e.g. one has 0 to k and another has 0 to k+1, the lightgbm multiclass classifier finishes.
My recommendation is to ensure that all partitions have all labels from 0 to total number of labels.
In your image above you have labels 12, 1, 3, 5, 4, 8, 7, 2 - but you are missing label 0, 6, 9, 10, 11. Can you try running StringIndexer or ValueIndexer, as I do in the multiclass lightgbm tests:
import com.microsoft.ml.spark.featurize.ValueIndexer
val labelizer = new ValueIndexer().setInputCol(labelColumnName).setOutputCol(labelColumnName).fit(tmpTrainData)
val labelizedData = labelizer.transform(tmpTrainData)
it looks like there is a stratified sampler by key in spark, I think that could be used as a workaround for now: https://spark.apache.org/docs/2.4.0/api/python/pyspark.html#pyspark.RDD.sampleByKey
Yes, the iterations look in order. I ran it for 10 minutes and then it gave me the connection error again. I didn't shuffle the dataset beforehand; I will start by trying this and will let you know afterwards. I want to try your suggestions one by one. Thank you so much for your help so far :)
I tried this solution using python:
from pyspark.ml.feature import ValueIndexer
labelizer = ValueIndexer().setInputCol('kategorie1').setOutputCol('kategorie1').fit(train)
labelizedData = labelizer.transform(train)
but I get this error: ImportError: cannot import name 'ValueIndexer', although I checked that ValueIndexer exists here: http://mmlspark.azureedge.net/docs/pyspark/ValueIndexer.html?highlight=valueindexer#module-ValueIndexer
It's under the mmlspark import namespace, not pyspark. Also you can try using StringIndexer.
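For reference, a minimal sketch of the corrected import (the exact module path may differ between MMLSpark releases, and the separate output column name is just for illustration):

from mmlspark import ValueIndexer  # comes from mmlspark, not pyspark.ml.feature

labelizer = ValueIndexer(inputCol="kategorie1", outputCol="kategorie1_indexed").fit(train)
labelizedData = labelizer.transform(train)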
So I tried shuffling the data, but it gives this error; it didn't even get to the training iterations, the error occurs right after initialization. The code I used for shuffling the data is:
from pyspark.sql.functions import rand
shuffledDF = projectYFeaturesTransformed_df_less_classes.orderBy(rand())
I tried this code with only the following classes
the code is :
labelizer = ValueIndexer(inputCol="kategorie1", outputCol="kategorie1").fit(train)
labelizedData = labelizer.transform(train)
I got this error again in the stderr of one of the workers, and then the familiar connection refused error:
@imatiach-msft I have good news. When I used stratified sampling with only the classes [1, 2, 3, 4, 5, 7, 8, 12], as you suggested previously, it worked. My model was able to train and transform, and it gave 0.67 accuracy as well. The exact data processing code that I used is the following:
projectYFeaturesTransformed_df_less_classes = projectYFeaturesTransformed_df.filter(col("kategorie1").isin([12, 1, 3, 5, 4, 8, 7, 2]))
raw_train, raw_test = spark_random_split(projectYFeaturesTransformed_df_less_classes, ratio=0.8, seed=42)
columns = [c for c in projectYFeaturesTransformed_df_less_classes.columns if c != 'kategorie1']
feature_processor = FeatureHasher(inputCols=columns, outputCol='features')
train = feature_processor.transform(raw_train)
test = feature_processor.transform(raw_test)
fractions = train.select("kategorie1").distinct().withColumn("fraction", lit(0.3)).rdd.collectAsMap()
print(fractions)
sampled_df = train.stat.sampleBy("kategorie1", fractions, seed=12)
Now I need to use all the classes, not just the ones that have more instances!
@imatiach-msft I am afraid I couldn't enjoy the success for long, because when I tried with more data, using the same process, I received the connection error again. This is the reason shown in the stderr:
@emnajaoua I think the problem may be that some labels have only 1-2 instances, which is not enough across 40 partitions. We need to have at least 1 instance per partition. sampleBy is not with replacement, so the learner still gets stuck.
@emnajaoua sorry maybe we can debug on teams again? '[1, 2, 3, 4, 5, 7, 8, 12]' shouldn't work, the labels need to be ordered from 0 to n across all partitions. Did you use StringIndexer or ValueIndexer? I don't see that in the code above.
@emnajaoua We can use the same link if that works for you:
when would you prefer to discuss?
@emnajaoua I think the problem may be that some labels have only 1-2 instances, which is not enough across 40 partitions. We need to have at least 1 instance per partition. sampleBy is not with replacement, so the learner still gets stuck.
I don't think that this is the problem, because I am still training my model with just the classes mentioned before. I used ValueIndexer, as I mentioned before in the code :) The idea of using ValueIndexer didn't work; what worked is applying the stratified sampler. Please look at the code above. Give me some time and I will call you on teams so I can show you the progress.
" shuffledDF = projectYFeaturesTransformed_df_less_classes.orderBy(rand())" that should work, I'm not sure why that is running out of memory
@emnajaoua I think you need to use both ValueIndexer and stratified sampler. That was the only thing that worked for me consistently.
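A rough sketch of that combination (the mmlspark import path may differ by version, and the per-class fraction is only a placeholder; see the discussion of fraction values further below):

from mmlspark import ValueIndexer  # import path may differ by version

# 1) Map the raw labels to contiguous indices 0..n-1
labelizer = ValueIndexer(inputCol="kategorie1", outputCol="label_indexed").fit(train)
indexed = labelizer.transform(train)

# 2) Stratified sample on the indexed label so every class stays represented
fractions = {row["label_indexed"]: 0.3  # placeholder fraction per class
             for row in indexed.select("label_indexed").distinct().collect()}
sampled_train = indexed.stat.sampleBy("label_indexed", fractions, seed=42)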
" shuffledDF = projectYFeaturesTransformed_df_less_classes.orderBy(rand())" that should work, I'm not sure why that is running out of memory
this was the first thing that I tried today, but it didn't work unfortunately
@emnajaoua I think you need to use both ValueIndexer and stratified sampler. That was the only thing that worked for me consistently.
I will try the combination of both then, the ValueIndexer and then the stratified sampler, and I will let you know the result; but for now the stratified sampler works! The thing is that I was always working with a part of my dataset (select where date is between date1 and date2), but when I did that select, model.fit(train) didn't work!
In fact, I am confused about the correct value that I should use in the stratified sampling here: fractions = train.select("kategorie1").distinct().withColumn("fraction", lit(0.3)).rdd.collectAsMap() Is there any formula for that?
if you get the distinct values, eg 0 through 9 inclusive, you can then decide the fraction for each, eg 1/n, which would be 0.1 for each value in this case. When you have some labels with only one or two rows using 1/n is probably unreasonable. I agree that it's difficult to figure out, and it partly depends on the use case - do you want those classes with only one or two rows to be very accurate? In that case you can oversample more and increase the fraction in order to get better accuracy for those labels, which may be at the cost of accuracy for other labels.
I think using 1/n is a good starting point though, and you can iterate from there. Using 0.3 won't work above because the total won't add up to 1, eg if you have 10 labels the total will be 0.3 * 10 = 3, but the sum needs to be 1.
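A small sketch of that 1/n starting point (assuming the label column is kategorie1 and train is the training DataFrame):

# One equal fraction per distinct class as a starting point
classes = [row["kategorie1"] for row in train.select("kategorie1").distinct().collect()]
fractions = {c: 1.0 / len(classes) for c in classes}
sampled_df = train.stat.sampleBy("kategorie1", fractions, seed=12)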
@emnajaoua I added a stratified repartition stage in my branch, will try to get a build to you by Monday with a combination of that transform and early termination in lightgbm if labels are incorrect on one of the partitions (instead of getting stuck): https://github.com/imatiach-msft/mmlspark/commits/ilmat/lgbm-multiclass-stuck
I validated that using that transformer fixed the issue for me locally.
I am trying to run this example with my own dataset on Databricks: https://github.com/microsoft/recommenders/blob/master/notebooks/02_model/mmlspark_lightgbm_criteo.ipynb My cluster configuration is from 2 to 10 worker nodes; the worker type is 28 GB memory, 8 cores. At the beginning of my notebook I set the following properties:
spark.conf.set("spark.executor.memory", "80g")
spark.conf.set("spark.driver.maxResultSize", "6g")
but it seems that they don't affect the notebook environment. For the LightGBMClassifier I am using the library Azure:mmlspark:0.16. My dataset has 1,502,306 rows and 9 columns. It is a Spark DataFrame, the result of 3 joins between 3 SQL tables (converted to Spark DataFrames with spark.sql()). I apply a feature_processor step to encode the categorical columns, and then, after setting the LightGBMClassifier parameters, I train the model. My LightGBMClassifier parameters are:
NUM_LEAVES = 8
NUM_ITERATIONS = 20
LEARNING_RATE = 0.1
FEATURE_FRACTION = 0.8
EARLY_STOPPING_ROUND = 5
# Model name
MODEL_NAME = 'lightgbm_criteo.mml'
lgbm = LightGBMClassifier(
    labelCol="kategorie1",
    featuresCol="features",
    objective="multiclass",
    isUnbalance=True,
    boostingType="gbdt",
    boostFromAverage=True,
    baggingSeed=3,  # previously 42
    numLeaves=NUM_LEAVES,
    numIterations=NUM_ITERATIONS,
    learningRate=LEARNING_RATE,
    featureFraction=FEATURE_FRACTION,
    earlyStoppingRound=EARLY_STOPPING_ROUND,
    timeout=1200.0,
    parallelism='data_parallel'
)
I applied the repartition trick as well before training the model:
train = train.repartition(50)
train.rdd.getNumPartitions()
Then when I run
model = lgbm.fit(train)
I get the following error:
Py4JJavaError: An error occurred while calling o1125.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 10 in stage 36.0 failed 4 times, most recent failure: Lost task 10.3 in stage 36.0 (TID 3493, 10.139.64.10, executor 7): java.net.ConnectException: Connection refused (Connection refused)
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at java.net.Socket.connect(Socket.java:538)
at java.net.Socket.<init>(Socket.java:434)
at java.net.Socket.<init>(Socket.java:211)
at com.microsoft.ml.spark.TrainUtils$.getNodes(TrainUtils.scala:178)
at com.microsoft.ml.spark.TrainUtils$$anonfun$5.apply(TrainUtils.scala:211)
at com.microsoft.ml.spark.TrainUtils$$anonfun$5.apply(TrainUtils.scala:205)
at com.microsoft.ml.spark.StreamUtilities$.using(StreamUtilities.scala:29)
at com.microsoft.ml.spark.TrainUtils$.trainLightGBM(TrainUtils.scala:204)
at com.microsoft.ml.spark.LightGBMClassifier$$anonfun$3.apply(LightGBMClassifier.scala:83)
at com.microsoft.ml.spark.LightGBMClassifier$$anonfun$3.apply(LightGBMClassifier.scala:83)
at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$5.apply(objects.scala:200)
at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$5.apply(objects.scala:197)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:852)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:852)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:340)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:304)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:340)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:304)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:139)
at org.apache.spark.scheduler.Task.run(Task.scala:112)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1481)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:503)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2355)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2343)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2342)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2342)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:1096)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:1096)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1096)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2574)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2522)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2510)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:893)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2240)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2338)
at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1051)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:379)
at org.apache.spark.rdd.RDD.reduce(RDD.scala:1033)
at org.apache.spark.sql.Dataset$$anonfun$reduce$1.apply(Dataset.scala:1650)
at org.apache.spark.sql.Dataset$$anonfun$withNewRDDExecutionId$1.apply(Dataset.scala:3409)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withCustomExecutionEnv$1.apply(SQLExecution.scala:99)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:228)
at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:85)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:158)
at org.apache.spark.sql.Dataset.withNewRDDExecutionId(Dataset.scala:3405)
at org.apache.spark.sql.Dataset.reduce(Dataset.scala:1649)
at com.microsoft.ml.spark.LightGBMClassifier.train(LightGBMClassifier.scala:85)
at com.microsoft.ml.spark.LightGBMClassifier.train(LightGBMClassifier.scala:27)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:118)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
at py4j.Gateway.invoke(Gateway.java:295)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:251)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.ConnectException: Connection refused (Connection refused)
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at java.net.Socket.connect(Socket.java:538)
at java.net.Socket.<init>(Socket.java:434)
at java.net.Socket.<init>(Socket.java:211)
at com.microsoft.ml.spark.TrainUtils$.getNodes(TrainUtils.scala:178)
at com.microsoft.ml.spark.TrainUtils$$anonfun$5.apply(TrainUtils.scala:211)
at com.microsoft.ml.spark.TrainUtils$$anonfun$5.apply(TrainUtils.scala:205)
at com.microsoft.ml.spark.StreamUtilities$.using(StreamUtilities.scala:29)
at com.microsoft.ml.spark.TrainUtils$.trainLightGBM(TrainUtils.scala:204)
at com.microsoft.ml.spark.LightGBMClassifier$$anonfun$3.apply(LightGBMClassifier.scala:83)
at com.microsoft.ml.spark.LightGBMClassifier$$anonfun$3.apply(LightGBMClassifier.scala:83)
at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$5.apply(objects.scala:200)
at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$5.apply(objects.scala:197)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:852)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:852)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:340)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:304)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:340)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:304)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:139)
at org.apache.spark.scheduler.Task.run(Task.scala:112)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1481)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:503)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more
I really want to understand the reason behind this error and try the suggestions that you offer. I have been stuck on this problem for 2 weeks. I have read about many similar errors and implemented some of the suggestions, like increasing the cluster memory, configuring spark.executor.memory, and repartitioning the data, but I still cannot train the LightGBMClassifier on my input data.