linkedin / photon-ml

A scalable machine learning library on Apache Spark

Configuration for nested random effects #347

Closed mogismog closed 6 years ago

mogismog commented 6 years ago

Hi all,

First, thanks a lot for the great framework! It's been a fun experience learning it and enabling mixed effects model fitting at scale.

I had a question about setting up the configurations correctly for nested random effects. We have some simulated data (150 samples) with a response (response) across five locations (location), 10 different treatments (tx), and three randomized blocks (block), where the blocks represent something different at each location (e.g. block 1 at location 1 does not mean the same thing as block 1 at location 2). We're interested in a random effect of block nested within location, with a fixed effect term for treatment.

In lme4, I would fit a nested model with something like the following formula:

response ~ 1 + treatment + (1|location) + (1|location:block)
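
(For reference, here's roughly how I'd express the same nested specification in Python's statsmodels, one of the other frameworks we compare against below; this is just a sketch assuming the simulated data sits in a pandas DataFrame with the columns named above and a hypothetical file path:)

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("simulated_data.csv")  # hypothetical path to the simulated data

# Fixed effect for treatment, random intercept per location, and a variance
# component for block nested within location (the statsmodels analogue of
# lme4's (1|location) + (1|location:block)):
model = smf.mixedlm(
    "response ~ C(treatment)",
    data=df,
    groups="location",
    re_formula="1",
    vc_formula={"block": "0 + C(block)"},
)
result = model.fit()
print(result.summary())
print(result.random_effects)  # per-location intercepts plus block components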

In photon-ml, we're fitting the model with the following coordinate and feature shard configurations:

--feature-shard-configurations=intercept=false,feature.bags=treatmentFeatures,name=treatmentShard
--feature-shard-configurations=intercept=true,name=dummyShard
--coordinate-configurations=reg.alpha=0.0,optimizer=LBFGS,feature.shard=dummyShard,active.data.bound=100000000,min.partitions=10,tolerance=1e-3,max.iter=10,regularization=NONE,passive.data.bound=1,random.effect.type=location,reg.weights=0.0,features.to.sample.ratio=1.0,name=per_location
--coordinate-configurations=reg.alpha=0.0,optimizer=LBFGS,feature.shard=dummyShard,active.data.bound=100000000,min.partitions=10,tolerance=1e-3,max.iter=10,regularization=NONE,passive.data.bound=1,random.effect.type=location_block,reg.weights=0.0,features.to.sample.ratio=1.0,name=per_location_block
--coordinate-configurations=reg.alpha=0.0,optimizer=LBFGS,feature.shard=treatmentShard,min.partitions=10,tolerance=1e-3,max.iter=10,regularization=NONE,down.sampling.rate=1.0,reg.weights=0.0,name=fixed_treatment
--coordinate-update-sequence=fixed_treatment,per_location,per_location_block

Here, treatmentFeatures is a feature column of the format [[name, value, term]], e.g. [[treatment, 1.0, 1]] is an indicator feature representing treatment 1, [[treatment, 1.0, 2]] represents treatment 2, and so forth. location_block is a column that is simply a concatenation of the location and block columns.
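
(For completeness, a sketch of how we build that column in PySpark; the read path is hypothetical, and the underscore separator matches the modelId values shown below:)

from pyspark.sql import SparkSession
from pyspark.sql.functions import concat_ws, col

spark = SparkSession.builder.getOrCreate()

# Requires the spark-avro package on the classpath; the path is hypothetical.
df = spark.read.format("avro").load("simulated-data.avro")

# Concatenate location and block into the nested grouping key,
# e.g. location 3 + block 1 -> "3_1":
df = df.withColumn(
    "location_block",
    concat_ws("_", col("location").cast("string"), col("block").cast("string")),
)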

As you'd imagine with a small, simulated, balanced dataset, the training process doesn't take much time and isn't affected by optimizer type. Since we're looking for results parity (and a way to help inform how to set up the configurations in Photon-ML for different models), we decided to compare the results from Photon-ML, lme4, and SAS, with the understanding that some differences in results are likely due to different optimization strategies, implementations, etc., and that it isn't a matter of lme4/SAS being "right" and Photon-ML being "wrong" (we're ok with that!). The fixed effects terms matched up perfectly (which makes sense)! The random intercept on location matches up almost perfectly (~3% relative difference), which is fantastic! However, the term involving blocks nested within location is way off. For context, here is the output from lme4 (which matches the SAS output):

> ranef(model)
$`location:block`
    (Intercept)
1:1   1.6342373
1:2  -2.8263608
1:3   2.3015123
2:1   2.5182206
2:2  -3.5003984
2:3   0.7428217
3:1  -1.6084114
3:2   1.0149569
3:3   1.8039981
4:1   3.9875396
4:2  -4.3290318
4:3  -0.2138923
5:1   3.2088730
5:2  -2.8751020
5:3  -1.8589628

And the means from Photon-ML:

+-------+-------------------------------------+
|modelId|means                                |
+-------+-------------------------------------+
|3_1    |[[(INTERCEPT),,-4.4342528449387855]] |
|3_2    |[[(INTERCEPT),,1.3476092917129108]]  |
|3_3    |[[(INTERCEPT),,3.0866435532258887]]  |
|2_1    |[[(INTERCEPT),,5.725963814862967]]   |
|2_2    |[[(INTERCEPT),,-7.5389767908168395]] |
|2_3    |[[(INTERCEPT),,1.8130129759538705]]  |
|1_1    |[[(INTERCEPT),,2.786807928159726]]   |
|1_2    |[[(INTERCEPT),,-7.0442793268065715]] |
|1_3    |[[(INTERCEPT),,4.257471398646831]]   |
|5_1    |[[(INTERCEPT),,8.192804618365649]]   |
|5_2    |[[(INTERCEPT),,-5.216179562752159]]  |
|5_3    |[[(INTERCEPT),,-2.976625055613559]]  |
|4_1    |[[(INTERCEPT),,9.19649340544773]]    |
|4_2    |[[(INTERCEPT),,-9.133097642328464]]  |
|4_3    |[[(INTERCEPT),,-0.06339576311926862]]|
+-------+-------------------------------------+

These values appear to be the deviations of the location_block group means from the mean of response, which isn't quite what I'm looking for. This leads me to believe I am setting up something in the coordinate configuration and/or feature shard configuration incorrectly.
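
(A quick way to check this claim with pandas; a sketch, again with a hypothetical path:)

import pandas as pd

df = pd.read_csv("simulated_data.csv")  # hypothetical path to the simulated data

# Deviation of each location_block group mean from the overall response mean:
grand_mean = df["response"].mean()
group_means = df.groupby(["location", "block"])["response"].mean()
print(group_means - grand_mean)  # compare against the Photon-ML means above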

Some other things I've tried without success include:

Any sort of guidance would be greatly appreciated. For what it's worth, I'll admit I may be totally misunderstanding either mixed effects models or how Photon-ML works, and if that's the case I'll totally own up to it and close this ticket. I'll also be happy to provide the avro file of the simulated data if that will make things easier.

Thanks a lot!

ashelkovnykov commented 6 years ago

Hello Francisco,

Thanks for the detailed explanation, and we're glad you're trying out Photon.

Your configurations look fine. My current hypothesis is that this is being caused by an implementation difference between Photon and lme4. However, I'm not familiar with the inner workings of lme4, so this hypothesis should be investigated. Would you mind sharing the Avro file of the simulated data so that we can take a look?

Unrelated questions: I see that you're using the command line interface for Photon ML. Just out of curiosity, have you tried using the API as well? Is there a particular reason you chose to use the command line interface (e.g. data already Avro formatted, didn't want to write your own driver class, etc.)?

Thanks for the question, looking forward to hearing back from you.

mogismog commented 6 years ago

Hey Alex,

Thanks a lot for the response!

> Your configurations look fine. My current hypothesis is that this is being caused by an implementation difference between Photon and lme4. However, I'm not familiar with the inner workings of lme4, so this hypothesis should be investigated. Would you mind sharing the Avro file of the simulated data so that we can take a look?

I've attached the avro file to this comment (I went ahead and zipped it up simply because GitHub doesn't allow attaching avro files). I would agree that lme4 is probably doing something behind the scenes (and I'll admit that I'm not very familiar with its inner workings either). What's confusing is that lme4 and three other implementations (ASReml, SAS, and Python's statsmodels) all provide the same results, though of course they may all share the same general philosophical approach to fitting mixed models.

> Unrelated questions: I see that you're using the command line interface for Photon ML. Just out of curiosity, have you tried using the API as well? Is there a particular reason you chose to use the command line interface (e.g. data already Avro formatted, didn't want to write your own driver class, etc.)?

So, we've written Python wrappers that call the Photon-ML driver so we can use it from within PySpark/Jupyter notebooks. I used the CLI syntax here to avoid confusion, but FWIW I trained the model described above using both our Python wrappers and the CLI and got back the same fitted coefficients.

As another data point, I fit the following model with a fixed term and a crossed random term across multiple frameworks (knowing that it makes zero sense in the context of the provided data):

response ~ 0 + treatment + (1|location:block)

And got back roughly the same fitted coefficients as Photon-ML (with the largest absolute difference in the crossed random term being about 1.25, which is really great IMO). This leads me to believe that I'm doing something wrong in setting up the nested random effects term.
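
(In statsmodels terms, the crossed specification I fit looks roughly like the following; again, just a sketch under the same DataFrame assumption as before:)

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("simulated_data.csv")  # hypothetical path
df["location_block"] = df["location"].astype(str) + "_" + df["block"].astype(str)

# No grand mean, fixed treatment effect, random intercept per location:block combo:
model = smf.mixedlm(
    "response ~ 0 + C(treatment)",
    data=df,
    groups="location_block",
    re_formula="1",
)
print(model.fit().summary())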

Let me know if I've forgotten any other information that may be helpful!

simulated-data.avro.zip

ashelkovnykov commented 6 years ago

Hello Francisco,

I haven't forgotten you, I've just been very busy. I'd still like to help, though I won't have any spare cycles for a while yet. In the meantime, I thought I could ask you to run two experiments to test a hypothesis.

Could you please run the per-location and per-location-block coordinates on their own? That is to say, one run with these settings:

--feature-shard-configurations=intercept=true,name=dummyShard
--coordinate-configurations=reg.alpha=0.0,optimizer=LBFGS,feature.shard=dummyShard,active.data.bound=100000000,min.partitions=10,tolerance=1e-3,max.iter=10,regularization=NONE,passive.data.bound=1,random.effect.type=location,reg.weights=0.0,features.to.sample.ratio=1.0,name=per_location
--coordinate-update-sequence=per_location

And another with these:

--feature-shard-configurations=intercept=true,name=dummyShard
--coordinate-configurations=reg.alpha=0.0,optimizer=LBFGS,feature.shard=dummyShard,active.data.bound=100000000,min.partitions=10,tolerance=1e-3,max.iter=10,regularization=NONE,passive.data.bound=1,random.effect.type=location_block,reg.weights=0.0,features.to.sample.ratio=1.0,name=per_location_block
--coordinate-update-sequence=per_location_block

Please let me know what the results are when you have time.

mogismog commented 6 years ago

Hey Alex,

Not a big deal about the delayed response. I know time is very valuable and often scarce, so I totally understand juggling other higher priority tasks. Any time you're able to spend on this is greatly appreciated!

I trained models using the two separate configurations you provided; here are the results:

per_location means:

+-------+-----------------------------------+
|modelId|means                              |
+-------+-----------------------------------+
|1      |[[(INTERCEPT),,236.46013110849765]]|
|2      |[[(INTERCEPT),,209.7350284162282]] |
|3      |[[(INTERCEPT),,238.46449089473782]]|
|4      |[[(INTERCEPT),,203.47299181026435]]|
|5      |[[(INTERCEPT),,184.25646191709305]]|
+-------+-----------------------------------+

per_location_block means:

+-------+-----------------------------------+
|modelId|means                              |
+-------+-----------------------------------+
|1_1    |[[(INTERCEPT),,239.2469390366577]] |
|1_2    |[[(INTERCEPT),,229.4158517816914]] |
|1_3    |[[(INTERCEPT),,240.71760250714456]]|
|2_1    |[[(INTERCEPT),,215.46099223109124]]|
|2_2    |[[(INTERCEPT),,202.1960516254116]] |
|2_3    |[[(INTERCEPT),,211.54804139218209]]|
|3_1    |[[(INTERCEPT),,234.03023804979912]]|
|3_2    |[[(INTERCEPT),,239.8121001864508]] |
|3_3    |[[(INTERCEPT),,241.55113444796376]]|
|4_1    |[[(INTERCEPT),,212.66948521571217]]|
|4_2    |[[(INTERCEPT),,194.33989416793602]]|
|4_3    |[[(INTERCEPT),,203.4095960471454]] |
|5_1    |[[(INTERCEPT),,192.4492665354589]] |
|5_2    |[[(INTERCEPT),,179.0402823543409]] |
|5_3    |[[(INTERCEPT),,181.2798368614795]] |
+-------+-----------------------------------+

Both of these match up incredibly well with other frameworks when using the same model structure of an intercept-only random effects term and no grand mean (e.g. response ~ 0 + (1|location) in lme4 notation).

Interestingly enough, when I include a constant fixed effects feature (e.g. a grand mean) in the Photon-ML configuration for the above two models, the per_location and per_location_block means are off from the output of other frameworks by constant relative factors of 1.026 and 1.0747, respectively. This makes me wonder whether the "shrinkage" factor, which is a function of the variance components of each term, isn't being applied. Alternatively, I may not be utilizing the variance information from the model output correctly.
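
(For clarity, the shrinkage I have in mind is the textbook BLUP factor for a balanced one-way random-intercept model, where each group's raw mean deviation gets scaled down as a function of the variance components; a sketch with made-up numbers:)

# Textbook BLUP shrinkage for a balanced one-way random-intercept model
# (illustrative only, not Photon-ML internals): each group's raw mean
# deviation is multiplied by sigma2_u / (sigma2_u + sigma2_e / n_j).

def blup_shrinkage(sigma2_u: float, sigma2_e: float, n_j: int) -> float:
    return sigma2_u / (sigma2_u + sigma2_e / n_j)

# Hypothetical variance components and per-group sample size:
print(blup_shrinkage(sigma2_u=4.0, sigma2_e=10.0, n_j=10))  # 0.8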

Let me know if there's anything else I can provide! Thanks again!

ashelkovnykov commented 6 years ago

Hello Francisco,

Great, the means matching up with the other frameworks when running the above configurations leads me to believe that my hypothesis (which follows) is correct. It appears to me that the other frameworks train the fixed and random effect components entirely independently of each other. The GAME algorithm in Photon ML, on the other hand, does not: after training the first component of the model (following the update sequence; in this case, the fixed effect model), GAME computes the residual for each sample and adds it to the offset when training the next component of the model. It does the same thing for the third component, the fourth, etc.
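
To make the difference concrete, here is a minimal Python sketch (illustrative pseudocode, not Photon ML's actual Scala internals) of how GAME threads residuals through the update sequence; the fit/predict callables are hypothetical stand-ins for the per-coordinate solvers:

import numpy as np

def game_style_fit(coordinates, X, y, n_sweeps=3):
    # coordinates: list of (name, fit_fn, predict_fn) tuples, ordered by the
    # update sequence; fit_fn trains against y given an offset, predict_fn
    # scores every sample with the trained coordinate model.
    offsets = np.zeros_like(y)
    scores = {name: np.zeros_like(y) for name, _, _ in coordinates}
    for _ in range(n_sweeps):
        for name, fit_fn, predict_fn in coordinates:
            # Remove this coordinate's own contribution from the offset...
            partial_offset = offsets - scores[name]
            # ...train it on what the other coordinates haven't explained...
            model = fit_fn(X, y, offset=partial_offset)
            scores[name] = predict_fn(model, X)
            # ...and fold its new scores back into the shared offset.
            offsets = partial_offset + scores[name]
    return offsets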

This is not any sort of configuration error on your part, merely a difference in the algorithms between mixed effect implementations. I'll link some relevant lines in the code below:

Unrelated: I wouldn't recommend putting too much stock in the variances we were providing prior to PR #349. Before that PR, we were using the inverse of the Hessian diagonal to approximate the variances, and depending on the scale of the results, those approximations could be quite poor indeed!