neurodata / ProgLearn

NeuroData's package for exploring and using progressive learning algorithms
https://proglearn.neurodata.io

Explore the induced bias (interpolation & extrapolation) phenomenon in different machine learning models. #67

Open jdey4 opened 4 years ago

jdey4 commented 4 years ago

Create XOR simulation data in a way similar to this experiment: https://github.com/neurodata/progressive-learning/tree/master/experiments/xor_nxor_exp. Let a = 0.5; sample four spherically symmetric Gaussians centered at (a,a), (-a,a), (a,-a), and (-a,-a). Let sigma be small enough that essentially all of the mass lies within (-1,1)^2. Fit our various non-obviously-parametric classifiers: kNN, SVM with an RBF kernel, RF, XGBoost, over-parameterized deep nets, etc. Then plot the posteriors in (-2,2)^2.
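A minimal sketch of that setup, assuming scikit-learn is available (the sampler name and defaults below are illustrative, not the actual ProgLearn API):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def sample_xor(n, a=0.5, sigma=0.1, rng=None):
    """Sample n points from 4 spherically symmetric Gaussians; XOR labels."""
    rng = np.random.default_rng(rng)
    centers = np.array([[a, a], [-a, a], [a, -a], [-a, -a]])
    labels = np.array([0, 1, 1, 0])  # opposite quadrants share a class
    idx = rng.integers(0, 4, size=n)
    X = centers[idx] + sigma * rng.standard_normal((n, 2))
    return X, labels[idx]

X, y = sample_xor(1000, rng=0)
clf = KNeighborsClassifier(n_neighbors=15).fit(X, y)

# Evaluate the estimated posterior on a grid spanning the expanded (-2, 2)^2.
xx, yy = np.meshgrid(np.linspace(-2, 2, 100), np.linspace(-2, 2, 100))
grid = np.column_stack([xx.ravel(), yy.ravel()])
posterior = clf.predict_proba(grid)[:, 1].reshape(xx.shape)
```

The same grid can then be fed to any of the listed classifiers (SVM-RBF, RF, XGBoost, a deep net) to compare their extrapolation behavior outside the training support.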

jshinm commented 4 years ago

please add me as an assignee. Thank you!

jovo commented 3 years ago

you can sample uniformly within quadrants, rather than gaussian, if that makes things easier
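The uniform-per-quadrant variant might look like this (sketch only; `sample_xor_uniform` is a hypothetical name, not an existing function):

```python
import numpy as np

def sample_xor_uniform(n, rng=None):
    """Sample uniformly on (-1, 1)^2; label by the XOR parity of the signs."""
    rng = np.random.default_rng(rng)
    X = rng.uniform(-1, 1, size=(n, 2))
    y = (np.sign(X[:, 0]) != np.sign(X[:, 1])).astype(int)
    return X, y
```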

jshinm commented 3 years ago


Picking up on my presentation today: I modified generate_gaussian_parity() so that when samples fall outside the boundary, they are replaced with newly generated random variables (with the same parameters used to generate the initial samples). So regardless of what sigma is, the function only generates points within the boundary. With a sufficiently small sigma there aren't many points outside to begin with, so this should be fairly uniformly distributed as well. On that note, @jovo and @jdey4, is this okay or should I change something here?
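A hedged sketch of that resampling step (`generate_gaussian_parity` below is an illustrative stand-in for the modified function, not its exact signature): points outside [-bound, bound]^2 are redrawn with the same parameters until every point lies inside.

```python
import numpy as np

def generate_gaussian_parity(n, a=0.5, sigma=0.25, bound=1.0, rng=None):
    """XOR Gaussian mixture with out-of-boundary samples redrawn."""
    rng = np.random.default_rng(rng)
    centers = np.array([[a, a], [-a, a], [a, -a], [-a, -a]])
    labels = np.array([0, 1, 1, 0])
    idx = rng.integers(0, 4, size=n)
    X = centers[idx] + sigma * rng.standard_normal((n, 2))
    # Rejection step: resample any point outside the boundary, using the
    # same center and sigma, until all points lie within [-bound, bound]^2.
    out = np.any(np.abs(X) > bound, axis=1)
    while out.any():
        k = int(out.sum())
        X[out] = centers[idx[out]] + sigma * rng.standard_normal((k, 2))
        out = np.any(np.abs(X) > bound, axis=1)
    return X, labels[idx]

X, y = generate_gaussian_parity(500, rng=0)
```

As noted in the next reply, this is truncation-by-resampling of the Gaussians, not a uniform distribution.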

jovo commented 3 years ago

it won't be uniform, but it will be fine.


jshinm commented 3 years ago

[figure: KNN]

fig1: [-1, 1] simulation data
fig2: KNN trained on the [-1, 1] simulation data, predicting on the same data
fig3: [-2, 2] simulation data
fig4: predictions on the [-2, 2] simulation by the KNN trained on the [-1, 1] simulation

Before I continue, I just wanted to confirm with @jovo, @jdey4 if this is something you were expecting. I would guess this is positive for induced bias.

jovo commented 3 years ago

looks good. see if you can draw posterior probabilities, or partitions of the feature space.
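One way to get both, assuming the classifier follows the scikit-learn API (the data generation here is an illustrative stand-in): hard-label predictions on a dense grid give the partition of the feature space, and predict_proba gives the posterior surface.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Illustrative XOR training set on (-1, 1)^2.
rng = np.random.default_rng(0)
centers = np.array([[0.5, 0.5], [-0.5, 0.5], [0.5, -0.5], [-0.5, -0.5]])
idx = rng.integers(0, 4, size=500)
X = centers[idx] + 0.25 * rng.standard_normal((500, 2))
y = np.array([0, 1, 1, 0])[idx]

clf = KNeighborsClassifier(n_neighbors=15).fit(X, y)
xx, yy = np.meshgrid(np.linspace(-2, 2, 200), np.linspace(-2, 2, 200))
grid = np.column_stack([xx.ravel(), yy.ravel()])

partition = clf.predict(grid).reshape(xx.shape)              # feature-space partition
posterior = clf.predict_proba(grid)[:, 1].reshape(xx.shape)  # P(class 1 | x)
# e.g. plt.imshow(posterior, extent=(-2, 2, -2, 2), origin="lower")
```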


jshinm commented 3 years ago

[figure: KNN]

Is this something you are looking for, @jovo, @jdey4?

jovo commented 3 years ago

yah!


jdey4 commented 3 years ago

@jong, the true pdf was calculated here: https://github.com/jdey4/progressive-learning/blob/master/replaying/result/figs/true_pdf.pdf. I am going to replicate it in the current repo. @jovo, please correct me if I am wrong: the goal is to check the difference between the true and estimated pdf. To get the estimated pdf, you need to iterate over the leaves to find their bounds and posteriors.
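A sketch of that leaf walk, assuming a scikit-learn DecisionTreeClassifier rather than the ProgLearn internals (the helper name is hypothetical): recurse down the tree, shrinking the cell at each split, and read each leaf's class posterior from its counts.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def leaf_bounds_and_posteriors(clf, lo=(-1.0, -1.0), hi=(1.0, 1.0)):
    """Return (lower corner, upper corner, class posterior) for each leaf."""
    t = clf.tree_
    leaves = []

    def recurse(node, lo, hi):
        if t.children_left[node] == -1:  # leaf node
            counts = t.value[node][0]
            leaves.append((lo, hi, counts / counts.sum()))
            return
        f, thr = t.feature[node], t.threshold[node]
        left_hi = list(hi); left_hi[f] = min(hi[f], thr)    # x[f] <= thr
        right_lo = list(lo); right_lo[f] = max(lo[f], thr)  # x[f] > thr
        recurse(t.children_left[node], lo, tuple(left_hi))
        recurse(t.children_right[node], tuple(right_lo), hi)

    recurse(0, tuple(lo), tuple(hi))
    return leaves

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(400, 2))
y = (np.sign(X[:, 0]) != np.sign(X[:, 1])).astype(int)
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
leaves = leaf_bounds_and_posteriors(tree)
```

Summing each leaf posterior weighted by its cell area (or mass under the true density) then gives a piecewise-constant estimated pdf to compare against the true one.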

jovo commented 3 years ago

the goal is to look at the posterior in the expanded space using different algorithms.


jshinm commented 3 years ago

[figure: KNN]

Thank you, @jdey4, for taking the time to help me understand this today! As you suggested, I computed the actual posterior using predict_proba() and plotted the outputs as figures 5 and 6.

To clarify for @jovo:

fig2 is the KNN trained on the [-1,1] toy data predicting on the same toy data. I believe the decision boundaries you see here are the partitions of feature space you asked for. Please correct me if I am wrong.

fig4 is the same KNN trained on the [-1,1] toy data predicting on a different toy data in the range of [-2,2].

fig5 and fig6 are the actual posteriors computed by the function that outputs posterior probabilities for each class for each point (e.g. the posterior for the point at [0,0] would be [0.5, 0.5]). When plotted, idx[0] is subtracted from idx[1], so the final value for the point at [0,0] is 0 (white), hence the colorbar scale going from -1 to 1 (blue and red, respectively). Of course, these figures are generated by the same KNN trained on the [-1,1] toy data used for fig2 and fig4.
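The mapping described above, as a tiny sketch (assuming scikit-learn's predict_proba): subtracting the two class posteriors rescales [0, 1] probabilities onto [-1, 1], with an even 0.5/0.5 split mapping to 0.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Four XOR points; at the origin all four are equidistant neighbors,
# so the posterior is exactly [0.5, 0.5].
X = np.array([[0.5, 0.5], [-0.5, 0.5], [0.5, -0.5], [-0.5, -0.5]])
y = np.array([0, 1, 1, 0])
clf = KNeighborsClassifier(n_neighbors=4).fit(X, y)

proba = clf.predict_proba(np.array([[0.0, 0.0]]))  # [[P(class 0), P(class 1)]]
signed = proba[:, 1] - proba[:, 0]                 # in [-1, 1]; 0 at the origin
```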

jdey4 commented 3 years ago

@jshin13 do not subtract anything to show the posteriors. Use a divergent colormap, as here: https://github.com/jdey4/progressive-learning/blob/master/replaying/xor_nxor_pdf.py

jshinm commented 3 years ago

[figure: KNN]

@jdey4 hey Jayanta, I corrected the colormap scale. Just to clarify, the subtraction was to change the scale from [0,1] to [-1,1], not to normalize the colormap; I'm actually using a divergent colormap (RdBu_r) here. But obviously the negative scale is incorrect, so thank you for the correction!

jdey4 commented 3 years ago

@jshin13 I calculated the true distribution here. You can have a look: https://github.com/neurodata/progressive-learning/blob/master/experiments/sim_pdf/XOR_pdf.ipynb

jshinm commented 3 years ago

@jovo, after review by @jdey4 today, I am sharing the results from the spiral experiment. One thing I want to confirm with you is whether the trend I am seeing is okay: the posteriors do not expand out to the [-2, 2] range and instead form spirals within the [-1, 1] range. Please let me know at your leisure, so I can share this with Dr. Isik.

20200923_sprial_model_param.pdf