Closed: m-kru closed this issue 2 years ago
There is no limit on the dataset size. By default, tangram splits the dataset 80/20 for training and testing. Then, it splits the training data 90/10 for training and model comparison, often called "validation". Are you observing a bug?
Edit: this is incorrect, see the correct description below.
@nitsky My .csv has 7377 data entries (rows), and 0.8 * 7377 = 5901.6. This does not agree with what tangram reports. Also 0.9 * 7377 = 6639.3. However, 0.9 * 6640 = 5976 and 0.8 * 6640 = 5312. How is 5165 calculated?
ping @nitsky
hi @m-kru sorry for the delay in responding and thank you for bringing this inconsistency to our attention! What our code is currently doing is calculating the test dataset size as 0.2 * 7377 = 1475.4. Then, the rest of the rows are used for training (6640 - 1475 = 5165). I think the expectation is that it should be taking 0.2 * 6640 to get 1328 rows used for testing and 5312 rows for training. At the very least, we need to update the UI in the app to show you that your total dataset contains 7377 rows, of which 737 are used for model comparison, 5312 are used for training, and 1328 are used for testing. I'm working on a fix now.
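The arithmetic in that comment can be sketched in code. This is a hypothetical illustration of the two split calculations, not Tangram's actual source; in particular, the floor rounding (integer division) is an assumption on my part:

```python
# Hypothetical sketch of the split arithmetic described above; the exact
# rounding Tangram uses is an assumption, not taken from its source code.

def current_split(n_total):
    """Current behavior: the test count is 20% of the TOTAL row count,
    but it is subtracted from the rows left after model comparison."""
    n_comparison = n_total // 10           # 10% held out for model comparison
    n_remaining = n_total - n_comparison   # 7377 -> 6640
    n_test = n_total * 2 // 10             # 20% of the TOTAL: 1475
    n_train = n_remaining - n_test         # 5165, the number shown in the UI
    return n_train, n_comparison, n_test

def expected_split(n_total):
    """Expected behavior: take 20% of the rows remaining after comparison."""
    n_comparison = n_total // 10           # 737
    n_remaining = n_total - n_comparison   # 6640
    n_test = n_remaining * 2 // 10         # 1328
    n_train = n_remaining - n_test         # 5312
    return n_train, n_comparison, n_test

print(current_split(7377))   # (5165, 737, 1475)
print(expected_split(7377))  # (5312, 737, 1328)
```

Both variants account for all 7377 rows; the difference is only whether the 20% test fraction is taken from the total or from the post-comparison remainder.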
@isabella can you let me know when it is fixed?
@m-kru do you mean when the ui will be updated or do you mean if/when we update the way we compute splits?
Hmm, I do not know. Is it a bug only in the UI, or is it some kind of bug in the split calculation?
It's just a bug in the UI because, as you noted, the total rows don't add up! We are just forgetting to mention that some of the rows were used for comparing the models.
What is meant by "used for testing"? Does it mean that tangram does not read the last X rows and leaves them for the user? In other words, what is meant by "test"? Is it some tangram train internal test, or by "test" do you mean user tests run with tangram predict?
"used for testing" means that Tangram sets aside those rows and uses them to compute the "test metrics" of your model. It does not use those rows to train your machine learning model. In order to accurately evaluate your machine learning model, it needs to be tested on a dataset that was NOT used to train the model. This is the "test" dataset.
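As a generic illustration of the holdout idea described above (this is not Tangram's code, and the sequential split is a simplification; real tools typically shuffle first):

```python
# Generic holdout illustration, not Tangram's implementation.
# The model is fitted only on `train`; `test` is set aside for computing
# test metrics afterwards, so they estimate performance on unseen data.

def holdout_split(rows, test_fraction=0.2):
    n_test = int(len(rows) * test_fraction)
    # Hold the last n_test rows out for testing. Real tools usually shuffle
    # before splitting; kept sequential here for clarity.
    return rows[:len(rows) - n_test], rows[len(rows) - n_test:]

train, test = holdout_split(list(range(10)))
print(train)  # [0, 1, 2, 3, 4, 5, 6, 7]
print(test)   # [8, 9]
```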
I thought that the rows used for model comparison and the ones used for "test metrics" are the same rows. So, is the following drawing correct?
----------------------------------------------------------------------------------------
|                                       Dataset                                        |
----------------------------------------------------------------------------------------
|                     Data for train (80%)                      | Data for tests (20%) |
----------------------------------------------------------------------------------------
| Actual data for train (90%) | Data for model comparison (10%) |
-----------------------------------------------------------------
Is "Data for tests" used only with a single model (the best one)?
I think we misspoke in an earlier comment, this is how the splits are currently done:
---------------------------------------------------------------------------------------
|                                        Dataset                                      |
---------------------------------------------------------------------------------------
|               Training (70%)                | Model Comparison (10%) |  Test (20%)  |
---------------------------------------------------------------------------------------
| Actually Train (90%) | Early Stopping (10%) |
-----------------------------------------------
Let's say we train 10 models: we hold out the "Model Comparison" data and evaluate the 10 models on it to determine "the best" model. The best model is then evaluated on the unseen held-out "Test" dataset. Finally, we use a 90/10 split on the training dataset to hold out an "early stopping" dataset to determine whether we should stop training, e.g. if the model stops improving on this dataset then we shouldn't keep training more epochs (in the case of a linear model) or adding more trees (in the case of a GBDT). This prevents the model from overfitting.
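The scheme just described can be summarized numerically. This is a sketch with assumed floor rounding, not Tangram's actual code, but the numbers it produces for a 7377-row dataset are consistent with the 5165 training rows quoted earlier in this thread:

```python
# Sketch of the split scheme described above (70/10/20 with a further 90/10
# early-stopping split of the training portion). Floor rounding is assumed.

def describe_splits(n_total):
    n_comparison = n_total // 10               # 10% for model comparison
    n_test = n_total * 2 // 10                 # 20% for the test set
    n_train = n_total - n_comparison - n_test  # ~70% for training
    n_early_stop = n_train // 10               # 10% of training, early stopping
    n_fit = n_train - n_early_stop             # rows actually fitted on
    return {
        "comparison": n_comparison,
        "test": n_test,
        "train": n_train,
        "early_stop": n_early_stop,
        "fit": n_fit,
    }

print(describe_splits(7377))
# {'comparison': 737, 'test': 1475, 'train': 5165,
#  'early_stop': 516, 'fit': 4649}
```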
Thanks, finally a precise description of the split sizes. Maybe a similar picture with percentages and actual values should be presented in the GUI?
It seems this question has been answered, so I am going to close this issue.
Is there any limit on the maximum training dataset size? I feed tangram a .csv file having almost 7800 lines of valid data, but only 6640 (including tests) are used by tangram for training.