Closed: m-kru closed this issue 2 years ago
There is no limit on the dataset size. By default, tangram splits the dataset 80/20 for training and testing. Then, it splits the training data 90/10 for training and model comparison, often called "validation". Are you observing a bug?
Edit: this is incorrect, see the correct description below.
@nitsky My .csv has 7377 data entries (rows), and 0.8 * 7377 = 5901.6. This does not agree with what tangram reports. Also 0.9 * 7377 = 6639.3. However, 0.9 * 6640 = 5976 and 0.8 * 6640 = 5312. How is 5165 calculated?
ping @nitsky
hi @m-kru sorry for the delay in responding and thank you for bringing this inconsistency to our attention! What our code is currently doing is calculating the test dataset size as 0.2 * 7377 = 1475.4. Then, the rest of the rows are used for training (6640 - 1475 = 5165). I think the expectation is that it should be taking 0.2 * 6640 to get 1328 rows used for testing and 5312 rows for training. At the very least, we need to update the UI in the app to show you that your total dataset contains 7377 rows, of which 737 are used for model comparison, 5312 are used for training, and 1328 are used for testing. I'm working on a fix now.
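The arithmetic in that comment can be sketched in code. This is a hypothetical illustration of the two split calculations, not Tangram's actual source; in particular, the floor rounding (integer division) is an assumption on my part:

```python
# Hypothetical sketch of the split arithmetic described above; the exact
# rounding Tangram uses is an assumption, not taken from its source code.

def current_split(n_total):
    """Current behavior: the test count is 20% of the TOTAL row count,
    but it is subtracted from the rows left after model comparison."""
    n_comparison = n_total // 10           # 10% held out for model comparison
    n_remaining = n_total - n_comparison   # 7377 -> 6640
    n_test = n_total * 2 // 10             # 20% of the TOTAL: 1475
    n_train = n_remaining - n_test         # 5165, the number shown in the UI
    return n_train, n_comparison, n_test

def expected_split(n_total):
    """Expected behavior: take 20% of the rows remaining after comparison."""
    n_comparison = n_total // 10           # 737
    n_remaining = n_total - n_comparison   # 6640
    n_test = n_remaining * 2 // 10         # 1328
    n_train = n_remaining - n_test         # 5312
    return n_train, n_comparison, n_test

print(current_split(7377))   # (5165, 737, 1475)
print(expected_split(7377))  # (5312, 737, 1328)
```

Both variants account for all 7377 rows; the difference is only whether the 20% test fraction is taken from the total or from the post-comparison remainder.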
@isabella can you let me know when it is fixed?
@m-kru do you mean when the ui will be updated or do you mean if/when we update the way we compute splits?
Hmm, I do not know. Is it a bug only in the UI, or is it some kind of bug in the split calculation?
It's just a bug in the UI because, as you noted, the total rows don't add up! We are just forgetting to mention that some of the rows were used for comparing the models.
What is meant by "used for testing"? Does it mean that tangram does not read the last X rows and leaves them for the user? In other words, what is meant by "test"? Is it some tangram train internal test, or by "test" do you mean user tests run with tangram predict?
"used for testing" means that Tangram sets aside those rows and uses them to compute the "test metrics" of your model. It does not use those rows to train your machine learning model. In order to accurately evaluate your machine learning model, it needs to be tested on a dataset that was NOT used to train the model. This is the "test" dataset.
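As a generic illustration of the holdout idea described above (this is not Tangram's code, and the sequential split is a simplification; real tools typically shuffle first):

```python
# Generic holdout illustration, not Tangram's implementation.
# The model is fitted only on `train`; `test` is set aside for computing
# test metrics afterwards, so they estimate performance on unseen data.

def holdout_split(rows, test_fraction=0.2):
    n_test = int(len(rows) * test_fraction)
    # Hold the last n_test rows out for testing. Real tools usually shuffle
    # before splitting; kept sequential here for clarity.
    return rows[:len(rows) - n_test], rows[len(rows) - n_test:]

train, test = holdout_split(list(range(10)))
print(train)  # [0, 1, 2, 3, 4, 5, 6, 7]
print(test)   # [8, 9]
```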
I thought that the rows used for model comparison and the ones used for "test metrics" are the same rows. So, is the following drawing correct?
----------------------------------------------------------------------------------------
|                                       Dataset                                        |
----------------------------------------------------------------------------------------
|                     Data for train (80%)                      | Data for tests (20%) |
----------------------------------------------------------------------------------------
| Actual data for train (90%) | Data for model comparison (10%) |
-----------------------------------------------------------------
Is "Data for tests" used only with a single model (the best one)?
I think we misspoke in an earlier comment, this is how the splits are currently done:
---------------------------------------------------------------------------------------
|                                        Dataset                                      |
---------------------------------------------------------------------------------------
|               Training (70%)                | Model Comparison (10%) |  Test (20%)  |
---------------------------------------------------------------------------------------
| Actually Train (90%) | Early Stopping (10%) |
-----------------------------------------------
Let's say we train 10 models: we hold out the "Model Comparison" data and evaluate the 10 models on it to determine "the best" model. The best model is then evaluated on the unseen held-out "Test" dataset. Finally, we use a 90/10 split on the training dataset to hold out an "early stopping" dataset to determine whether we should stop training, e.g. if the model stops improving on this dataset then we shouldn't keep training more epochs (in the case of a linear model) or adding more trees (in the case of a GBDT). This prevents the model from overfitting.
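The scheme just described can be summarized numerically. This is a sketch with assumed floor rounding, not Tangram's actual code, but the numbers it produces for a 7377-row dataset are consistent with the 5165 training rows quoted earlier in this thread:

```python
# Sketch of the split scheme described above (70/10/20 with a further 90/10
# early-stopping split of the training portion). Floor rounding is assumed.

def describe_splits(n_total):
    n_comparison = n_total // 10               # 10% for model comparison
    n_test = n_total * 2 // 10                 # 20% for the test set
    n_train = n_total - n_comparison - n_test  # ~70% for training
    n_early_stop = n_train // 10               # 10% of training, early stopping
    n_fit = n_train - n_early_stop             # rows actually fitted on
    return {
        "comparison": n_comparison,
        "test": n_test,
        "train": n_train,
        "early_stop": n_early_stop,
        "fit": n_fit,
    }

print(describe_splits(7377))
# {'comparison': 737, 'test': 1475, 'train': 5165,
#  'early_stop': 516, 'fit': 4649}
```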
Thanks, finally a precise description of the split sizes. Maybe a similar picture with percentages and actual values should be presented in the GUI?
It seems this question has been answered, so I am going to close this issue.
Is there any limit on the maximum training dataset size? I feed tangram a .csv file having almost 7800 lines of valid data, but only 6640 (including tests) are used by tangram for training.