nfmcclure / tensorflow_cookbook

Code for Tensorflow Machine Learning Cookbook
https://www.packtpub.com/big-data-and-business-intelligence/tensorflow-machine-learning-cookbook-second-edition
MIT License

Using test data for training #113

Closed Keramatfar closed 6 years ago

Keramatfar commented 6 years ago

Thanks to the author. In the section "Working with bag of words", the algorithm uses all the data to build the vocabulary:

vocab_processor = learn.preprocessing.VocabularyProcessor(sentence_size, min_frequency=min_word_freq)
vocab_processor.fit_transform(texts)

But it may not be correct to use test data to train the model.
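For illustration, the alternative being suggested here (building the vocabulary from the training texts only) might look roughly like the sketch below. It is not the cookbook's code: `build_vocab`, `encode`, and the toy texts are hypothetical stand-ins written in plain Python, and they only approximate what a vocabulary processor does.

```python
from collections import Counter

def build_vocab(texts, min_frequency=1):
    """Map each word seen at least min_frequency times to an integer id.

    Id 0 is reserved for both padding and unknown words (an assumption
    made for this sketch).
    """
    counts = Counter(word for text in texts for word in text.split())
    kept = sorted(w for w, c in counts.items() if c >= min_frequency)
    return {word: idx for idx, word in enumerate(kept, start=1)}

def encode(text, vocab, sentence_size):
    """Convert a text to a fixed-length list of ids, padding with 0."""
    ids = [vocab.get(word, 0) for word in text.split()][:sentence_size]
    return ids + [0] * (sentence_size - len(ids))

# Hypothetical toy data, already split into train and test.
train_texts = ["free prize now", "call me now"]
test_texts = ["free call tomorrow"]

# The vocabulary is fit on the training texts only ...
vocab = build_vocab(train_texts)

# ... and then applied, unchanged, to both splits.
train_ids = [encode(t, vocab, 4) for t in train_texts]
test_ids = [encode(t, vocab, 4) for t in test_texts]
```

With this arrangement, a word that appears only in the test texts ("tomorrow" here) is mapped to the unknown id instead of receiving its own entry.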

nfmcclure commented 6 years ago

Hi @Keramatfar, thanks for the question. I should add a section explaining why we can use the whole dataset here. I'll add a formal explanation in the notebook during my code rewrite over the next few months (so I'll keep this issue open for now).

But a short explanation is that the word vector methods do not use the target information to train the embeddings. Because of this, you can think of them as a kind of "unsupervised" method. Technically, the word vector methods are supervised, but they generate their own labeled targets from the sequence of tokens in a token window. They don't use the problem-specific y-targets (the document categories) to train the embeddings, so they can use the whole text. It also lets us observe the whole vocabulary in the data (increasing the observed word counts).
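As a rough sketch of what "generating the labeled target from a token window" can mean, here is a skip-gram style pair generator in plain Python. The function name `skipgram_pairs` and the example sentence are hypothetical, and the notebooks may build their training pairs differently. Note that no document category appears anywhere in it.

```python
def skipgram_pairs(tokens, window_size=1):
    """Return (center, context) pairs for every token and its neighbors.

    The "label" for each center word is simply a nearby word, so the
    targets come from the text itself, not from document categories.
    """
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs("the quick brown fox".split(), window_size=1)
```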

I hope that helps.

Keramatfar commented 6 years ago

@nfmcclure, in the real world, when training a model, we have access to neither the test texts nor the test labels.

nfmcclure commented 6 years ago

Hi @Keramatfar,

I'm still not sure what the problem is. In any problem, real or not, you have a set of data (just one set).

Then you split the dataset yourself into training and test sets. You create these manually from the single dataset that you have.

In most ML problems, you train the algorithm on the 'training set' and test it on the 'test set'. Again, both of these sets come from the original set that you decide how to split up.
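A minimal sketch of such a split, assuming the data is held as parallel lists of texts and labels. The helper `train_test_split` below is a hypothetical plain-Python function written for this example, and the 80/20 ratio and the seed are arbitrary choices.

```python
import random

def train_test_split(texts, labels, test_fraction=0.2, seed=0):
    """Shuffle indices once, then cut them into train and test parts."""
    indices = list(range(len(texts)))
    random.Random(seed).shuffle(indices)
    n_test = int(len(indices) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return (
        [texts[i] for i in train_idx],
        [texts[i] for i in test_idx],
        [labels[i] for i in train_idx],
        [labels[i] for i in test_idx],
    )

# Hypothetical single dataset of ten labeled texts.
texts = ["text %d" % i for i in range(10)]
labels = [i % 2 for i in range(10)]

x_train, x_test, y_train, y_test = train_test_split(texts, labels)
```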

Here is a similar question with some good responses, that I recommend reading through: https://stats.stackexchange.com/questions/19048/what-is-the-difference-between-test-set-and-validation-set

If you have further questions about the code itself, bugs you have observed, or any features you want to see (with specifics), feel free to bring those up in a separate issue.

GitHub issues are not meant to address general math or high-level machine learning concepts. I'm going to close this issue for now.