microsoft / CNTK

Microsoft Cognitive Toolkit (CNTK), an open source deep-learning toolkit
https://docs.microsoft.com/cognitive-toolkit/

The training file is too big to fit on the hard drive #3456

Open · shallowlearner opened this issue 6 years ago

shallowlearner commented 6 years ago

I am trying to build a model to detect punctuation in a plain text file. I have designed a simple multilayer perceptron network to train on the input.

Feature: the target word, the 3 words before it, and the 3 words after it. Label: the punctuation state of the word (dot, period). I then take the word embedding of each word, so a line in the final input file to the network is:

|label: 0 0 0 1 0 |feature: word-embedding representation of the target word plus the 3 words to the left and the 3 words to the right (210 values)

The problem is that this approach makes the input file huge, since every word is represented by 7 word embeddings, and a 3 GB file turns into a 200 GB file. This creates the problem of the file not fitting on the hard drive. Is there any method in the CNTK API to generate and feed data in the same script without creating any intermediate file? I have used this approach: https://cntk.ai/pythondocs/Manual_How_to_feed_data.html, but it seems that each time I call the .train method, training updates itself without considering the old training information.
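For illustration, the kind of on-the-fly generation being asked about can be sketched as a plain Python generator; embedding, label_of, the 30-dimensional vectors (7 x 30 = 210), and the batch size below are stand-ins, not code from the issue:

import numpy as np

def feature_batches(words, embedding, label_of, batch_size=10000, dim=30):
    # embedding: dict word -> np.ndarray of shape (dim,) (assumption)
    # label_of: word -> one-hot label vector (assumption)
    pad = np.zeros(dim, dtype=np.float32)
    feats, labels = [], []
    for i, w in enumerate(words):
        # the target word plus 3 words of context on each side -> 7 * dim = 210 values
        vec = np.concatenate([embedding.get(words[j], pad) if 0 <= j < len(words) else pad
                              for j in range(i - 3, i + 4)])
        feats.append(vec)
        labels.append(label_of(w))
        if len(feats) == batch_size:
            yield np.stack(feats), np.stack(labels)
            feats, labels = [], []
    if feats:  # flush the final partial batch
        yield np.stack(feats), np.stack(labels)

Each yielded (features, labels) pair can be fed straight to a trainer, so the 200 GB intermediate file never has to exist.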

delzac commented 6 years ago

What do you mean by "without considering the old training information"?

shallowlearner commented 6 years ago

import time
import numpy as np
import cntk as C

# Define a small model function (a multilayer perceptron)

x = C.input_variable(input_dim_lr, is_sparse=False)
y = C.input_variable(num_classes_lr, is_sparse=False)

def create_model(features):
    num_hidden_layers = 4
    hidden_layers_dim = 210
    num_output_classes = 12
    with C.layers.default_options(init=C.glorot_uniform(), activation=C.sigmoid):
        h = features
        for _ in range(num_hidden_layers):
            h = C.layers.Dense(hidden_layers_dim)(h)
        last_layer = C.layers.Dense(num_output_classes, activation=None)
        return last_layer(h)

ze = create_model(x)

epoch_size = word_counter
lr_per_sample = 0.001
lr_schedule = C.learning_parameter_schedule_per_sample(lr_per_sample)
mm_per_sample = [0]*5 + [0.9990239141819757]  # 5 epochs without momentum, then switch it on
mm_schedule = C.learners.momentum_schedule_per_sample(mm_per_sample, epoch_size=epoch_size)

loss = C.cross_entropy_with_softmax(ze, y)  # applies softmax to the model output under the hood
eval_error = C.classification_error(ze, y)
print(loss)
progress_writer = C.logging.ProgressPrinter(0)
checkpoint_config = C.CheckpointConfig(filename=dirName + "/" + checkpoint, frequency=100)

def update_train():
    global update
    learner = C.learners.momentum_sgd(ze.parameters, lr_schedule, mm_schedule)
    learner = C.train.distributed.data_parallel_distributed_learner(learner)

    start_data = time.perf_counter()
    X_train_lr, Y_train_lr = generate_data()
    end_data = time.perf_counter()
    print("Data generation time: ", end_data - start_data)

    # shuffle features and labels with the same permutation so they stay aligned
    perm = np.random.permutation(len(X_train_lr))
    X_train_lr, Y_train_lr = X_train_lr[perm], Y_train_lr[perm]

    train_start = time.perf_counter()
    train_summary = loss.train(
        (np.multiply(np.subtract(X_train_lr, mean), InvStdDev).astype(np.float32), Y_train_lr),
        parameter_learners=[learner],
        callbacks=[progress_writer, checkpoint_config])
    train_end = time.perf_counter()
    print("Training time: ", train_end - train_start)
    update = False

So here I divided the training session into two parts: (1) declaring the model and (2) feeding the data into the model. I generate data using the generate_data method and feed it into the training session. The problem is that every time I call the .train method, it does not build on the values from the previous call of the update_train() function.

delzac commented 6 years ago

I'm still not quite sure what you are talking about, but I'll hazard a guess.

When you use the .train method, it's assumed that you feed in the entire dataset, so ideally you should only ever need to call it once.

If you are calling it multiple times, perhaps you are looking for trainer.train_minibatch() instead?
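A minimal sketch of what this train_minibatch approach could look like, reusing the names from the code above; num_chunks, minibatch_size, and the normalization step are assumptions:

# Create the Trainer once, then feed generated chunks repeatedly.
# Parameter and momentum state live in the learner/Trainer, so each call to
# train_minibatch continues from the previous one instead of starting over.
learner = C.learners.momentum_sgd(ze.parameters, lr_schedule, mm_schedule)
trainer = C.Trainer(ze, (loss, eval_error), [learner], [progress_writer])

minibatch_size = 256  # assumption
num_chunks = 20       # however many pieces the data is generated in (assumption)
for chunk in range(num_chunks):
    X_chunk, Y_chunk = generate_data()  # the poster's own generator
    X_chunk = np.multiply(np.subtract(X_chunk, mean), InvStdDev).astype(np.float32)
    for start in range(0, len(X_chunk), minibatch_size):
        trainer.train_minibatch({x: X_chunk[start:start + minibatch_size],
                                 y: Y_chunk[start:start + minibatch_size]})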

Ayush-Rawal commented 5 years ago

Try using streams. I'm not sure how good the support for streams is in CNTK, but they are the perfect solution for this.
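In CNTK the closest built-in notion of a stream that never materializes the data on disk is C.io.UserMinibatchSource. A rough sketch, with make_samples standing in for the poster's feature generation:

import numpy as np
import cntk as C

class OnTheFlySource(C.io.UserMinibatchSource):
    """Serves minibatches generated in memory, so no 200 GB file is written."""
    def __init__(self):
        self.f_info = C.io.StreamInformation('features', 0, 'dense', np.float32, (210,))
        self.l_info = C.io.StreamInformation('labels',   1, 'dense', np.float32, (12,))
        super(OnTheFlySource, self).__init__()

    def stream_infos(self):
        return [self.f_info, self.l_info]

    def next_minibatch(self, num_samples, number_of_workers, worker_rank, device=None):
        # make_samples is a stand-in: it should return `num_samples` feature
        # rows of shape (210,) and matching one-hot labels of shape (12,)
        feats, labels = make_samples(num_samples)
        return {
            self.f_info: C.io.MinibatchData(C.Value(batch=feats), num_samples, num_samples, False),
            self.l_info: C.io.MinibatchData(C.Value(batch=labels), num_samples, num_samples, False),
        }

Such a source can then be handed to loss.train (or read in a trainer loop) by mapping its two streams to the model's x and y inputs via model_inputs_to_streams.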