keras-team / keras

Deep Learning for humans
http://keras.io/
Apache License 2.0

Copy full data from CPU's memory to GPU's #11571

Closed borisRa closed 3 years ago

borisRa commented 5 years ago

Hi ,

I am running a small network with an embedding layer and a matrix multiplication in Keras. This relatively simple network, with about 500 MB of data, takes ages even on a DGX-1.

Analyzing this with the Nvidia profiler, it seems that most of the time is spent on MemCpy: it keeps moving data from the server's memory into the GPU, and also on switching between the different kernels (see screenshots below).

I assume that if I could transfer all my data to GPU memory at once, it would improve performance dramatically (similar to PyTorch).

Any help to solve this issue would be highly appreciated! Boris
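The idea above can be sketched as follows (an illustrative sketch only, assuming the TensorFlow backend; the arrays are toy stand-ins for the real user/item/rating data): feed the model through a `tf.data` pipeline that prefetches batches, so the host-to-device copy of batch N+1 overlaps with compute on batch N instead of stalling it.

```python
import numpy as np
import tensorflow as tf

# Toy stand-ins for the real user/item/rating arrays (hypothetical sizes).
user_ids = np.random.randint(0, 6041, size=10_000).astype("int64")
item_ids = np.random.randint(0, 3378, size=10_000).astype("int64")
ratings = np.random.randint(0, 2, size=10_000).astype("float32")

# Batch on the host and prefetch, so the CPU->GPU copy of the next
# batch overlaps with compute on the current one.
ds = (tf.data.Dataset.from_tensor_slices(((user_ids, item_ids), ratings))
      .batch(1024)
      .prefetch(tf.data.AUTOTUNE))

num_batches = sum(1 for _ in ds)  # 10_000 examples / 1024 per batch -> 10 batches
```

Such a dataset can be passed directly to `model.fit(ds, ...)` in place of the list of arrays.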

[screenshot: Nvidia profiler timeline, time dominated by MemCpy]

[screenshot: Nvidia profiler view of frequent kernel switches]

borisRa commented 5 years ago

Anyone? :)

farizrahman4u commented 5 years ago

Post code?

borisRa commented 5 years ago

@farizrahman4u

from keras.layers import Input, Embedding, Flatten, Dropout, Dense, concatenate
from keras.models import Model

def get_model(num_of_classes, n_items, n_users, n_latent_factors):

    item_input = Input(shape=[1], name='Item')
    movie_embedding = Embedding(input_dim=n_items + 1, output_dim=n_latent_factors, name='Movie-Embedding')(item_input)
    movie_vec = Flatten(name='FlattenMovies')(movie_embedding)

    user_input = Input(shape=[1], name='User')
    user_embedding = Embedding(n_users + 1, n_latent_factors, name='User-Embedding')(user_input)
    user_vec = Flatten(name='FlattenUsers')(user_embedding)

    # Note: despite the name, this is a concatenation, not a dot product.
    dot_prod = concatenate([movie_vec, user_vec], axis=-1)
    input_vecs = Dropout(0.5)(dot_prod)
    predictions = Dense(units=1, activation='sigmoid', name='prediction')(input_vecs)

    model = Model(inputs=[user_input, item_input], outputs=predictions)
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

    return model

def run_code(train, test, n_items, n_users, callbacks_list, user_id_col, item_id_col, rating_col, epochs, y_train, y_test, n_latent_factors=3):

    num_of_classes = 1
    model = get_model(num_of_classes, n_items, n_users, n_latent_factors)

    (callbacks_list_ans, model_path) = set_model_name_and_delete_model_from_local_storage(callbacks_list, model_name="als_as_classification_problem_implicit")
    history = model.fit([train[user_id_col], train[item_id_col]], y=y_train, epochs=epochs, verbose=1,
                        validation_data=([test[user_id_col], test[item_id_col]], y_test),
                        callbacks=callbacks_list_ans, batch_size=1128)

    return history

train & test are pandas DataFrames.


from keras.callbacks import EarlyStopping, ModelCheckpoint

earlystop = EarlyStopping(monitor='val_loss', min_delta=0, patience=5, verbose=0, mode='auto')
model_checkpoint = ModelCheckpoint(filepath='Best_model.h5', monitor='val_loss', verbose=1, save_best_only=True, mode='auto')
callbacks_list = [earlystop, model_checkpoint]

rating_col = "rating"; n_items = 3377; n_users = 6040; prediction_col = "predicted_rating"; item_id_col = "item_id"; user_id_col = "user_id"
n_latent_factors = 3  # was undefined in the snippet; matches the function's default

history = run_code(train, test, n_items, n_users, callbacks_list, user_id_col, item_id_col, rating_col, epochs=20, y_train=train[rating_col], y_test=test[rating_col], n_latent_factors=n_latent_factors)
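For comparison with the PyTorch-style "copy everything to the GPU once" idea raised above, a rough TensorFlow analogue (a sketch only, not part of the original code; the tensor values are placeholders) is to materialize the arrays inside an explicit device scope, so subsequent ops read device memory instead of re-copying from the host:

```python
import tensorflow as tf

# Pick the GPU if one is visible, otherwise fall back to CPU.
device = "/GPU:0" if tf.config.list_physical_devices("GPU") else "/CPU:0"

# Copy the (modest, ~500 MB in this issue) arrays to device memory once.
with tf.device(device):
    users_dev = tf.constant([0, 1, 2], dtype=tf.int64)
    items_dev = tf.constant([10, 20, 30], dtype=tf.int64)
```

Whether `model.fit` then avoids further host-to-device copies depends on how the data is fed, so profiling again after the change is the only reliable check.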
borisRa commented 5 years ago

@farizrahman4u ?