rakeshvar / rnn_ctc

Recurrent Neural Network and Long Short Term Memory (LSTM) with Connectionist Temporal Classification implemented in Theano. Includes a Toy training example.
Apache License 2.0

Better GPU support. #7

Closed rakeshvar closed 8 years ago

rakeshvar commented 8 years ago

Currently, training is slower on GPUs than on CPUs because the training data is not a shared variable.
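For reference, the usual Theano pattern is to load the whole training set into a `theano.shared` variable once, so that minibatches are sliced on the device instead of being copied from host memory on every call. A minimal sketch of that pattern (the `make_shared` helper is illustrative, not part of rnn_ctc; the Theano import is deferred so the snippet stays self-contained):

```python
import numpy as np

def make_shared(data_x):
    """Wrap a numpy array in a theano.shared variable.

    When Theano is configured with device=gpu, the shared variable lives in
    GPU memory. Illustrative helper, not an rnn_ctc API.
    """
    import theano  # deferred: requires a Theano installation
    return theano.shared(np.asarray(data_x, dtype=theano.config.floatX),
                         borrow=True)

# With shared data, a training function can then slice minibatches via
# `givens` instead of transferring arrays from the host on each call, e.g.:
#   train = theano.function([index], cost, updates=updates,
#                           givens={x: shared_x[index * bs:(index + 1) * bs]})
```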

aaron-xichen commented 8 years ago

In my experiments, even when the training data is in shared variables, training is still slower on GPUs than on CPUs! Do you have time to optimize it for GPUs?

rakeshvar commented 8 years ago

I too tried moving the variable to the GPU, but the bug is in Theano itself (https://github.com/Theano/Theano/issues/1168), so the current setup seems to be the best as is. There is no point running RNNs on GPUs with Theano.

By the way, could you get rnn_ctc to work with anything other than the Hindu numerals?

aaron-xichen commented 8 years ago

I just found the reason it is so slow on GPUs: the model is not complicated enough. Once I use 2000 BLSTM hidden units, the GPU load rises to 98%, and it is about 10x faster on GPUs than on CPUs, as shown in the attached screenshot of the timing results.

rakeshvar commented 8 years ago

@aaron-xichen, wow! I cannot believe you can run 2000 BLSTM units. Could you fork this project and push your changes to a branch, so that I can see how you are profiling? I have difficulty profiling. Or you could just send a gist link to the files you changed. Thanks a lot.

aaron-xichen commented 8 years ago

Do you mean that 2000 BLSTM units would be too slow for you to run? In fact I did not change much: only the hidden-unit count, from 90 to 2000, in config number 10 of configurations.py. As for the profiling, I have not cleaned up my code yet (it is a mess, sorry about that). But I can explain the idea: compile multiple Theano functions that each output the intermediate result of a different stage, and time them, e.g. cst1 for stage 1 and cst2 for stages 1 and 2 together. Then cst2 - cst1 gives you the running time of stage 2, roughly. Does that make sense?

rakeshvar commented 8 years ago

It would be much easier for me to just take a look at the profiling code and see what you are doing.

aaron-xichen commented 8 years ago
    import time

    # Each compiled function below recomputes the network up to its own
    # stage, so each stN is a cumulative time and the differences printed
    # at the end isolate the individual stages.
    iter_begin = time.time()
    ntwk.stage1()
    counter1 = time.time()
    st1 = counter1 - iter_begin    # stage 1
    ntwk.stage2()
    counter2 = time.time()
    st2 = counter2 - counter1      # stages 1 + 2 (stage 1 is recomputed)
    ntwk.stage3()
    counter3 = time.time()
    st3 = counter3 - counter2      # stages 1 + 2 + 3
    ntwk.trainer()
    counter4 = time.time()
    st4 = counter4 - counter3      # full forward pass + backprop
    print("stage1:{0:0.3f}, stage2:{1:0.3f}, stage3:{2:0.3f}, bp:{3:0.3f}".
          format(st1, st2 - st1, st3 - st2, st4 - st3))

Is it clear enough?
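The timing trick above can be generalized into a small helper: given a list of stage functions where each one recomputes everything up to its stage, differencing the cumulative wall-clock times estimates each stage in isolation. A hedged pure-Python sketch (the names `profile_stages` and the dummy stages are illustrative, not from rnn_ctc):

```python
import time

def profile_stages(stages):
    """stages: list of (name, fn), where fn() recomputes up to that stage.

    Returns {name: estimated time for that stage alone}, obtained by
    differencing cumulative run times (cst2 - cst1 isolates stage 2).
    """
    estimates = {}
    previous_cumulative = 0.0
    for name, fn in stages:
        start = time.perf_counter()
        fn()  # runs stage 1 .. this stage from scratch
        cumulative = time.perf_counter() - start
        estimates[name] = cumulative - previous_cumulative
        previous_cumulative = cumulative
    return estimates

# Illustrative usage: sleeps stand in for compiled Theano functions, with
# each "stage" taking cumulatively longer because it redoes earlier work.
dummy = [("stage1", lambda: time.sleep(0.01)),
         ("stage2", lambda: time.sleep(0.02)),
         ("stage3", lambda: time.sleep(0.03))]
per_stage = profile_stages(dummy)
```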

rakeshvar commented 8 years ago

So you are compiling three functions? stage1, stage2 and stage3?

aaron-xichen commented 8 years ago

In fact those lines run the three functions; they do not compile them. The compiling code is like this:

    self.stage1 = th.function(
        inputs=[],
        outputs=[layer1.output],
        givens={image: x})

    self.stage2 = th.function(
        inputs=[],
        outputs=[layer2.output],
        givens={image: x})

    self.stage3 = th.function(
        inputs=[],
        outputs=[layer3.cost],
        givens={image: x, labels: y})

    self.trainer = th.function(
        inputs=[],
        outputs=[layer3.cost],
        updates=updates,
        givens={image: x, labels: y})

rakeshvar commented 8 years ago

Got it. Thanks for the profiling. My other question: did you get any other example to work (other than the Hindu numerals)?

aaron-xichen commented 8 years ago

In fact I haven't tried yet. This result is based on my own dataset, which is similar to the ascii dataset.

WesleyZhang1991 commented 8 years ago

Great to hear that the GPU can run faster. However, with a K20 GPU it seems that compilation is a lot slower than on the CPU: I have to wait a long time and it is still in the "building model" stage. How long does it take to start training with 2000 BLSTM units on a GPU?

aaron-xichen commented 8 years ago

The first run may cost a lot, maybe 1.5 min on a TITAN X. However, Theano has a caching mechanism that accelerates the compilation stage on subsequent runs; those cost me less than 1 min. Hope it helps.