githubharald / CTCWordBeamSearch

Connectionist Temporal Classification (CTC) decoder with dictionary and language model.
https://towardsdatascience.com/b051d28f3d2e
MIT License

optimizations for ARM #40

Closed iit2014128 closed 4 years ago

iit2014128 commented 4 years ago

Has anyone tried this algorithm on an ARM architecture? It is taking very long (around 7 seconds) for an input of dimension 700x80 with beam width 100 on an ARM processor, which is around 5 times slower than on x86 (1.4 seconds) with the same hyperparameters. Is there any optimization we can do to reduce the execution time on ARM and bring it down to at least match x86?

githubharald commented 4 years ago

Which mode do you use? I would suggest using only the "Words" or "NGrams" mode, as they are much faster than the forecast modes while still achieving good accuracy. Then, limit the beam width (see README): beyond a certain point, increasing it gives only a small accuracy improvement while slowing the algorithm down quite a lot. Something around 30 should give a reasonable trade-off.
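A small timing harness along these lines can help find that trade-off for a given setup; decode below is only a stand-in for however the word beam search decoder is invoked in your program, not the repository's actual API:

#include <chrono>
#include <iostream>
#include <vector>

// Stand-in for the real decoder call; wire this up to the word beam search code.
std::vector<int> decode(const std::vector<std::vector<float>>& mat, int beamWidth)
{
    return {};
}

int main()
{
    // dummy 700x80 input matrix, matching the dimensions reported above
    const std::vector<std::vector<float>> mat(700, std::vector<float>(80, 0.0f));
    for (int beamWidth : {15, 30, 50, 100})
    {
        const auto t0 = std::chrono::steady_clock::now();
        decode(mat, beamWidth);
        const auto t1 = std::chrono::steady_clock::now();
        std::cout << "beam width " << beamWidth << ": "
                  << std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count()
                  << " ms\n";
    }
}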

iit2014128 commented 4 years ago

We are using Words mode.

githubharald commented 4 years ago

and beam width?

iit2014128 commented 4 years ago

100

githubharald commented 4 years ago

try to go down to 30.

githubharald commented 4 years ago

and about which hardware are we talking? there is a wide range of ARM processors. can you give more details?

iit2014128 commented 4 years ago

for beam width 30 it runs in about 2 seconds, but the CER gets worse by about 5%.

githubharald commented 4 years ago
  1. did you compile with parallel mode? https://github.com/githubharald/CTCWordBeamSearch#1-compile If you have a batch with multiple elements, this might also improve the runtime.
  2. please see my last question about the hardware.

iit2014128 commented 4 years ago

yes trying to get more info about hardware, will share in a minute

iit2014128 commented 4 years ago

ARMv7 processor rev 0 (v7l)

iit2014128 commented 4 years ago

No, we didn't compile in parallel mode. We are not using TensorFlow.

iit2014128 commented 4 years ago

How can we use parallel mode with the C++ test program?

githubharald commented 4 years ago

parallel mode is only implemented for TF. And it only makes sense when a batch is processed. Do you use batches, or do you process single elements (e.g., just one input image at a time)?

iit2014128 commented 4 years ago

No, we are not using batches. We are feeding the logits directly (the input to the model is an array of (x, y) pen coordinates; we are not using images). Any suggestions to optimize the single-element case?

githubharald commented 4 years ago

Start with the simple things:

  1. search for a good beam width, which is both fast and accurate.
  2. make sure to use some fast way to pass the data into the C++ program. CSV files are used in the test program, but they are not a good idea if runtime matters.
  3. ideally, plug the C++ code into the main program, as I did for TF. Then the data is passed directly from TF to the C++ program instead of being written to a file first (see the sketch below).

If this does not help, then there is no way around profiling the program on the hardware and searching for performance bottlenecks.
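Regarding points 2 and 3, the idea is to hand the logits to the decoder as an in-memory buffer instead of going through a CSV file. A rough sketch (the function and buffer names are placeholders, not the repository's API):

#include <cstddef>
#include <vector>

// Hypothetical in-memory handover: the producing code (e.g. the network
// runtime) fills a flat float buffer of shape [timesteps x classes]; the
// decoder then reads it directly, so no CSV file is ever written or parsed.
std::vector<std::vector<float>> wrapLogits(const float* data,
                                           std::size_t timesteps,
                                           std::size_t classes)
{
    std::vector<std::vector<float>> mat(timesteps, std::vector<float>(classes));
    for (std::size_t t = 0; t < timesteps; ++t)
        for (std::size_t c = 0; c < classes; ++c)
            mat[t][c] = data[t * classes + c];
    return mat;
}

The resulting matrix can then be passed to the decoder within the same process, which avoids both the file I/O and the CSV parsing.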

iit2014128 commented 4 years ago

we tried profiling this program and found that push_back on vector<vector<>> (e.g. wordList) takes the majority of the time. is there any alternative to this, or any way we can make it faster?

githubharald commented 4 years ago

the variable wordList is only used when the language model is created - which should only happen once (during initialization). Are you creating the language model for each sample you decode? Or are you even starting the program for each input file, running it to decode the sample, and then terminating it again?
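To make the intended usage pattern explicit: build the language model once at startup and reuse it for every decoded sample. The types and function names below are placeholders, not the actual classes in this repository:

#include <string>
#include <vector>

// Placeholder types standing in for the real decoder/LM classes; the point is
// only the call pattern: expensive setup once, cheap decode call per sample.
struct LanguageModel { /* prefix tree, n-gram counts, ... */ };

LanguageModel buildLanguageModel(const std::string& corpusPath) { return {}; }

std::string decodeSample(const std::vector<std::vector<float>>& mat,
                         const LanguageModel& lm, int beamWidth)
{
    return {};
}

int main()
{
    const LanguageModel lm = buildLanguageModel("corpus.txt"); // do this exactly once

    std::vector<std::vector<std::vector<float>>> samples;      // all inputs to decode
    for (const auto& mat : samples)
        decodeSample(mat, lm, 30);                             // reuse the same model
}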

iit2014128 commented 4 years ago

we are creating the language model only once. newBeam->wordHist.push_back(newBeam->m_wordDev) is getting called approx. 160000 times, so this is taking more time, while wordList is only called around 65000 times (approx.).

githubharald commented 4 years ago

you said that you use Words mode. But the code newBeam->m_wordHist.push_back(newBeam->m_wordDev) is not called in words mode. Please clarify.

iit2014128 commented 4 years ago

Sorry, previously we were using Words mode but have now switched to NGrams mode.

githubharald commented 4 years ago

you could try to move m_wordDev instead of copying it - add a std::move and comment out the second line:

newBeam->m_wordHist.push_back(std::move(newBeam->m_wordDev));
//newBeam->m_wordDev.clear();
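
As a standalone illustration of why this helps (not the repository code): push_back with a copy duplicates every element of the inner vector, while std::move only transfers its internal buffer.

#include <utility>
#include <vector>

int main()
{
    std::vector<std::vector<unsigned>> wordHist;
    std::vector<unsigned> wordDev = {1, 2, 3, 4};

    // copying: wordHist.push_back(wordDev) would allocate and copy all elements.
    // moving: the buffer is transferred, leaving wordDev empty (valid but
    // unspecified), so the subsequent clear() becomes redundant.
    wordHist.push_back(std::move(wordDev));
}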

githubharald commented 4 years ago

closing because of inactivity.