harinath-palavalli / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

shapeclustering - is it parallelizable? #1274

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
I am training tesseract 3.03 with only a few fonts but a lot of training text 
data (built via text2image).  Everything trains fine, but in this scenario I 
have two fonts, and each .tr file is about 750MB.

shapeclustering has been running for 30 minutes, and it appears it will take at 
least another 90 minutes to finish.

Is it parallelizable?  That is, would it be possible to pass a parameter such as:

--cores=5

Original issue reported on code.google.com by monte.sh...@gmail.com on 14 Aug 2014 at 1:27

GoogleCodeExporter commented 9 years ago
Here are the character-level frequencies of the file being clustered, across 
both fonts.

FREQ    CHAR
1801    0
3209    1
2384    2
1681    3
1443    4
1742    5
1179    6
1180    7
1229    8
1182    9
1031    F
67317   e
7080    b
4037    .
5941    ,
237     W
218     H
729     E
402     L
622     S
1718    T
508     R
212     U
550     C
1187    I
144     B
982     A
354     P
41145   i
20000   l
21805   d
12890   p
52204   t
30494   s
40767   o
35508   r
39025   n
6526    y
102     J
474     O
661     N
154     V
38449   a
12544   f
20174   c
25493   h
513     j
11286   m
12340   g
355     G
2473    -
263     M
280     D
6370    w
6454    v
13664   u
886     x
172     :
1762    k
55      K
103     Y
287     ;
467     z
368     q
5       Z &
43      \
14      Q
46      +
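
For anyone who wants to produce a similar table for their own training text, here is a minimal sketch. It assumes the training text is available as a plain Python string; the function names `char_frequencies` and `format_table` are made up for illustration and are not part of tesseract, and the characters are listed in order of first appearance, which is only one possible ordering.

```python
from collections import Counter


def char_frequencies(text: str) -> Counter:
    """Count every non-whitespace character in the given text."""
    return Counter(ch for ch in text if not ch.isspace())


def format_table(freqs: Counter) -> str:
    """Render counts as a two-column FREQ/CHAR table.

    Rows follow the order in which characters were first seen
    (Counter keeps insertion order on Python 3.7+).
    """
    lines = ["FREQ    CHAR"]
    for ch, count in freqs.items():
        # Left-align the count in an 8-character column.
        lines.append(f"{count:<8}{ch}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical sample; replace with the contents of the training text.
    sample = "the test text"
    print(format_table(char_frequencies(sample)))
```

A table like this makes it easy to see which characters dominate the training data and which are rare.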

Original comment by monte.sh...@gmail.com on 14 Aug 2014 at 2:10

GoogleCodeExporter commented 9 years ago
Generating the shapetable (the shapeclustering step) took a little over 12 
hours on the data above.  Only a single core is being used, on a server that 
has 24 cores and plenty of RAM.

mftraining is now running, and I expect it to take a similar amount of time.  
Can this not be made faster?

Original comment by monte.sh...@gmail.com on 15 Aug 2014 at 8:09