tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Add man pages for new programs #871

Closed: Shreeshrii closed this 6 years ago

Shreeshrii commented 7 years ago

The following installed programs have no man pages in https://github.com/tesseract-ocr/tesseract/tree/master/doc

libtool: install: /usr/bin/install -c .libs/ambiguous_words /usr/local/bin/ambiguous_words
libtool: install: /usr/bin/install -c .libs/classifier_tester /usr/local/bin/classifier_tester
libtool: install: /usr/bin/install -c .libs/cntraining /usr/local/bin/cntraining
libtool: install: /usr/bin/install -c .libs/combine_tessdata /usr/local/bin/combine_tessdata
libtool: install: /usr/bin/install -c .libs/dawg2wordlist /usr/local/bin/dawg2wordlist
libtool: install: /usr/bin/install -c .libs/lstmeval /usr/local/bin/lstmeval
libtool: install: /usr/bin/install -c .libs/lstmtraining /usr/local/bin/lstmtraining
libtool: install: /usr/bin/install -c .libs/mftraining /usr/local/bin/mftraining
libtool: install: /usr/bin/install -c .libs/set_unicharset_properties /usr/local/bin/set_unicharset_properties
libtool: install: /usr/bin/install -c .libs/shapeclustering /usr/local/bin/shapeclustering
libtool: install: /usr/bin/install -c .libs/text2image /usr/local/bin/text2image
libtool: install: /usr/bin/install -c .libs/unicharset_extractor /usr/local/bin/unicharset_extractor
libtool: install: /usr/bin/install -c .libs/wordlist2dawg /usr/local/bin/wordlist2dawg
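
For anyone who wants to re-run this check against a local checkout, here is a minimal sketch. It only compares the program names from the install log above with file names in a doc directory. The helper name missing_man_pages and the suffixes it looks for (".1" and ".1.asc") are assumptions for illustration, not something this issue specifies.

```python
from pathlib import Path
import sys

# Program names taken from the libtool install log above.
PROGRAMS = [
    "ambiguous_words", "classifier_tester", "cntraining", "combine_tessdata",
    "dawg2wordlist", "lstmeval", "lstmtraining", "mftraining",
    "set_unicharset_properties", "shapeclustering", "text2image",
    "unicharset_extractor", "wordlist2dawg",
]


def missing_man_pages(doc_dir, programs, suffixes=(".1", ".1.asc")):
    """Return the programs with no <name><suffix> file in doc_dir.

    The default suffixes are an assumption about how man page sources
    are named; pass your own if the checkout uses something else.
    """
    doc = Path(doc_dir)
    return [
        name for name in programs
        if not any((doc / f"{name}{suffix}").is_file() for suffix in suffixes)
    ]


if __name__ == "__main__":
    # Usage: python check_man_pages.py path/to/tesseract/doc
    doc_dir = sys.argv[1] if len(sys.argv) > 1 else "doc"
    for name in missing_man_pages(doc_dir, PROGRAMS):
        print(name)
```
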
Shreeshrii commented 7 years ago
USAGE: classifier_tester [.tr files ...]
  --debug_level  Level of Trainer debugging  (type:int default:0)
  --load_images  Load images with tr files  (type:int default:0)
  --clusterconfig_min_samples_fraction  Min number of samples per proto as % of total  (type:double default:0.625)
  --clusterconfig_max_illegal  Max percentage of samples in a cluster which have more than 1 feature in that cluster  (type:double default:0.05)
  --clusterconfig_independence  Desired independence between dimensions  (type:double default:1)
  --clusterconfig_confidence  Desired confidence in prototypes created  (type:double default:1e-06)
  --classifier  Classifier to test  (type:string default:)
  --lang  Language to test  (type:string default:eng)
  --tessdata_dir  Directory of traineddata files  (type:string default:)
  --configfile  File to load more configs from  (type:string default:)
  --D  Directory to write output files to  (type:string default:)
  --F  File listing font properties  (type:string default:font_properties)
  --X  File listing font xheights  (type:string default:)
  --U  File to load unicharset from  (type:string default:unicharset)
  --O  File to write unicharset to  (type:string default:)
  --output_trainer  File to write trainer to  (type:string default:)
  --test_ch  UTF8 test character string  (type:string default:)
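
Since every entry in this usage output has the same shape (flag name, description, then "(type:... default:...)"), it could serve as raw material for the OPTIONS section of a man page. The sketch below is one possible way to parse it. The names Flag, parse_usage and render_options are hypothetical, and the plain-text rendering is only a draft layout, not the format the existing man pages use.

```python
import re
from typing import List, NamedTuple


class Flag(NamedTuple):
    name: str
    description: str
    value_type: str
    default: str


# One entry looks like:
#   --lang  Language to test  (type:string default:eng)
ENTRY_RE = re.compile(
    r"--(?P<name>\S+)\s+(?P<desc>.*)"
    r"\(type:(?P<type>\w+) default:(?P<default>.*)\)$",
    re.DOTALL,
)


def parse_usage(text: str) -> List[Flag]:
    """Parse usage text (and nothing else) into Flag records.

    Lines that do not start with "--" are treated as continuations of
    the previous entry and joined with a space; a blank line ends the
    option list.
    """
    entries: List[str] = []
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("--"):
            if current is not None:
                entries.append(current)
            current = stripped
        elif stripped and current is not None:
            current += " " + stripped
        elif current is not None:
            entries.append(current)
            current = None
    if current is not None:
        entries.append(current)

    flags = []
    for entry in entries:
        match = ENTRY_RE.match(entry)
        if match:  # entries that do not fit the pattern are skipped
            flags.append(Flag(match["name"], match["desc"].strip(),
                              match["type"], match["default"]))
    return flags


def render_options(flags: List[Flag]) -> str:
    """Render the flags as a plain-text draft of an OPTIONS section."""
    lines = []
    for flag in flags:
        default = flag.default if flag.default else "(empty)"
        lines.append(f"--{flag.name} ({flag.value_type}, default: {default})")
        lines.append(f"    {flag.description}")
    return "\n".join(lines)
```

Feeding it the captured output of a program's usage message and printing render_options(parse_usage(text)) gives a starting point that still needs hand editing, for example to drop flags that a given program inherits but does not actually use.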

from https://github.com/tesseract-ocr/tesseract/blob/master/training/classifier_tester.cpp


// This program has complex setup requirements, so here is some help:
// Two different modes, tr files and serialized mastertrainer.
// From tr files:
//   classifier_tester -U unicharset -F font_properties -X xheights
//     -classifier x -lang lang [-output_trainer trainer] *.tr
// From a serialized trainer:
//  classifier_tester -input_trainer trainer [-lang lang] -classifier x
//
// In the first case, the unicharset must be the unicharset from within
// the classifier under test, and the font_properties and xheights files must
// match the files used during training.
// In the second case, the trainer file must have been prepared from
// some previous run of shapeclustering, mftraining, or classifier_tester
// using the same conditions as above, ie matching unicharset/font_properties.
//
// Available values of classifier (x above) are:
// pruner   : Tesseract class pruner only.
// full     : Tesseract full classifier.
//            with an input trainer.)
Shreeshrii commented 7 years ago
USAGE: lstmeval [.tr files ...]
  --max_image_MB  Max memory to use for images.  (type:int default:2000)
  --debug_level  Level of Trainer debugging  (type:int default:0)
  --load_images  Load images with tr files  (type:int default:0)
  --clusterconfig_min_samples_fraction  Min number of samples per proto as % of total  (type:double default:0.625)
  --clusterconfig_max_illegal  Max percentage of samples in a cluster which have more than 1 feature in that cluster  (type:double default:0.05)
  --clusterconfig_independence  Desired independence between dimensions  (type:double default:1)
  --clusterconfig_confidence  Desired confidence in prototypes created  (type:double default:1e-06)
  --model  Name of model file (training or recognition)  (type:string default:)
  --eval_listfile  File listing sample files in lstmf training format.  (type:string default:)
  --configfile  File to load more configs from  (type:string default:)
  --D  Directory to write output files to  (type:string default:)
  --F  File listing font properties  (type:string default:font_properties)
  --X  File listing font xheights  (type:string default:)
  --U  File to load unicharset from  (type:string default:unicharset)
  --O  File to write unicharset to  (type:string default:)
  --output_trainer  File to write trainer to  (type:string default:)
  --test_ch  UTF8 test character string  (type:string default:)

The usage line reads "USAGE: lstmeval [.tr files ...]".

Should that say .lstmf files instead?
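
For context on the question: going by the --eval_listfile description above, the samples reach lstmeval through a list file of .lstmf paths, not as positional arguments. A rough sketch of that workflow follows. The helper names and the directory layout are hypothetical, and the space-separated "--flag value" form is an assumption about how the flags are parsed.

```python
from pathlib import Path


def write_listfile(lstmf_dir, listfile):
    """Write one .lstmf path per line to listfile and return the paths."""
    paths = sorted(str(p) for p in Path(lstmf_dir).glob("*.lstmf"))
    Path(listfile).write_text(
        "".join(path + "\n" for path in paths), encoding="utf-8"
    )
    return paths


def lstmeval_command(model, listfile):
    """Build an argument list using only the --model and --eval_listfile flags."""
    return ["lstmeval", "--model", str(model), "--eval_listfile", str(listfile)]
```

The returned list can be handed to subprocess.run if lstmeval is on the PATH.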

Shreeshrii commented 7 years ago
USAGE: lstmtraining [.tr files ...]
  --debug_interval  How often to display the alignment.  (type:int default:0)
  --train_mode  Controls gross training behavior.  (type:int default:80)
  --net_mode  Controls network behavior.  (type:int default:192)
  --perfect_sample_delay  How many imperfect samples between perfect ones.  (type:int default:4)
  --max_image_MB  Max memory to use for images.  (type:int default:6000)
  --append_index  Index in continue_from Network at which to attach the new network defined by net_spec  (type:int default:-1)
  --max_iterations  If set, exit after this many iterations  (type:int default:0)
  --debug_level  Level of Trainer debugging  (type:int default:0)
  --load_images  Load images with tr files  (type:int default:0)
  --target_error_rate  Final error rate in percent.  (type:double default:0.01)
  --weight_range  Range of initial random weights.  (type:double default:0.1)
  --learning_rate  Weight factor for new deltas.  (type:double default:0.0001)
  --momentum  Decay factor for repeating deltas.  (type:double default:0.9)
  --clusterconfig_min_samples_fraction  Min number of samples per proto as % of total  (type:double default:0.625)
  --clusterconfig_max_illegal  Max percentage of samples in a cluster which have more than 1 feature in that cluster  (type:double default:0.05)
  --clusterconfig_independence  Desired independence between dimensions  (type:double default:1)
  --clusterconfig_confidence  Desired confidence in prototypes created  (type:double default:1e-06)
  --stop_training  Just convert the training model to a runtime model.  (type:bool default:false)
  --debug_network  Get info on distribution of weight values  (type:bool default:false)
  --net_spec  Network specification  (type:string default:)
  --continue_from  Existing model to extend  (type:string default:)
  --model_output  Basename for output models  (type:string default:lstmtrain)
  --script_dir  Required to set unicharset properties or use unicharset compression.  (type:string default:)
  --train_listfile  File listing training files in lstmf training format.  (type:string default:)
  --eval_listfile  File listing eval files in lstmf training format.  (type:string default:)
  --configfile  File to load more configs from  (type:string default:)
  --D  Directory to write output files to  (type:string default:)
  --F  File listing font properties  (type:string default:font_properties)
  --X  File listing font xheights  (type:string default:)
  --U  File to load unicharset from  (type:string default:unicharset)
  --O  File to write unicharset to  (type:string default:)
  --output_trainer  File to write trainer to  (type:string default:)
  --test_ch  UTF8 test character string  (type:string default:)

The usage line reads "USAGE: lstmtraining [.tr files ...]".

Should that say .lstmf files instead?

Shreeshrii commented 7 years ago

USAGE: set_unicharset_properties
  --debug_level  Level of Trainer debugging  (type:int default:0)
  --load_images  Load images with tr files  (type:int default:0)
  --clusterconfig_min_samples_fraction  Min number of samples per proto as % of total  (type:double default:0.625)
  --clusterconfig_max_illegal  Max percentage of samples in a cluster which have more than 1 feature in that cluster  (type:double default:0.05)
  --clusterconfig_independence  Desired independence between dimensions  (type:double default:1)
  --clusterconfig_confidence  Desired confidence in prototypes created  (type:double default:1e-06)
  --script_dir  Directory name for input script unicharsets/xheights  (type:string default:)
  --configfile  File to load more configs from  (type:string default:)
  --D  Directory to write output files to  (type:string default:)
  --F  File listing font properties  (type:string default:font_properties)
  --X  File listing font xheights  (type:string default:)
  --U  File to load unicharset from  (type:string default:unicharset)
  --O  File to write unicharset to  (type:string default:)
  --output_trainer  File to write trainer to  (type:string default:)
  --test_ch  UTF8 test character string  (type:string default:)
Shreeshrii commented 7 years ago

USAGE: text2image
  --exposure  Exposure level in photocopier  (type:int default:0)
  --resolution  Pixels per inch  (type:int default:300)
  --xsize  Width of output image  (type:int default:3600)
  --ysize  Height of output image  (type:int default:4800)
  --margin  Margin round edges of image  (type:int default:100)
  --ptsize  Size of printed text  (type:int default:12)
  --leading  Inter-line space (in pixels)  (type:int default:12)
  --box_padding  Padding around produced bounding boxes  (type:int default:0)
  --glyph_resized_size  Each glyph is square with this side length in pixels  (type:int default:0)
  --glyph_num_border_pixels_to_pad  Final_size=glyph_resized_size+2*glyph_num_border_pixels_to_pad  (type:int default:0)
  --tlog_level  Minimum logging level for tlog() output  (type:int default:0)
  --char_spacing  Inter-character space in ems  (type:double default:0)
  --underline_start_prob  Fraction of words to underline (value in [0,1])  (type:double default:0)
  --underline_continuation_prob  Fraction of words to underline (value in [0,1])  (type:double default:0)
  --min_coverage  If find_fonts==true, the minimum coverage the font has of the characters in the text file to include it, between 0 and 1.  (type:double default:1)
  --degrade_image  Degrade rendered image with speckle noise, dilation/erosion and rotation  (type:bool default:true)
  --rotate_image  Rotate the image in a random way.  (type:bool default:true)
  --strip_unrenderable_words  Remove unrenderable words from source text  (type:bool default:true)
  --ligatures  Rebuild and render ligatures  (type:bool default:false)
  --find_fonts  Search for all fonts that can render the text  (type:bool default:false)
  --render_per_font  If find_fonts==true, render each font to its own image. Image filenames are of the form output_name.font_name.tif  (type:bool default:true)
  --list_available_fonts  List available fonts and quit.  (type:bool default:false)
  --render_ngrams  Put each space-separated entity from the input file into one bounding box. The ngrams in the input file will be randomly permuted before rendering (so that there is sufficient variety of characters on each line).  (type:bool default:false)
  --output_word_boxes  Output word bounding boxes instead of character boxes. This is used for Cube training, and implied by --render_ngrams.  (type:bool default:false)
  --bidirectional_rotation  Rotate the generated characters both ways.  (type:bool default:false)
  --only_extract_font_properties  Assumes that the input file contains a list of ngrams. Renders each ngram, extracts spacing properties and records them in output_base/[font_name].fontinfo file.  (type:bool default:false)
  --output_individual_glyph_images  If true also outputs individual character images  (type:bool default:false)
  --text  File name of text input to process  (type:string default:)
  --outputbase  Basename for output image/box file  (type:string default:)
  --writing_mode  Specify one of the following writing modes.
'horizontal' : Render regular horizontal text. (default)
'vertical' : Render vertical text. Glyph orientation is selected by Pango.
'vertical-upright' : Render vertical text. Glyph  orientation is set to be upright.  (type:string default:horizontal)
  --font  Font description name to use  (type:string default:Arial)
  --unicharset_file  File with characters in the unicharset. If --render_ngrams is true and --unicharset_file is specified, ngrams with characters that are not in unicharset will be omitted  (type:string default:)
  --fontconfig_tmpdir  Overrides fontconfig default temporary dir  (type:string default:/tmp)
  --fonts_dir  If empty it use system default. Otherwise it overrides system default font location  (type:string default:)
Shreeshrii commented 6 years ago

Closing this as a duplicate of the issue filed by @jbreiden: missing manpages for v4 training binaries #1297