tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

Incorrect/outdated documentation in README.md #316

Open pratheesh-prakash opened 1 year ago

pratheesh-prakash commented 1 year ago

In general, the documentation provided in README.md is very vague, and doesn't explain the training parameters and their impact on the output model.

Apart from the above, the information provided in the README.md is incorrect and outdated. Here are some major issues I have noticed.

Line 126 of README.md says

FINETUNE_TYPE Finetune Training Type - Impact, Plus, Layer or blank. Default: ''

However, the Makefile doesn't seem to make use of this parameter anywhere. The help text (available through `make help`) also omits this line. Is that because the option is no longer available in later versions, or because the Makefile is outdated? Additionally, there is no information whatsoever on how these values (i.e. Impact, Plus, Layer or '') would influence the training.
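A quick way to confirm this kind of mismatch is to search the Makefile text for an assignment to, or an expansion of, the documented variable. The following is a minimal sketch only: the helper name `is_variable_used` and the sample Makefile text are made up for illustration and are not taken from the tesstrain repository.

```python
import re


def is_variable_used(makefile_text: str, name: str) -> bool:
    """Return True if `name` is assigned or expanded in the Makefile text.

    Hypothetical helper: it looks for $(NAME) / ${NAME} expansions and for
    assignments such as NAME =, NAME :=, NAME ?= or NAME +=.
    Mentions inside comments or help strings do not count.
    """
    n = re.escape(name)
    pattern = re.compile(
        r"\$[({]" + n + r"[)}]|^\s*(?:export\s+)?" + n + r"\s*[:?+]?=",
        re.MULTILINE,
    )
    return bool(pattern.search(makefile_text))


# Made-up Makefile fragment, used only to demonstrate the check.
sample_makefile = (
    "# FINETUNE_TYPE is mentioned here only in a comment\n"
    "MODEL_NAME = foo\n"
    "training:\n"
    "\techo $(MODEL_NAME)\n"
)

print(is_variable_used(sample_makefile, "MODEL_NAME"))
print(is_variable_used(sample_makefile, "FINETUNE_TYPE"))
```

Against the real file, the text would be read with something like `open("Makefile").read()` and the check repeated for every variable listed in README.md.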

To plot the CER, README.md says the user must run `./plot/plot_cer.sh`. Unfortunately, no such shell script exists in `plot`. Additionally, the Python scripts provided in `plot` only work if the log file is first parsed into a CSV.
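For reference, the log-to-CSV step could look roughly like the sketch below. It rests on assumptions: that the training log contains lines of the form `At iteration N/N/N, ... char train=X%, word train=Y%, ...` (the exact wording may differ between Tesseract versions), and that a three-column CSV is acceptable. The function name `log_to_csv` and the column names are hypothetical and are not necessarily what the scripts in `plot` expect.

```python
import csv
import io
import re

# Assumed shape of a training log line (not confirmed in this thread):
#   At iteration 100/100/100, Mean rms=1.5%, delta=4.2%,
#   char train=12.3%, word train=30.1%, skip ratio=0%, ...
LINE_RE = re.compile(
    r"At iteration (\d+)/\d+/\d+,.*?char train=([\d.]+)%.*?word train=([\d.]+)%"
)


def log_to_csv(log_text: str) -> str:
    """Extract iteration and error rates from log text and return CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    # Hypothetical column names, chosen for this example only.
    writer.writerow(["iteration", "char_error_percent", "word_error_percent"])
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match:  # lines that do not match the assumed shape are skipped
            writer.writerow(match.groups())
    return out.getvalue()
```

The returned text can be written to a file and adapted to whatever columns the plotting scripts actually read.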

The documentation also does not explain how to interpret the results, how to optimise the hyperparameters, or how to improve the training data (for example, how to prevent 'Compute CTC targets failed' errors).

It would be great if README.md were updated with the latest information and provided a clearer, more detailed explanation of the various parameters.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stweil commented 1 year ago

@pratheesh-prakash, do you want to send a pull request which improves that documentation?

pratheesh-prakash commented 1 year ago

@stweil: I really wish I could contribute to tesseract-ocr, but I do not have in-depth knowledge of the issues I have raised. I consulted the documentation precisely to clarify those doubts, and found the information either missing or outdated. I would suggest that the update be done by one of the developers.

zdenop commented 1 year ago

Some details and an explanation of what happened are in #257.