tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

Plotting – again #377

Closed bertsky closed 2 months ago

bertsky commented 4 months ago

After several attempts by @Shreeshrii to share her excellent plotting scripts, each of which was unfortunately thwarted by bad circumstances (other big changes occurring at the same time), here comes a plotting facility again.

I based this on the ocrddata branch of her fork, cherry-picking only the two relevant changesets, resolving conflicts and then refactoring to make this better fit our makefileization.

Usage is simply make plot, which will only work after make training. (I could also make this dependency explicit, but that would cause make plot to start the training if it did not happen already for that combination of variables.)

The output files will be created in $OUTPUT_DIR/$MODEL_NAME.plot_log.png, e.g.

[image: herrnhut-kurrent tess finetuned-htrbin plot_log]

and $OUTPUT_DIR/$MODEL_NAME.plot_cer.png, e.g.

[image: herrnhut-kurrent tess finetuned-htrbin plot_cer]

All intermediate files (except for the lstmeval log files generated under $OUTPUT_DIR/eval/*.log because they are valuable in their own right) are marked as such and therefore removed by make.

Perhaps we should discuss how both plots could be combined into a single one (which is probably what @Shreeshrii tried to do already). I can see that there's a problem with the granularity at which these data points are recorded (training iterations for validation during lstmtraining vs. learning iterations for validation afterwards via external lstmeval). But IIUC we have everything it takes to combine them (twin y plot with synced x axes)...
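As a rough illustration of the twin y plot with synced x axes mentioned above, here is a minimal sketch. It is not the code from this PR: the function names, the `iteration_pairs` mapping (learning iteration to training iteration, assumed to be known for each evaluated checkpoint) and all sample numbers are hypothetical.

```python
# Hypothetical sketch; names and data are illustrative, not taken from tesstrain.

def to_training_iterations(eval_points, iteration_pairs):
    """Re-key (learning_iteration, error) points by training iteration.

    iteration_pairs maps learning iteration -> training iteration and is
    assumed to be available for every evaluated checkpoint. Points without
    a known mapping are dropped.
    """
    return [
        (iteration_pairs[learn_it], err)
        for learn_it, err in eval_points
        if learn_it in iteration_pairs
    ]


def plot_combined(train_points, eval_points, out_path):
    """Draw both error curves over one x axis, each with its own y axis."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display required
    import matplotlib.pyplot as plt

    fig, ax_train = plt.subplots()
    ax_eval = ax_train.twinx()  # second y axis that shares the x axis

    ax_train.plot(*zip(*train_points), color="tab:blue")
    ax_eval.plot(*zip(*eval_points), color="tab:red", marker="o")
    ax_train.set_xlabel("training iterations")
    ax_train.set_ylabel("error during training (%)")
    ax_eval.set_ylabel("error from lstmeval (%)")
    fig.savefig(out_path)
    plt.close(fig)
```

With both series keyed by training iteration, a call such as `plot_combined(train_points, to_training_iterations(eval_points, iteration_pairs), "combined.png")` would put them in one figure; whether the two error scales should share a y axis instead is a separate design question.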

bertsky commented 4 months ago

With the last commit I reinstated @Shreeshrii's LOG_FILE variable.

The big advantage is that you can now opt in to plotting even older logs, e.g.

make plot LOG_FILE=nohup.out

zdenop commented 3 months ago

I just ran a quick test on openSUSE (15.5) and here are a few suggestions:

stweil commented 3 months ago

It would be good to mention that users should install matplotlib and pandas (pip3 install matplotlib pandas) before running make plot.

Both are mentioned in requirements.txt, so running pip3 install -r requirements.txt is sufficient.

bertsky commented 3 months ago

@zdenop very good points – thanks! I'll address these in a few follow-up commits.

bertsky commented 3 months ago

Some of the commits could already be applied to the main branch. Would it be okay if I cherry-pick them? You (or I) would have to rebase your plotting branch after that. I think we can improve plotting faster by getting it integrated like that.

@stweil why the hurry all of a sudden? This does not sit well with my workflow, and I'm already in too many places at a time...

bertsky commented 3 months ago

All done. @zdenop do you want me to add some target for doing the pip install requirements, too?

Also, how about adding the result of the plotting for the ocrd-testset training into the repo and showing it in the readme?

bertsky commented 3 months ago

ocrd-testset training: [image: foo plot_cer]

zdenop commented 3 months ago

Ok for me. After merging, I will try to check it on Windows... (I have some ideas for improvements already)

bertsky commented 3 months ago

So?

zdenop commented 3 months ago

@bertsky: please reformat the Python scripts with blue (or black); there are plenty of formatting issues (missing spaces).

and ruff complains about this:

ruff check plot_cer.py
plot_cer.py:43:1: E741 Ambiguous variable name: `l`
Found 1 error.

bertsky commented 3 months ago

In light of https://github.com/tesseract-ocr/tesseract/issues/3763#issuecomment-2028441608 I tend to prefer changing from fast to best models for evaluation.

bertsky commented 3 months ago

@zdenop

and ruff complains about this:

ruff check plot_cer.py
plot_cer.py:43:1: E741 Ambiguous variable name: `l`
Found 1 error.

I fail to see the ambiguity. The manual says they think l can be easily confused with 1. I really don't care for such subjective stances.
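For readers unfamiliar with the rule, the following sketch shows the kind of rename that satisfies E741. It is a made-up example, not line 43 of plot_cer.py, and the variable names are hypothetical.

```python
# Hypothetical example; this is not the code from plot_cer.py.
iterations = [100, 200, 300]

# Flagged by E741: a bare `l` can look like the digit 1 in some fonts.
l = len(iterations)

# Same value under a descriptive name, which the rule accepts.
checkpoint_count = len(iterations)
```

The rule only looks at the name, so behavior is unchanged by the rename. It can also be suppressed per line with a `# noqa: E741` comment if the short name is preferred.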

please reformat the Python scripts with blue (or black); there are plenty of formatting issues (missing spaces).

Not my code originally, so I don't care. But I also don't believe in code stylers. Since @stweil seems to be an avid user of these, I don't think my help is needed in that respect.

I'll resolve the conflict on the readme arising from your concurrent edits in the plotting section, then I'll be done here, I think.