Closed bertsky closed 2 months ago
With the last commit I reinstated @Shreeshrii's LOG_FILE variable.
The big advantage is that you can now opt in to plotting even older logs, e.g.
make plot LOG_FILE=nohup.out
I just ran a quick test on openSUSE (15.5), and here are a few suggestions:
It would be nice to have a short example of how to make an example plot from example data:
git clone https://github.com/tesseract-ocr/tesstrain
cd tesstrain
mkdir data
unzip ocrd-testset.zip -d data/ocrd-ground-truth
...
# install needed requirements
...
nohup make training MODEL_NAME=ocrd START_MODEL=frk TESSDATA=~/tessdata_best MAX_ITERATIONS=10000 > plot/TESSTRAIN.LOG &
make plot MODEL_NAME=ocrd
I removed python2 from openSUSE and I got this error:
python plot_cer.py data/ocrd ocrd data/ocrd/ocrd.iteration.tsv data/ocrd/ocrd.checkpoint.tsv data/ocrd/ocrd.eval.tsv data/ocrd/ocrd.sub.tsv data/ocrd/ocrd.lstmeval.tsv
/bin/bash: python: command not found
What about using PY_CMD (as in the rest of the Makefile)?
When I manually run python3 plot_cer.py data/ocrd ocrd data/ocrd/ocrd.iteration.tsv data/ocrd/ocrd.checkpoint.tsv data/ocrd/ocrd.eval.tsv data/ocrd/ocrd.sub.tsv data/ocrd/ocrd.lstmeval.tsv
I get this error:
Traceback (most recent call last):
File "/home/podobny/Projekty/tesstrain/plot_cer.py", line 6, in <module>
import matplotlib
ModuleNotFoundError: No module named 'matplotlib'
It would be good to mention that the user should install matplotlib and pandas (pip3 install matplotlib pandas) before running make plot.
Both are mentioned in requirements.txt, so running pip3 install -r requirements.txt is sufficient.
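To make the failure mode above friendlier than a raw ModuleNotFoundError, a plotting script could check for its dependencies up front. This is only a minimal sketch, not code from plot_cer.py: the names REQUIRED and missing_modules are hypothetical, and the only thing taken from the thread is that matplotlib and pandas are needed.

```python
import importlib.util

# Modules named in this thread as required for plotting.
REQUIRED = ("matplotlib", "pandas")


def missing_modules(names=REQUIRED):
    """Return the top-level modules from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def dependency_hint(names=REQUIRED):
    """Return an installation hint, or an empty string if nothing is missing."""
    missing = missing_modules(names)
    if not missing:
        return ""
    return (
        "Missing Python modules: " + ", ".join(missing)
        + ". Try: pip3 install -r requirements.txt"
    )
```

A script could print dependency_hint() and exit before importing matplotlib, so the user sees the pip3 command instead of a traceback.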
@zdenop very good points – thanks! I'll address these in a few follow-up commits.
Some of the commits could already be applied to the main branch. Would it be okay if I cherry-pick them? You (or I) would have to rebase your plotting branch after that. I think we can improve plotting faster by getting it integrated like that.
@stweil why the hurry all of a sudden? This does not bode well with my workflow, and I'm already in too many places at a time...
All done. @zdenop do you want me to add some target for doing the pip install requirements, too?
Also, how about adding the result of the plotting for the ocrd-testset training into the repo and showing it in the readme?
ocrd-testset training:
Ok for me. After merging, I will try to check it on Windows... (I have some ideas for improvements already)
So?
In light of https://github.com/tesseract-ocr/tesseract/issues/3763#issuecomment-2028441608 I tend to prefer changing from fast to best models for evaluation.
@zdenop
and ruff complains about this:
ruff check plot_cer.py
plot_cer.py:43:1: E741 Ambiguous variable name: `l`
Found 1 error.
I fail to see the ambiguity. The manual says they think l can be easily confused with 1. I really don't care for such subjective stances.
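For readers unfamiliar with E741, here is a purely illustrative pair of functions. It is not the actual line 43 of plot_cer.py (which is not quoted in this thread); the function names and the assignment are made up to show what the rule reacts to and what a rename looks like. Adding a `# noqa: E741` comment to the flagged line would be the other way to silence the check.

```python
# Hypothetical example, not taken from plot_cer.py.

def mean_flagged(values):
    l = len(values)  # a single-letter `l` binding is what E741 reports
    return sum(values) / l


def mean_renamed(values):
    count = len(values)  # same behaviour, descriptive name, no E741
    return sum(values) / count
```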
please reformat the Python scripts with blue (or black); there are plenty of formal formatting issues (missing spaces).
Not my code originally, so I don't care. But I also don't believe in code stylers. Since @stweil seems to be an avid user of these, I don't think my help is needed in that respect.
I'll resolve the conflict on the readme arising from your concurrent edits in the plotting section, then I'll be done here, I think.
After several attempts by @Shreeshrii to share her excellent plotting scripts, each of which was unfortunately thwarted by bad circumstances (other big changes occurring at the same time), here comes a plotting facility again.
I based this on the ocrddata branch of her fork, cherry-picking only the two relevant changesets, resolving conflicts and then refactoring to make this better fit our makefileization.
Usage is simply make plot, which will only work after make training. (I could also make this dependency explicit, but that would cause make plot to start the training if it did not happen already for that combination of variables.)
The output files will be created in $OUTPUT_DIR/$MODEL_NAME.plot_log.png, e.g.
![herrnhut-kurrent tess finetuned-htrbin plot_log](https://github.com/tesseract-ocr/tesstrain/assets/38561704/00f43cfc-2104-43ae-abd9-a91a9a96eb1c)
and $OUTPUT_DIR/$MODEL_NAME.plot_cer.png, e.g.
![herrnhut-kurrent tess finetuned-htrbin plot_cer](https://github.com/tesseract-ocr/tesstrain/assets/38561704/a8d1ca21-7362-4465-9b4c-0ffaaf5a067c)
All intermediate files (except for the lstmeval log files generated under $OUTPUT_DIR/eval/*.log, because they are valuable in their own right) are marked as such and therefore removed by make.
Perhaps we should discuss how both plots could be combined into a single one (which is probably what @Shreeshrii tried to do already). I can see that there's a problem with the granularity at which these data points are recorded (training iterations for validation during lstmtraining vs. learning iterations for validation afterwards via external lstmeval). But IIUC we have everything it takes to be able to combine them (twin y plot with synced x axes)...
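To make the "twin y plot with synced x axes" idea concrete, here is a rough sketch with matplotlib. It is not part of this PR: split_xy and plot_combined are hypothetical names, the axis labels are placeholders, and the sketch simply assumes both series have already been mapped onto a common iteration scale, which is exactly the granularity problem mentioned above and is not solved here.

```python
def split_xy(points):
    """Split [(x, y), ...] into x and y lists, sorted by x."""
    ordered = sorted(points)
    return [x for x, _ in ordered], [y for _, y in ordered]


def plot_combined(train_points, eval_points, out_path):
    """Draw both series in one figure with two y axes over one x axis.

    Assumes both inputs are (iteration, error) pairs on the same x scale.
    """
    import matplotlib
    matplotlib.use("Agg")  # render to file, no display needed
    import matplotlib.pyplot as plt

    fig, ax_train = plt.subplots()
    ax_eval = ax_train.twinx()  # second y axis sharing the same x axis

    train_x, train_y = split_xy(train_points)
    eval_x, eval_y = split_xy(eval_points)

    train_line = ax_train.plot(
        train_x, train_y, color="tab:blue", label="during lstmtraining")
    eval_line = ax_eval.plot(
        eval_x, eval_y, color="tab:red", marker="o", label="via lstmeval")

    ax_train.set_xlabel("iteration (assumed common scale)")
    ax_train.set_ylabel("error during training (%)")
    ax_eval.set_ylabel("error from lstmeval (%)")
    ax_train.legend(handles=train_line + eval_line, loc="upper right")

    fig.tight_layout()
    fig.savefig(out_path)
    plt.close(fig)
```

With made-up data, a call such as plot_combined([(100, 9.0), (200, 6.5)], [(200, 7.1)], "combined.png") would write a single PNG; because twinx() shares the x axis, zooming or rescaling x affects both curves together.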