huggingface / pixparse

Pixel Parsing. A reproduction of OCR-free end-to-end document understanding models with open data

Add basic eval_step() on training samples #1

Closed molbap closed 1 year ago

molbap commented 1 year ago

This PR adds the method .eval_step() for Cruller pretraining.
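
Roughly, the shape of such a step (a sketch only; the sample layout and loss wiring here are assumptions, not the actual PR code):

```python
import torch

def eval_step(self, sample):
    """Forward pass on a validation sample, no gradients.

    `sample` is assumed to be an (image_tensor, text_target) pair here;
    the real Cruller task may structure batches differently.
    """
    image, text_target = sample
    self.model.eval()
    with torch.no_grad():
        output = self.model(image, text_target)
        loss = self.loss_fn(output, text_target)  # hypothetical loss hook
    return {"loss": loss.item()}
```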

rwightman commented 1 year ago

@molbap overall the OCR eval code looks good, and it's nice that we can leverage the wer/cer calcs from that other lib. The main theme of my code comments: it would be best to pull the eval code out into a separate module, reusable from the train task (for the OCR metric) or from eval tasks. Most of it looks like it could be factored into free functions usable across many different model variations; only a small bit is model/task dependent.

The above would be usable right away via an eval app for evaluating checkpoints on different datasets, so I feel it's probably a worthwhile change.
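
One possible shape for those free functions (a sketch; the thread doesn't name the metrics library, jiwer is assumed here):

```python
import jiwer  # assumed; the thread only says "that other lib"

def get_ocr_metrics(predictions: list[str], references: list[str]) -> dict:
    """Word/character error rate over decoded text.

    A free function, so it can be reused from the train task (inline
    OCR metric) or from any eval task, regardless of model variant.
    """
    return {
        "wer": jiwer.wer(references, predictions),
        "cer": jiwer.cer(references, predictions),
    }
```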

molbap commented 1 year ago

I refactored tokenizers out of models and pulled the OCR-related metrics out into a new utils file (in a utils dir); I'm still working on moving everything out for a true eval_step. I'm done with the loaders for FUNSD and IDL and am now adding a TaskCrullerOCREval that we can call directly. I see two ways to interface with it (not mutually exclusive for now):

1. As a standalone app:

   ```
   python -m pixparse.app.eval --data.eval.source "pipe: aws .... eval-dataset-on-s3" --model-name <name> --checkpoint-url <expandable-url-of-checkpoints-to-run>
   ```

2. And/or, as you describe, we call evaluate() after an interval when it's done.

What would not feel natural to me would be calling something like `python -m pixparse.app.train --data.eval.source ""` without specifying train, just to do eval. LMK, either way I'll complete the Task framework for eval :)

rwightman commented 1 year ago

We want both 1 & 2 for sure. This was part of the design I had hoped to figure out before the baby arrived, but someone had other plans...

Thinking about it more, I feel it's cleaner to have all eval-oriented code in eval tasks as mentioned, and not have any eval_step in the train task.

For #1 (eval app), eval tasks would get created and passed the pretrained model (or a model with a loaded checkpoint) & tokenizer.

For #2, have a 'prepare' method on the train task that returns 1..N eval tasks, possibly keyed by dataset or dataset type so that they can be matched up. 'prepare' would either return new instances of eval tasks, with the model/tokenizer passed from the train task, or could cache the instances and re-use them.

Ideally for 1 & 2 the eval task would have minimal to no differences: it gets the model & tokenizer & any needed config passed either via the eval app or from the train task.
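
As a rough sketch of that design (all class and attribute names here are hypothetical):

```python
class TaskCrullerOCREval:
    """Eval task; built the same way whether it comes from the eval
    app (checkpoint loaded there) or from a train task's prepare()."""
    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer


class TaskCrullerPretrain:
    def prepare_eval_tasks(self):
        """Return 1..N eval tasks keyed by dataset name, sharing this
        task's model & tokenizer; instances are cached and re-used."""
        if getattr(self, "_eval_tasks", None) is None:
            self._eval_tasks = {
                name: TaskCrullerOCREval(self.model, self.tokenizer)
                for name in getattr(self, "eval_dataset_names", [])
            }
        return self._eval_tasks
```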

Thinking a bit more about your original eval (inline with training): that's still great to have, but it's a training metric. Since it's a 'metric' in relation to eval as well, maybe instead of a utils file we have a metrics file, as 'ocr metrics' would be just one subset of the possibilities...

rwightman commented 1 year ago

Also, keeping eval and train tasks separate could simplify the interface so that any Task() has just one setup() and one step() method, since we don't need to differentiate train/eval anymore. This can be done trivially later, though; just something to keep in mind.
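
i.e. something along these lines (a sketch only):

```python
from abc import ABC, abstractmethod

class Task(ABC):
    """Single interface once train and eval are separate task classes:
    one setup() and one step(), no train/eval differentiation."""

    @abstractmethod
    def setup(self, *args, **kwargs):
        ...

    @abstractmethod
    def step(self, sample):
        ...
```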

molbap commented 1 year ago

Added src/pixparse/app/eval.py, which depends on the OCR eval task src/pixparse/task/task_cruller_eval_ocr.py (WIP), so that we can launch an eval task with:

```
python -m pixparse.app.eval \
  --data.eval.source "pipe:aws s3 cp s3://...<eval tar shards such as FUNSD>" \
  --data.eval.num-samples 800000 \
  --data.eval.batch-size 16 \
  --data.eval.num-workers 8 \
  --model-name cruller_large \
  --task.dtype bfloat16 \
  --eval.checkpoint-path <checkpoint-path>/checkpoint-X.pt \
  --output-dir <output_metrics_path>
```

It will return OCR metrics in a dictionary such that `metrics['FUNSD'] = {'cer': xx, 'wer': xx}`. For now it returns this for each iteration. Also, the evaluate() method depends only on loaders, not on tasks; I couldn't figure out yet which design was best, so I focused on functionality.
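
For illustration, the aggregation might look roughly like this (a sketch under the loader-only design; `generate`/`batch_decode` are assumed decoding hooks, not the actual pixparse API):

```python
import torch

def evaluate(model, tokenizer, loaders: dict) -> dict:
    """Run each eval loader and collect per-dataset OCR metrics,
    e.g. metrics['FUNSD'] = {'cer': ..., 'wer': ...}."""
    metrics = {}
    for name, loader in loaders.items():
        preds, refs = [], []
        for images, texts in loader:  # assumed sample layout
            with torch.no_grad():
                token_ids = model.generate(images)  # hypothetical decode call
            preds.extend(tokenizer.batch_decode(token_ids))
            refs.extend(texts)
        metrics[name] = get_ocr_metrics(preds, refs)  # helper sketched earlier
    return metrics
```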

The conflict is because task_cruller_pretrain has been moved to a subdir for Cruller-related tasks and gh is not happy with that. I'm pushing so you can follow along, and also because the cluster is full and I can't run much there (I need to set up my local machine properly to run clean evals). LMK!

molbap commented 1 year ago

Solved the conflict by reverting the task paths; will rename them later.