huggingface / pixparse

Pixel Parsing. A reproduction of OCR-free end-to-end document understanding models with open data

Refactor docvqa #37

Closed molbap closed 9 months ago

molbap commented 10 months ago

This brings the DocVQA finetuning task and the associated eval task in line with the recent updates that factored out common task utils. I also added Donut-specific preprocessing. One pre-release TODO remains, I would say.

Experiments for finetuning on DocVQA on this can run on

python -m pixparse.app.train \
  --task cruller_finetune_docvqa \
  --train-data.source SinglePageDocVQA \
  --train-data.format hf_dataset \
  --train-data.split train \
  --train-data.batch-size 16 \
  --train-data.num-samples 44812 \
  --train-data.num-workers 8 \
  --model.name cruller_swin_384_to_1920 \
  --text-max-length 128 \
  --clip-grad-value 0.25 \
  --clip-grad-mode norm \
  --learning-rate 3e-5 \
  --grad-accum-steps 1 \
  --betas 0.9 0.95 \
  --image-transforms "basic" \
  --dtype bfloat16 \
  --num-intervals 300 \
  --num-warmup-intervals 2 \
  --checkpoint-path ../20231023-094042-task_cruller_pretrain-model_cruller_swin_384_to_1920-lr_3.0e-05-b_16/checkpoint-29.pt \
  --output-checkpoint-dir /fsx/pablo/training_pixparse/ \
  --output-dir /fsx/pablo/training_pixparse/outputs/ \
  --tensorboard True \
  --log-eval-data False \
  --wandb False \
  --log-filename out.log

and for eval

python -m pixparse.app.eval \
  --task cruller_eval_docvqa \
  --source SinglePageDocVQA \
  --format hf_dataset \
  --text-max-length 128 \
  --split val \
  --batch-size 6 \
  --num-samples 5349 \
  --num-workers 8 \
  --model.name cruller_swin_384_to_1920 \
  --dtype bfloat16 \
  --num-intervals 100 \
  --checkpoint-path ...20231110-104215-task_cruller_finetune_docvqa-model_cruller_swin_384_to_1920-lr_3.0e-05-b_16/checkpoint-299.pt \
  --output-dir /fsx/pablo/training_pixparse/outputs/

where the checkpoint comes from the finished finetuning experiment.

molbap commented 10 months ago

The idea would be to have a cleaner task setup for finetune/eval and refactor the other tasks, eval_ocr in priority, to be aligned.

rwightman commented 9 months ago

@molbap looks like something got fairly hosed in the init of the docvqa eval task, there's a bunch of repeated code...

molbap commented 9 months ago

Should be better. To handle passing args to the task_cls init depending on the task (some need checkpoint_path, some resume, etc.), I added the helper below, which collects only the non-default arguments and passes them, unpacked, to task_cls. Instead of filtering on non-defaults, a better way could be to restrict the argument scope per task_cls, but I don't see how to do that without increasing the maintenance burden across all existing tasks, so the current solution is:

from dataclasses import MISSING, fields


def get_selected_non_default_args(dataclass_instance, arg_names):
    """
    Extracts a subset of non-default arguments from a dataclass instance.

    This checks a specified list of argument names on a given instance.
    It returns a dictionary of arguments that are not set to their default values.

    Parameters:
    - dataclass_instance: An instance of a dataclass from which to extract arguments.
    - arg_names: A list of strings representing the names of the arguments to be considered.

    Returns:
    - A dictionary containing key-value pairs of argument names and their values,
      for those arguments that are not set to their default values.
    """
    selected_non_default_args = {}
    for field in fields(dataclass_instance.__class__):
        if field.name in arg_names:
            value = getattr(dataclass_instance, field.name)
            default_value = field.default
            if field.default_factory is not MISSING:
                default_value = field.default_factory()

            if value != default_value:
                selected_non_default_args[field.name] = value

    return selected_non_default_args
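For illustration, here is a minimal self-contained sketch of how the helper behaves. The `EvalCfg` dataclass and its fields are hypothetical stand-ins, not the real pixparse config classes; the helper is reproduced so the snippet runs on its own.

```python
from dataclasses import MISSING, dataclass, field, fields


def get_selected_non_default_args(dataclass_instance, arg_names):
    # Same helper as above, reproduced so this sketch is standalone.
    selected = {}
    for f in fields(dataclass_instance.__class__):
        if f.name in arg_names:
            value = getattr(dataclass_instance, f.name)
            default_value = f.default
            if f.default_factory is not MISSING:
                # Fields with default_factory (e.g. lists) need the factory
                # called to obtain the comparable default value.
                default_value = f.default_factory()
            if value != default_value:
                selected[f.name] = value
    return selected


# Hypothetical task config; real pixparse task configs have more fields.
@dataclass
class EvalCfg:
    checkpoint_path: str = ""
    resume: bool = False
    batch_size: int = 16
    betas: list = field(default_factory=lambda: [0.9, 0.95])


cfg = EvalCfg(checkpoint_path="checkpoint-299.pt", batch_size=6)

# Only names listed in arg_names are considered, and of those only the
# ones whose value differs from the dataclass default are returned.
kwargs = get_selected_non_default_args(
    cfg, ["checkpoint_path", "resume", "batch_size"]
)
print(kwargs)  # {'checkpoint_path': 'checkpoint-299.pt', 'batch_size': 6}
```

The returned dict can then be splatted into the task constructor, e.g. `task_cls(**kwargs)`, so each task only receives the arguments the caller actually set.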