pixparse
An open reproduction of OCR-free end-to-end document understanding models with open data.
Broadly focused on these model types:
The training objectives and pretraining datasets will also be inspired by the associated papers above, but will mix and match. For example, we may train a Donut or PaLI-X style model with a Pix2Struct objective (masked document images w/ simplified HTML target text).
To launch a pretraining Cruller task on IDL data, pass arguments in the scopes shown in the command below. The task-name argument selects which task to run, in this case cruller_pretrain.
python -m pixparse.app.train \
--task-name cruller_pretrain \
--data.train.source "pipe:aws s3 cp s3://url-to-IDL-webdataset-shards/idl_shard-00{000..699}.tar -" \
--data.train.batch-size 8 \
--data.train.num-samples 800000 \
--data.train.num-workers 8 \
--task.model-name cruller_large \
--task.opt.clip-grad-value 1.0 \
--task.opt.clip-grad-mode norm \
--task.opt.learning-rate 3e-4 \
--task.opt.grad-accum-steps 1 \
--task.opt.betas 0.9 0.98 \
--task.dtype bfloat16 \
--task.num-intervals 30 \
--task.num-warmup-intervals 3 \
--train.checkpoint-dir <your_checkpoint_dir> \
--train.output-dir <where logs and tb files are created> \
--train.experiment awesome_experiment \
--train.tensorboard True \
--train.log-eval-data False \
--train.wandb False \
--train.log-filename out.log
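The dotted prefixes in the command above group flags by scope (data.train, task, task.opt, train). As a rough illustration of that grouping only, and not pixparse's actual argument parser, the sketch below folds dotted flags into a nested dictionary. The helper name `nest_dotted_args` is hypothetical, and values are kept as strings rather than converted to typed config fields.

```python
from typing import Any, Dict, List


def nest_dotted_args(argv: List[str]) -> Dict[str, Any]:
    """Fold flags such as --task.opt.learning-rate 3e-4 into nested dicts.

    Hypothetical helper for illustration; pixparse's real CLI parsing may differ.
    """
    config: Dict[str, Any] = {}
    i = 0
    while i < len(argv):
        flag = argv[i]
        if not flag.startswith("--"):
            raise ValueError(f"expected a flag, got {flag!r}")
        # Collect every token up to the next flag as this flag's value(s).
        values: List[str] = []
        i += 1
        while i < len(argv) and not argv[i].startswith("--"):
            values.append(argv[i])
            i += 1
        # Everything before the last dot is a scope; the rest is the option name.
        *scopes, name = flag[2:].split(".")
        node = config
        for scope in scopes:
            node = node.setdefault(scope, {})
        # A single value stays a string; several values (e.g. betas) become a list.
        node[name.replace("-", "_")] = values[0] if len(values) == 1 else values
    return config


args = [
    "--task-name", "cruller_pretrain",
    "--data.train.batch-size", "8",
    "--task.opt.learning-rate", "3e-4",
    "--task.opt.betas", "0.9", "0.98",
]
config = nest_dotted_args(args)
print(config)
```

Read this way, `--task.opt.betas 0.9 0.98` lands under the optimizer settings of the task, while `--data.train.batch-size 8` configures only the training data loader.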
To launch evaluation on existing checkpoints, use a Cruller eval task, e.g. on the FUNSD dataset. The task-name argument selects which task to run. donut_eval_ocr, for instance, runs Donut as an OCR engine on the chosen dataset and does not need external checkpoints.
python -m pixparse.app.eval \
--eval.task-name cruller_eval_ocr \
--data.eval.source "pipe:aws s3 cp s3://.../FUNSD/FUNSD-000000.tar -" \
--data.eval.num-samples 200 \
--data.eval.batch-size 16 \
--data.eval.num-workers 8 \
--model-name cruller_large_6layers \
--task.dtype bfloat16 \
--s3-bucket pixparse-exps \
--resume True \
--eval.checkpoint-path 20230629-231529-model_cruller_large-lr_0.0003-b_12/checkpoints/checkpoint-29.pt \
--output-dir /fsx/pablo/
Metrics will be saved under output_dir, with a name derived from the checkpoint used.
To finetune a pretrained pixparse model on RVLCDIP JSON completion:
python -m pixparse.app.train \
--task-name cruller_finetune_rvlcdip \
--data.train.source aharley/rvl_cdip \
--data.train.format hf_dataset \
--data.train.split train \
--data.train.batch-size 32 \
--data.train.num-samples 320000 \
--data.train.num-workers 8 \
--model-name cruller_base \
--task.opt.clip-grad-value 1.0 \
--task.opt.clip-grad-mode norm \
--task.opt.learning-rate 1e-4 \
--task.opt.grad-accum-steps 1 \
--task.opt.betas 0.9 0.99 \
--task.dtype bfloat16 \
--task.num-intervals \
--task.num-warmup-intervals 1 \
--train.resume True \
--train.checkpoint-path /fsx/pablo/training_pixparse/cruller_Aug11th_base_30/checkpoint-8.pt \
--train.output-checkpoint-dir /fsx/pablo/training_pixparse/ \
--train.output-dir /fsx/pablo/training_pixparse/outputs/ \
--train.tensorboard True \
--train.log-eval-data False \
--train.wandb False \
--train.log-filename out.log
To evaluate a model finetuned on RVLCDIP:
python -m pixparse.app.eval \
--task-name cruller_eval_rvlcdip \
--data.eval.source aharley/rvl_cdip \
--data.eval.format hf_dataset \
--data.eval.split test \
--data.eval.num-samples 40000 \
--data.eval.batch-size 16 \
--data.eval.num-workers 8 \
--model-name cruller_base \
--task.dtype bfloat16 \
--output-dir /fsx/pablo/metrics_finetune \
--eval.checkpoint-path "/fsx/pablo/training_pixparse/20230823-151033-task_cruller_finetune_rvlcdip-model_cruller_base-lr_1.0e-04-b_32/checkpoint-4.pt"
This will write the accuracy metrics to the metrics_finetune directory.
To finetune a model on the CORD dataset:
python -m pixparse.app.train \
--task-name cruller_finetune_cord \
--data.train.source naver-clova-ix/cord-v2 \
--data.train.format hf_dataset \
--data.train.split train \
--data.train.batch-size 32 \
--data.train.num-samples 800 \
--data.train.num-workers 8 \
--model-name cruller_base \
--task.opt.clip-grad-value 1.0 \
--task.opt.clip-grad-mode norm \
--task.opt.learning-rate 3e-4 \
--task.opt.grad-accum-steps 1 \
--task.opt.betas 0.9 0.99 \
--task.opt.layer-decay 0.75 \
--task.dtype bfloat16 \
--task.num-intervals 30 \
--task.num-warmup-intervals 3 \
--train.resume True \
--train.checkpoint-path /fsx/pablo/training_pixparse/cruller_Aug11th_base_30/checkpoint-8.pt \
--train.output-checkpoint-dir /fsx/pablo/training_pixparse/ \
--train.output-dir /fsx/pablo/training_pixparse/outputs/ \
--train.tensorboard True \
--train.log-eval-data False \
--train.wandb False \
--train.log-filename out.log
To evaluate a model on the CORD dataset:
python -m pixparse.app.eval \
--task-name cruller_eval_cord \
--data.eval.source naver-clova-ix/cord-v2 \
--data.eval.format hf_dataset \
--data.eval.split test \
--data.eval.num-samples 100 \
--data.eval.batch-size 16 \
--data.eval.num-workers 8 \
--model-name cruller_base \
--task.dtype bfloat16 \
--output-dir /fsx/pablo/metrics_finetune \
--eval.checkpoint-path /fsx/pablo/training_pixparse/20230830-133114-task_cruller_finetune_cord-model_cruller_base-lr_3.0e-05-b_8/checkpoint-29.pt
2023-06-14
2023-06-12
python -m pixparse.app.train --train.source "/data/cc12m/cc12m-train-{0000..xxxx}.tar" --train.batch-size 8 --train.num-samples 10000000 --learning-rate 1e-4 --clip-grad-value 1.0 --clip-grad-mode norm --grad-accum-steps 4
Within src/pixparse:
app/ - CLI applications for training and evaluation
app/train.py - main training CLI entrypoint, will attempt to keep useable across tasks
app/eval.py - (TODO) main evaluation CLI entrypoint
app/finetune.py - (TBD) fine-tune is handled by train.py with different args/config or separate?
data/ - data loaders, image and text preprocessing
framework/ - lightweight train & evaluation scaffolding on top of canonical PyTorch
layers/ - custom nn.Modules and functions for re-usable modelling components
models/ - modelling code with associated factory methods and helpers
task/ - task wrappers for various objectives (model + loss fn + pre/post-processing + optimization nuances)
tokenizer/ - tokenizer helpers (push into data?)
utils/ - misc utils that don't have a home

Some terms and concepts used in this project that may be a bit unfamiliar.
A task is a key organizing concept in this project: it packages the model with its loss, pre/post-processing, and optimization setup for a given objective.
Examples of tasks conceptually:
With the Task concept, the data pipeline exists outside the task. Samples and targets are fed into the task via the step functions. The data pipeline is coupled to the task by passing the pre-processing functions created within the task to the data pipeline on creation.
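The coupling described above can be sketched in a few lines. This is a toy stand-in under stated assumptions, not pixparse's task API: `ToyTask`, `train_step`, and `make_pipeline` are hypothetical names, and a single scalar weight replaces a real model so the example runs without PyTorch.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Tuple


@dataclass
class ToyTask:
    """Hypothetical task: bundles a model, its loss, and its pre-processing."""

    weight: float = 0.0          # stand-in for model parameters
    learning_rate: float = 0.1   # stand-in for the optimizer setup

    def preprocess(self, raw: str) -> float:
        # Defined by the task, but executed inside the data pipeline.
        return float(raw.strip())

    def train_step(self, sample: float, target: float) -> float:
        # Forward pass, squared-error loss, and parameter update all live in the task.
        error = self.weight * sample - target
        loss = error ** 2
        grad = 2.0 * error * sample
        self.weight -= self.learning_rate * grad
        return loss


def make_pipeline(
    rows: Iterable[Tuple[str, str]],
    preprocess: Callable[[str], float],
) -> Iterator[Tuple[float, float]]:
    # The pipeline lives outside the task; it only receives the task's
    # pre-processing function when it is created.
    for raw_sample, raw_target in rows:
        yield preprocess(raw_sample), preprocess(raw_target)


task = ToyTask()
rows = [(" 1.0", "2.0 "), ("2.0", "4.0"), ("1.0", "2.0")] * 20
pipeline = make_pipeline(rows, task.preprocess)

# Samples and targets reach the task only through its step function.
losses = [task.train_step(sample, target) for sample, target in pipeline]
print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.6f}, weight {task.weight:.3f}")
```

Swapping the objective then means swapping the task, while the pipeline code stays the same and simply receives a different pre-processing function.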
You'll see the term 'interval' in the code, sometimes next to 'epoch'. The two are related, but an epoch means 'one complete pass of the dataset', whereas an interval may or may not be an epoch. An interval is a span of training between checkpoints, ideally meaningful enough in duration to warrant evaluating and archiving each interval checkpoint.
In OpenCLIP development the term arose when using shard sampling with replacement, where the intervals between checkpoints were determined by limitations on job durations or the likelihood of crashes.
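To make the arithmetic concrete, here is a minimal sketch of how an interval could translate into optimizer steps. It assumes that num-samples is the number of samples drawn per interval and that batch-size is per process; both are assumptions about the flags, not confirmed pixparse behaviour. `IntervalSchedule` and `world_size` are hypothetical names. The numbers come from the pretraining command earlier in this README.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IntervalSchedule:
    """Hypothetical helper: treats num_samples as the samples drawn per interval."""

    num_samples: int            # samples per interval (assumed meaning)
    batch_size: int             # per-process batch size (assumed meaning)
    num_intervals: int
    num_warmup_intervals: int
    grad_accum_steps: int = 1
    world_size: int = 1         # number of data-parallel processes (assumption)

    @property
    def steps_per_interval(self) -> int:
        # One optimizer step consumes a batch per process per accumulation step.
        samples_per_step = self.batch_size * self.grad_accum_steps * self.world_size
        return self.num_samples // samples_per_step

    @property
    def total_steps(self) -> int:
        return self.steps_per_interval * self.num_intervals

    @property
    def warmup_steps(self) -> int:
        return self.steps_per_interval * self.num_warmup_intervals

    def checkpoint_steps(self) -> List[int]:
        # A checkpoint is written at the end of every interval.
        return [self.steps_per_interval * (i + 1) for i in range(self.num_intervals)]


schedule = IntervalSchedule(
    num_samples=800_000,
    batch_size=8,
    num_intervals=30,
    num_warmup_intervals=3,
)
print(schedule.steps_per_interval, schedule.total_steps, schedule.warmup_steps)
```

Under these assumptions the interval length is set by num-samples rather than by the dataset size, which is what lets an interval differ from an epoch.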