huggingface / pixparse

Pixel Parsing. A reproduction of OCR-free end-to-end document understanding models with open data

Pablo/task finetune rvlcdip #13

Closed: molbap closed this 1 year ago

molbap commented 1 year ago

This PR adds a finetuning task to pixparse. It targets a simple document classification task on RVL-CDIP.

New arguments are added to app/train.py:

- train.resume: boolean; whether to resume from an existing experiment.
- train.checkpoint_path: path to the checkpoint of the experiment to resume from.
- data.train.format: dataset format. Defaults to "webdataset"; set it to "hf_dataset" to load a dataset through the Hugging Face datasets library.
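
As a rough illustration of how the three options listed above relate to each other, here is a minimal sketch. The `TrainArgs` class and its validation rules are hypothetical and are not taken from app/train.py; only the option names and the two format values come from this PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TrainArgs:
    """Hypothetical container for the three options described in this PR."""

    resume: bool = False                    # resume from an existing experiment?
    checkpoint_path: Optional[str] = None   # checkpoint to resume from
    format: str = "webdataset"              # or "hf_dataset"

    def __post_init__(self) -> None:
        # Assumed sanity checks; the real train.py may behave differently.
        if self.format not in ("webdataset", "hf_dataset"):
            raise ValueError(f"unknown format: {self.format!r}")
        if self.resume and not self.checkpoint_path:
            raise ValueError("resume=True needs a checkpoint_path")
```

Under these assumptions, `TrainArgs()` yields the webdataset default, while `TrainArgs(resume=True)` without a checkpoint path is rejected early instead of failing later during loading.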

A classification finetuning run on RVL-CDIP can then be launched with:

python -m pixparse.app.train \
  --task-name cruller_finetune \
  --data.train.source aharley/rvl_cdip \
  --data.train.format hf_dataset \
  --data.train.batch-size 64 \
  --data.train.num-samples 320000 \
  --data.train.num-workers 8 \
  --model-name cruller_base \
  --task.opt.clip-grad-value 1.0 \
  --task.opt.clip-grad-mode norm \
  --task.opt.learning-rate 3e-5 \
  --task.opt.grad-accum-steps 1 \
  --task.opt.betas 0.9 0.99 \
  --task.dtype bfloat16 \
  --task.num-intervals 30 \
  --task.num-warmup-intervals 3 \
  --train.resume True \
  --train.checkpoint-path <checkpoint to resume from> \
  --train.output-checkpoint-dir /fsx/pablo/training_pixparse/ \
  --train.output-dir <output_dir> \
  --train.experiment <experiment_name> \
  --train.tensorboard True \
  --train.log-eval-data False \
  --train.wandb False \
  --train.log-filename out.log
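
To get a feel for the scale of this run, the arithmetic behind the command can be sketched as follows. This assumes that one interval is one pass over `num-samples`, and that the batch size of 64 is the global batch size (single process); neither assumption is confirmed by the PR, and the helper name `schedule_summary` is made up for illustration.

```python
import math


def schedule_summary(num_samples, batch_size, grad_accum_steps,
                     num_intervals, num_warmup_intervals):
    """Back-of-the-envelope step counts (assumes interval == one data pass)."""
    batches_per_interval = math.ceil(num_samples / batch_size)
    # One optimizer update every `grad_accum_steps` batches.
    updates_per_interval = batches_per_interval // grad_accum_steps
    return {
        "updates_per_interval": updates_per_interval,
        "total_updates": updates_per_interval * num_intervals,
        "warmup_updates": updates_per_interval * num_warmup_intervals,
    }


summary = schedule_summary(
    num_samples=320_000, batch_size=64, grad_accum_steps=1,
    num_intervals=30, num_warmup_intervals=3,
)
print(summary)
```

With the values from the command, this gives 5,000 updates per interval, 150,000 updates in total, and 15,000 warmup updates. With several processes and a per-process batch size, the counts would shrink accordingly.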

loader.py is modified to support datasets that are neither in webdataset format nor stored on S3, namely Hugging Face datasets loaded through the datasets library. It uses LoaderBundle from chug: https://github.com/huggingface/chug/blob/cfb16882e1058b37871b61fe8f76830cef3d8750/src/chug/common/types.py#L19. Eventually this logic should move into chug itself: https://github.com/huggingface/chug/issues/2.
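
The format switch described above can be pictured as a small dispatch on the format string. The sketch below is not the code in loader.py: `build_dataset`, `_load_webdataset`, `_load_hf_dataset` and the `builders` parameter are hypothetical names, and wrapping the result in a chug LoaderBundle is left out. The only real library call is `datasets.load_dataset`.

```python
from typing import Callable, Dict, Iterable


def _load_webdataset(source: str) -> Iterable:
    # Placeholder standing in for the existing webdataset/S3 code path.
    raise NotImplementedError("webdataset path not sketched here")


def _load_hf_dataset(source: str, split: str = "train") -> Iterable:
    # Imported lazily so the webdataset path does not need `datasets`.
    from datasets import load_dataset
    return load_dataset(source, split=split)


_BUILDERS: Dict[str, Callable[[str], Iterable]] = {
    "webdataset": _load_webdataset,
    "hf_dataset": _load_hf_dataset,
}


def build_dataset(source: str, data_format: str = "webdataset",
                  builders: Dict[str, Callable[[str], Iterable]] = _BUILDERS):
    """Pick a dataset builder from the format string (hypothetical helper)."""
    try:
        builder = builders[data_format]
    except KeyError:
        raise ValueError(f"unsupported data format: {data_format!r}") from None
    return builder(source)
```

Keeping the builders in a dictionary means that a further format would only need one more entry, and it lets the dispatch be exercised with stand-in builders that never touch the network.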

molbap commented 1 year ago

This is still in progress, I'm running tests on

molbap commented 1 year ago

I think this one is ready, it should close the other one as well, no conflicts afaik. This adds

@rwightman if you find weird things here, lmk, I think we're good to merge in current state. Notes: