Pablo/task finetune rvlcdip

molbap commented 1 year ago

This PR adds a finetuning task for pixparse. It focuses on the simple document classification task on RVLCDIP.

New args are added to app/train.py.

- resume: boolean, whether to resume from an experiment or not.
- checkpoint_path: checkpoint of the experiment to resume from.
- train.format: data format; defaults to "webdataset", but can be switched to "hf_dataset" (passed as --data.train.format in the command below).
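
A minimal sketch of how these three options could fit together. The class name, the defaults for resume and checkpoint_path, and the validation rules are assumptions for illustration, not the actual pixparse config classes:

```python
from dataclasses import dataclass
from typing import Optional

# Formats named in the list above.
VALID_FORMATS = ("webdataset", "hf_dataset")


@dataclass
class TrainArgsSketch:
    """Hypothetical container mirroring the three new options."""

    resume: bool = False                   # default is an assumption
    checkpoint_path: Optional[str] = None  # default is an assumption
    format: str = "webdataset"             # default stated in the PR description

    def __post_init__(self) -> None:
        if self.format not in VALID_FORMATS:
            raise ValueError(
                f"format must be one of {VALID_FORMATS}, got {self.format!r}"
            )
        # Assumption: resuming is only meaningful when a checkpoint is given.
        if self.resume and not self.checkpoint_path:
            raise ValueError("resume=True requires checkpoint_path")
```

With this sketch, TrainArgsSketch(resume=True) raises a ValueError, while TrainArgsSketch(resume=True, checkpoint_path="ckpt.pt", format="hf_dataset") is accepted.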

Then, a finetuning classification task on rvlcdip can be launched as:

python -m pixparse.app.train \
  --task-name cruller_finetune \
  --data.train.source aharley/rvl_cdip \
  --data.train.format hf_dataset \
  --data.train.batch-size 64 \
  --data.train.num-samples 320000 \
  --data.train.num-workers 8 \
  --model-name cruller_base \
  --task.opt.clip-grad-value 1.0 \
  --task.opt.clip-grad-mode norm \
  --task.opt.learning-rate 3e-5 \
  --task.opt.grad-accum-steps 1 \
  --task.opt.betas 0.9 0.99 \
  --task.dtype bfloat16 \
  --task.num-intervals 30 \
  --task.num-warmup-intervals 3 \
  --train.resume True \
  --train.checkpoint-path <checkpoint to resume from> \
  --train.output-checkpoint-dir /fsx/pablo/training_pixparse/ \
  --train.output-dir <output_dir> \
  --train.experiment <experiment_name> \
  --train.tensorboard True \
  --train.log-eval-data False \
  --train.wandb False \
  --train.log-filename out.log

loader.py is modified to support datasets that are neither in webdataset format nor stored on S3, namely Hugging Face datasets from the datasets library. This uses chug's LoaderBundle https://github.com/huggingface/chug/blob/cfb16882e1058b37871b61fe8f76830cef3d8750/src/chug/common/types.py#L19. Eventually this should be moved under chug https://github.com/huggingface/chug/issues/2.
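
A rough sketch of the kind of dispatch on the data format that this change implies. All function names here are hypothetical placeholders, and the builders return dummy samples rather than real loaders; this is not the code in loader.py:

```python
from typing import Callable, Dict, Iterator


def build_webdataset_loader(source: str) -> Iterator[dict]:
    # Placeholder: a real builder would stream samples from tar shards.
    return iter([{"source": source, "kind": "webdataset"}])


def build_hf_dataset_loader(source: str) -> Iterator[dict]:
    # Placeholder: a real builder would probably go through the
    # datasets library (e.g. datasets.load_dataset(source)).
    return iter([{"source": source, "kind": "hf_dataset"}])


# Map each supported format string to its (hypothetical) builder.
LOADER_BUILDERS: Dict[str, Callable[[str], Iterator[dict]]] = {
    "webdataset": build_webdataset_loader,
    "hf_dataset": build_hf_dataset_loader,
}


def create_loader_sketch(source: str, data_format: str = "webdataset") -> Iterator[dict]:
    """Pick a loader builder from the requested data format."""
    try:
        builder = LOADER_BUILDERS[data_format]
    except KeyError:
        raise ValueError(
            f"unknown data format {data_format!r}; expected one of {sorted(LOADER_BUILDERS)}"
        ) from None
    return builder(source)
```

Keeping the mapping in one dictionary means a new format only needs a new builder and one extra entry.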

molbap commented 1 year ago

This is still in progress. I'm running tests on:

- RVLCDIP: xent
- RVLCDIP: json prediction
- CORD: json prediction
- train ticket: json prediction

so that the library can populate a performance board on standard benchmarks easily enough.
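
To make the difference between the "xent" and "json prediction" setups on RVLCDIP concrete, here is a small sketch. The "class" key, the helper names, and the three-class subset are assumptions for illustration (RVL-CDIP itself has 16 classes); the actual target format used in pixparse may differ:

```python
import json
from typing import List, Optional

# Illustrative subset of class names, not the full label set.
CLASS_NAMES = ["letter", "form", "email"]


def xent_target(label_name: str) -> int:
    """Cross-entropy setup: the target is a class index."""
    return CLASS_NAMES.index(label_name)


def json_target(label_name: str) -> str:
    """JSON-prediction setup: the target is a string the decoder must generate."""
    return json.dumps({"class": label_name})


def label_from_json(text: str) -> Optional[str]:
    """Recover a class name from generated text; None if it cannot be parsed."""
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(parsed, dict):
        return None
    value = parsed.get("class")
    return value if value in CLASS_NAMES else None


def accuracy(predicted: List[Optional[str]], expected: List[str]) -> float:
    """Fraction of samples whose predicted label matches the expected one."""
    if not expected:
        return 0.0
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)
```

In the JSON setup, malformed generations simply count as wrong, so both setups can be scored with the same accuracy function.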

molbap commented 1 year ago

I think this one is ready. It should close the other one as well; no conflicts afaik. This adds:

- xent on rvlcdip
- json "prediction" on rvlcdip
- evaluation on rvlcdip (classification accuracy)
- json prediction on CORDv2
- evaluation with tree matching distance and f1 score for CORDv2

I relaunched a cruller_pretrain task as well from this branch and it is running fine.
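
For reference, one common way to compute an F1 score on JSON predictions is to flatten each tree into (path, value) fields and count the overlap. The sketch below follows that idea; it is an assumption about the general approach, not the evaluation code in this branch, it does not cover the tree matching distance, and the field names in the usage note are only illustrative:

```python
from collections import Counter
from typing import Any, Iterator, Tuple

Field = Tuple[Tuple[str, ...], str]


def flatten(node: Any, prefix: Tuple[str, ...] = ()) -> Iterator[Field]:
    """Yield (key path, leaf value) pairs; list items share their parent's path."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from flatten(value, prefix + (key,))
    elif isinstance(node, list):
        for item in node:
            yield from flatten(item, prefix)
    else:
        yield (prefix, str(node))


def field_f1(prediction: Any, reference: Any) -> float:
    """F1 over flattened fields, counting duplicate fields with multiplicity."""
    pred = Counter(flatten(prediction))
    ref = Counter(flatten(reference))
    true_positives = sum((pred & ref).values())  # multiset intersection
    if true_positives == 0:
        return 0.0
    precision = true_positives / sum(pred.values())
    recall = true_positives / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

For example, a prediction that gets four out of five receipt fields right (say, one wrong "price" among two menu items plus a total) has precision and recall of 0.8 each, so field_f1 returns 0.8.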

@rwightman if you find weird things here, lmk. I think we're good to merge in the current state.

Notes:

- One thing that I'm not too happy about is that some part of the tokenization is being handled in the collate_fn rather than in __getitem__ right now, which harms pickling, but changing it would require writing dataset wrappers for each data source of interest.
- We now have to select the data format, "hf_dataset" or "webdataset", mostly for benchmarking purposes.

What is WIP:

- Single-page docVQA is around the corner and will complete the examples nicely. It should follow the same structure as CORD, except that we predict what comes after the special token "". The dataset is not preprocessed, and some code in pixparse-data takes care of that for the time being.
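
As an illustration of the dataset-wrapper alternative mentioned in the note about tokenization, here is a minimal sketch in which tokenization happens per sample and the collate step only pads. The wrapper, the toy tokenizer, and the "text" key are all hypothetical; a real wrapper would need to match each data source's sample layout:

```python
from typing import Callable, Dict, List, Sequence


def toy_tokenize(text: str) -> List[int]:
    # Stand-in for a real tokenizer: one integer (the word length) per word.
    return [len(word) for word in text.split()]


class TokenizedDatasetWrapper:
    """Hypothetical wrapper that tokenizes inside __getitem__."""

    def __init__(
        self,
        dataset: Sequence[Dict[str, str]],
        tokenize: Callable[[str], List[int]],
        text_key: str = "text",
    ) -> None:
        self.dataset = dataset
        self.tokenize = tokenize
        self.text_key = text_key

    def __len__(self) -> int:
        return len(self.dataset)

    def __getitem__(self, index: int) -> dict:
        sample = dict(self.dataset[index])  # copy so the source stays untouched
        sample["input_ids"] = self.tokenize(sample[self.text_key])
        return sample


def pad_collate(batch: List[dict], pad_id: int = 0) -> List[List[int]]:
    """Collate step that only pads; it no longer needs access to the tokenizer."""
    max_len = max(len(sample["input_ids"]) for sample in batch)
    return [
        sample["input_ids"] + [pad_id] * (max_len - len(sample["input_ids"]))
        for sample in batch
    ]
```

The trade-off is the one stated in the note: the collate function becomes a plain module-level function, but each data source needs its own wrapper or key mapping.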

huggingface / pixparse

Pablo/task finetune rvlcdip #13