This repository contains the code, data, and models for the paper Improving Language Understanding from Screenshots. In this paper, we focus on improving the language understanding ability of screenshot LMs (models that process everything, including text, within visual inputs) and propose patch-and-text prediction (PTP), a novel pre-training objective for screenshot LMs.
First, install the latest compatible PyTorch.
Then, install all the required packages by running:
pip install -r requirements.txt
We strongly recommend using the exact same `transformers` and `accelerate` versions for best reproducibility. Please check out the renderer readme to make sure that the renderer is correctly configured.
For our encoder-decoder experiments and the train-from-scratch autoregressive screenshot LM experiments, we use Wikipedia+BookCorpus as the pre-training data. You can find the already-tokenized dataset on this Huggingface page. You can download the data by running:
git clone https://huggingface.co/datasets/princeton-nlp/ptp_data data
This folder contains four files:

- `wikibook_256_opt_tk_train.npy` and `wikibook_256_opt_tk_val.npy`: Wiki+Book tokenized with the OPT tokenizer, 256 tokens per example (for the encoder-decoder models).
- `wikibook_512_llama_tk_train.npy` and `wikibook_512_llama_tk_val.npy`: Wiki+Book tokenized with the LLaMA tokenizer, 512 tokens per example (for the train-from-scratch autoregressive models).

For continuing the training of Sheared-LLaMA to use screenshots, we use Sheared-LLaMA's pipeline for processing RedPajama data. Please follow this guideline to process the data. Our example config uses `./data/sheared-llama-rp/for_ft` for continued pre-training and `./data/sheared-llama-rp/eval` for evaluation.
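Before training, you may want to sanity-check the downloaded `.npy` files. Here is a minimal sketch (not part of the repo) that assumes each file stores a 2-D integer array of token IDs, one row per example, with a fixed sequence length (256 for the OPT-tokenized files, 512 for the LLaMA-tokenized ones):

```python
import numpy as np

def inspect_tokenized_file(path, expected_len):
    """Sanity-check one pre-tokenized data file and return its example count.

    Assumes the file holds a 2-D array of token IDs, one row per example.
    """
    # Memory-map the file so we read metadata without loading all rows into RAM.
    data = np.load(path, mmap_mode="r")
    assert data.ndim == 2 and data.shape[1] == expected_len, \
        f"unexpected shape {data.shape}"
    return data.shape[0]

# Example (the path assumes the `git clone` above):
# inspect_tokenized_file("data/wikibook_256_opt_tk_train.npy", 256)
```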
To reproduce our models, run the following command (requires 8 GPUs):
NUM_GPU=8 bash run_multiple_gpus.sh {CONFIG PATH}
There are three example configs:

- `run_configs/ptp.yaml`: our main PTP model (encoder-decoder).
- `run_configs/screenshot-llama-380m.yaml`: train-from-scratch autoregressive screenshot LM.
- `run_configs/screenshot-llama-1.3b-from-sheared-llama.yaml`: continued pre-training from Sheared-LLaMA.

You can also run the single-GPU script `run_single_gpu.sh` for testing. If you are not using 8 GPUs or your GPUs cannot fit our preset batch sizes, adjust the per-GPU batch size (`per_device_train_batch_size`) or the gradient accumulation steps (`gradient_accumulation_steps`) accordingly to keep the same effective hyperparameters.
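The effective batch size is the product of the GPU count, the per-GPU batch size, and the gradient accumulation steps, so when one factor shrinks, another must grow to compensate. A small sketch of that arithmetic (the function name and reference values below are illustrative, not from the repo's configs):

```python
def required_grad_accum(effective_batch, num_gpus, per_device_batch):
    """Gradient accumulation steps needed to keep the effective batch size.

    effective_batch = num_gpus * per_device_batch * grad_accum_steps
    """
    per_step = num_gpus * per_device_batch
    assert effective_batch % per_step == 0, \
        "effective batch size is not evenly divisible by num_gpus * per_device_batch"
    return effective_batch // per_step

# e.g. an effective batch of 256 on 2 GPUs with per-device batch 8:
# required_grad_accum(256, 2, 8) -> 16
```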
We provide the following pre-trained models on Huggingface:
Coming soon!
If you have any questions related to the paper, feel free to email Tianyu (tianyug@cs.princeton.edu). If you encounter any problems when using the code or want to report a bug, you can open an issue. Please describe the problem in detail so we can help you better and more quickly!
Please cite our paper if you use PTP in your work:
@article{gao2024improving,
title={Improving Language Understanding from Screenshots},
author={Gao, Tianyu and Wang, Zirui and Bhaskar, Adithya and Chen, Danqi},
journal={arXiv preprint arXiv:2402.14073},
year={2024}
}