
Official code release for "Fox: Focus Anywhere for Fine-grained Multi-page Document Understanding".

Fox: Focus Anywhere for Fine-grained Multi-page Document Understanding

Chenglong Liu, Haoran Wei, Jinyue Chen, Lingyu Kong, Zheng Ge, Zining Zhu, Liang Zhao, Jianjian Sun, Chunrui Han, Xiangyu Zhang

Release

Contents

1. Benchmark Data and Evaluation Tool

(2) bilingual region-level OCR: `--gtfile_path cn_box_ocr.json --image_path cn_pdf_png/` (Chinese) or `--gtfile_path en_box_ocr.json --image_path en_pdf_png/` (English)

(3) bilingual line-level OCR: `--gtfile_path cn_line_ocr.json --image_path cn_pdf_png/` (Chinese) or `--gtfile_path en_line_ocr.json --image_path en_pdf_png/` (English)

(4) bilingual color-guided OCR: `--gtfile_path cn_onbox_ocr.json --image_path cn_pdf_png_onbox/` (Chinese) or `--gtfile_path en_onbox_ocr.json --image_path en_pdf_png_onbox/` (English)

(5) region-level translation (English-to-Chinese): `--gtfile_path en_box_translation.json --image_path en_pdf_png/`

(6) region-level summary: `--gtfile_path en_box_summary.json --image_path en_pdf_png/`

(7) in-document figure caption: `--gtfile_path en_page_indoc_caption.json --image_path en_pdf_png_render_laioncoco/`

(8) multi-page multi-region OCR: `--gtfile_path encn-multi-8page-box-ocr.json --image_path pdfpng_encn_multi_8page/`

(9) cross-page VQA: `--gtfile_path encn-multi-8page-cross-vqa.json --image_path pdfpng_encn_multi_8page/`
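
Each ground-truth JSON above follows the format the harness script below consumes: a list of records, each carrying an "image" path (relative to `--image_path`) and a "conversations" pair holding the prompt and the reference answer. A minimal sketch, assuming that layout, to peek at a file before running anything (the file name here is just one of the options above):

```python
import json

# Inspect the first record of one ground-truth file; swap in any
# --gtfile_path value from the task list above.
with open("en_box_summary.json", encoding="utf-8") as f:
    gts = json.load(f)

first = gts[0]
print("image: ", first["image"])                      # file under --image_path
print("prompt:", first["conversations"][0]["value"])  # question fed to the model
print("label: ", first["conversations"][1]["value"])  # reference answer
```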

- For each sub-task, run your model on the corresponding ground-truth file and record the results. We give a simple harness script here:
```python
import argparse
import os
import json

from tqdm import tqdm
from PIL import Image


def load_image(image_file):
    # Load a page image and normalize it to RGB.
    image = Image.open(image_file).convert('RGB')
    return image


def eval_model(args):
    # TODO: build your own model here.
    model = build_your_model()

    # Each record holds the image path, the prompt, and the reference answer.
    with open(args.gtfile_path, encoding="utf-8") as f:
        gts = json.load(f)

    output_list = []
    print("Generate Results......")
    for ann in tqdm(gts):
        prompt = ann["conversations"][0]["value"]
        image_file_path = os.path.join(args.image_path, ann["image"])
        image = load_image(image_file_path)

        # TODO: run your model's inference here.
        outputs = model.generate(image, prompt)

        output_list.append({
            'image': ann["image"],
            'question': prompt,
            'label': ann["conversations"][1]["value"],
            'answer': outputs,
        })

    # Dump predictions alongside their labels for the eval tools below.
    with open(args.out_file, 'w', encoding="utf-8") as file_obj:
        json.dump(output_list, file_obj, ensure_ascii=False, indent=1)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-name", type=str, default="")
    parser.add_argument("--gtfile_path", type=str, required=True)
    parser.add_argument("--image_path", type=str, required=True)
    parser.add_argument("--out_file", type=str, default="./results_final.json")
    args = parser.parse_args()
    print(args)
    eval_model(args)
```
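
The script leaves `build_your_model()` and `model.generate(image, prompt)` as placeholders for your own model. To check the harness and output format end to end before wiring in real inference, you can drop in a trivial stand-in; `DummyModel` below is purely hypothetical and predicts an empty string for every sample:

```python
# Hypothetical stand-in used only to exercise the harness end to end:
# every prediction is an empty string, so all metrics should bottom out.
class DummyModel:
    def generate(self, image, prompt):
        return ""

def build_your_model():
    # Replace with loading your actual document-understanding model.
    return DummyModel()
```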
After `results_final.json` is written, score it with the evaluation tool matching the sub-task:

(1) Calculate BLEU, METEOR, F1-score, Precision, Recall, and Edit Distance:
`python3 eval_tools/eval_ocr_test.py --out_file "./results_final.json"`

(2) Calculate ROUGE:
`python3 eval_tools/eval_summary_test.py --out_file "./results_final.json"`

(3) Calculate Accuracy:
`python3 eval_tools/eval_qa_test.py --out_file "./results_final.json"`
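
Before running the official tools, a quick stdlib-only pass over the dump can catch obvious failures such as empty answers. The character-level similarity below is only a rough proxy, not one of the reported metrics:

```python
import json
from difflib import SequenceMatcher

# Rough sanity check over the prediction dump written by the harness above.
with open("./results_final.json", encoding="utf-8") as f:
    results = json.load(f)

empty = sum(1 for r in results if not r["answer"].strip())
ratios = [SequenceMatcher(None, r["label"], r["answer"]).ratio() for r in results]
mean_ratio = sum(ratios) / max(len(ratios), 1)
print(f"{len(results)} samples, {empty} empty answers, "
      f"mean char-level similarity {mean_ratio:.3f}")
```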

Contact

If you have any questions about the code or the paper, feel free to email us at liuchenglong20@mails.ucas.ac.cn.


Usage and License Notices: The data, code, and checkpoint are intended and licensed for research use only. Their use must also follow the license agreements of Vary and OPT.

Citation

If you find our work useful in your research, please consider citing Fox:

@article{liu2024focus,
  title={Focus Anywhere for Fine-grained Multi-page Document Understanding},
  author={Liu, Chenglong and Wei, Haoran and Chen, Jinyue and Kong, Lingyu and Ge, Zheng and Zhu, Zining and Zhao, Liang and Sun, Jianjian and Han, Chunrui and Zhang, Xiangyu},
  journal={arXiv preprint arXiv:2405.14295},
  year={2024}
}