google-research-datasets / vrdu

We identify the desiderata for a comprehensive benchmark and propose Visually Rich Document Understanding (VRDU). VRDU contains two datasets that represent several challenges: rich schema including diverse data types, complex templates, and diversity of layouts within a single document type.

Bounding Box Normalization for LayoutLMv3 on VRDU Dataset #4

Open gayecolakoglu opened 1 month ago

gayecolakoglu commented 1 month ago

Hello,

I am working with the VRDU dataset, and I am attempting to normalize the bounding boxes for use with LayoutLMv3. In your paper, I see that OCR is used, and bounding box annotations are provided. However, I am having trouble aligning the bounding boxes in the dataset with LayoutLMv3's 0-1000 normalized coordinate system.

The bounding boxes I currently have are represented as: bboxes: [[0, 0.15008625, 0.033492822, 0.23519264, 0.052631579], ...]

I don’t have access to the page height and width to perform proper normalization, which LayoutLMv3 typically requires for converting coordinates. Could you clarify the following:

  1. Is there a way to extract or compute the page dimensions from the VRDU dataset that I may have overlooked?
  2. Would it be possible to normalize these bounding boxes for use with LayoutLMv3 without this information?
  3. Could you provide additional guidance or tools on how to handle the bounding boxes for models like LayoutLMv3?

Thank you for your help!

tniemeier commented 1 month ago

Hey @gayecolakoglu !

I am also currently attempting to use LayoutLMv3 with the VRDU datasets.

Every dataset item provides both the annotations and the full OCR result. The page dimensions aren't part of the annotations, but you can find them in the OCR item.

Some sample code for unnormalizing the bboxes could look like this:

def _unnormalize_bbox(bbox: list[int | float], page_dim: dict[str, int]) -> list[float]:
    # bbox is [page_number, x_min, y_min, x_max, y_max] with coordinates
    # normalized to [0, 1]; scale them back up with the page dimensions
    # taken from the OCR result.
    page_height: int = page_dim["height"]
    page_width: int = page_dim["width"]

    x_min: float = bbox[1]
    y_min: float = bbox[2]
    x_max: float = bbox[3]
    y_max: float = bbox[4]

    return [
        x_min * page_width,
        y_min * page_height,
        x_max * page_width,
        y_max * page_height,
    ]

for item in load_dataset_item('registration-form'):
    for annotation in item['annotations']:
        ner_tag: str = annotation[0]
        appearances: list[list] = annotation[1]
        for appearance in appearances:
            entity_value: str = appearance[0]
            bbox: list[int | float] = appearance[1]
            segments: list[int] = appearance[2]
            bbox_page_number: int = bbox[0]
            # The page dimensions are stored per page in the OCR result.
            unnormalized_bbox: list[float] = _unnormalize_bbox(
                bbox=bbox,
                page_dim=item['ocr']['pages'][bbox_page_number]['dimension'],
            )

Note that load_dataset_item yields one dataset dict item per iteration. I can't vouch for the correctness of my code, as I haven't finished the fine-tuning yet.
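
For reference, load_dataset_item is just my own small helper that streams the gzipped JSONL file of a dataset; a minimal sketch (the file path is only how I laid things out locally) could be:

import gzip
import json
from collections.abc import Iterator

def load_dataset_item(dataset_name: str) -> Iterator[dict]:
    # Every line of the gzipped JSONL is one document with
    # 'filename', 'ocr' and 'annotations' keys.
    path = f"vrdu/{dataset_name}/main/dataset.jsonl.gz"  # hypothetical local layout
    with gzip.open(path, "rt") as jsonl_file:
        for line in jsonl_file:
            yield json.loads(line)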

If you'd like, we could do some pair programming to get LayoutLMv3 running on the VRDU datasets. I'm struggling a bit myself with preprocessing the data correctly and keep running into hurdles.

Kind regards, Thomas

gayecolakoglu commented 1 month ago

Hi @tniemeier,

Thank you for your detailed response! After reviewing it, I realized I had overlooked some details, and now I see the values I was looking for in the OCR items.

Upon further examination of the data structure, I noticed that the bbox values are already scaled between 0 and 1 (code is below). Given this, I was wondering: would it be sufficient to simply multiply them by 1000 for LayoutLMv3?

import gzip
import json

def load_dataset(file_path, limit=None):
    annotations = {}
    ocr_texts = {}
    with gzip.open(file_path, "rt") as jsonl_file:
        for i, line in enumerate(jsonl_file):
            if limit is not None and i >= limit:
                break
            data = json.loads(line)
            filename = data["filename"]
            ocr_texts[filename] = data["ocr"]
            annotations[filename] = data["annotations"]
    return ocr_texts, annotations

ocr_texts, annotations = load_dataset(dataset_path, limit=10)
first_doc = next(iter(ocr_texts.values()))
print("Bounding boxes in raw OCR data:")
for page in first_doc["pages"]:
    if "tokens" in page:
        tokens = page["tokens"]
        for token in tokens:
            print(token["bbox"])

It’s great to see someone else working on the same challenge. I’d be happy to collaborate and help each other out as we work through this task.

Kind regards,
Gaye

tniemeier commented 1 month ago

I managed to fine-tune and get annotated outputs yesterday. I will probably have some first VRDU benchmark results by the end of the day.

To make the model work, it was sufficient for me to just multiply the bbox values by 1000.
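
Concretely, since the coordinates are already in [0, 1], the scaling is trivial; a minimal sketch, assuming the [page, x_min, y_min, x_max, y_max] layout of the annotation bboxes shown above:

def scale_bbox_to_1000(bbox: list[float]) -> list[int]:
    # Coordinates are normalized to [0, 1]; LayoutLMv3 expects integers in [0, 1000].
    # bbox[0] is the page number, so skip it.
    return [int(round(coord * 1000)) for coord in bbox[1:]]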

A bigger problem that took quite some time to solve was that the maximum sequence length of the LayoutLMv3 model is 512, while most of the sequences in this dataset end up around 2000-3000 tokens after processing into a Hugging Face Dataset. Ali Tavana's answer here helped me a lot: https://stackoverflow.com/questions/74290497/how-to-handle-sequences-longer-than-512-tokens-in-layoutlmv3
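
The core of that answer is to let the processor split long documents into overlapping 512-token windows; roughly like this (a sketch of the call, where image, words, boxes and word_labels are the per-page inputs I prepared beforehand):

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)

# image: PIL image of the page; words, boxes, word_labels: aligned per-word lists.
encoding = processor(
    image,
    words,
    boxes=boxes,
    word_labels=word_labels,
    truncation=True,
    stride=128,                      # overlap between consecutive windows
    padding="max_length",
    max_length=512,
    return_overflowing_tokens=True,  # one row per 512-token window
    return_offsets_mapping=True,     # needed later to merge subwords back into words
    return_tensors="pt",
)

# Neither of these is a model input, so pop them before the forward pass.
offset_mapping = encoding.pop("offset_mapping")
overflow_to_sample_mapping = encoding.pop("overflow_to_sample_mapping")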

Two other problems I experienced:

1. While preprocessing the PDFs for the Hugging Face Dataset, some images were encoded as 'jb2 (JBIG2)', which does not seem to be supported by Pillow. I used WSL and installed poppler-utils to create PNGs for every page of these PDFs, similar to this answer: https://stackoverflow.com/questions/60851124/extract-images-from-pdf-how-to-handle-jbig2-encoded (a small sketch of the conversion follows after this list). Once I had the PNGs, it was easy to convert them into Pillow Images and add them to the Dataset.

2. How to handle the multipage aspect of the VRDU documents. For now, I have solved it by making every page of a document a separate data point in my Hugging Face Dataset. But I don't think this is really suitable, as the model only learns single pages and not the document as a whole. When I tried to attach a list of images to a data point, the LayoutLMv3 processor threw an error, as it expects a single image. I wonder how Zilong Wang et al. solved this problem, as I could not find any clue in the paper or anywhere else. Any suggestions on how to solve this in a better way would be greatly appreciated, as I was unable to find anything helpful via Google. The best information I found is https://github.com/NielsRogge/Transformers-Tutorials/issues/114 and the links Pierce Lamb provided there. But as far as I understand, there is no intended way to handle multipage documents in LayoutLMv3.
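
For problem 1, the conversion step looks roughly like this (a sketch; pdftoppm ships with poppler-utils, and the output prefix and DPI are just my choices):

import glob
import subprocess
from PIL import Image

def pdf_to_page_images(pdf_path: str, out_prefix: str, dpi: int = 150) -> list[Image.Image]:
    # Render every page to PNG with pdftoppm, which copes with the
    # JBIG2-encoded images that Pillow cannot decode directly.
    subprocess.run(["pdftoppm", "-png", "-r", str(dpi), pdf_path, out_prefix], check=True)
    # pdftoppm writes <prefix>-1.png, <prefix>-2.png, ... (zero-padded for
    # longer documents), so sorting the matches keeps the page order.
    return [Image.open(p) for p in sorted(glob.glob(f"{out_prefix}-*.png"))]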

If any code examples are needed I can send them, feel free to ask.

Kind regards, Thomas

gayecolakoglu commented 1 month ago

Hi @tniemeier,

Thank you again for your detailed explanation and the progress update!

I’ve worked through the data preprocessing challenges. However, I’m currently facing some difficulties with the labeling and evaluation stages. My goal is to extract entity-value pairs from the documents, but I’m unsure how to correctly handle the labeling and evaluate the model for this task.

Since you mentioned that you were able to generate annotated outputs, I was wondering how you handled the labeling process and model evaluation. Any guidance you can offer, along with the steps you followed, would be greatly appreciated.

Thanks again for your support!

Kind regards, Gaye

tniemeier commented 1 month ago

Hi @gayecolakoglu,

For labeling, I used the entity names provided by the VRDU benchmark and split them into 'B' and 'I' labels for 'Beginning' and 'Inside' word annotation, plus an 'O' label for everything else. I used this guide to understand how to label properly: HuggingFace Token Classification
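
In other words, one 'O' label plus a 'B-'/'I-' pair for every entity name; a minimal sketch (the entity names below are placeholders - take the real ones from the dataset's schema):

# Placeholder entity names - use the ones from the VRDU schema you are training on.
entity_names = ["file_date", "registrant_name", "signer_name"]

labels = ["O"] + [f"{prefix}-{name}" for name in entity_names for prefix in ("B", "I")]
id2label = dict(enumerate(labels))
label2id = {label: idx for idx, label in enumerate(labels)}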

To gather key-value pairs, you first need to structure the predictions, bboxes and tokens on a word basis. I used the following code to do so:

import numpy as np

# id2label, unnormalize_box, width, height and STRIDE_SIZE are defined elsewhere
# in my script; offset_mapping is the tensor popped from the processor encoding.
logits = outputs.logits
predictions = logits.argmax(-1).squeeze().tolist()
token_boxes = encoding.bbox.squeeze().tolist()
labels = encoding.labels.squeeze().tolist()
input_ids = encoding["input_ids"].squeeze().tolist()
input_words = [processor.tokenizer.decode(i) for i in input_ids]

# A single 512-token window comes out as flat lists, so wrap everything in a
# list to let the loop below treat single- and multi-window documents the same.
if len(token_boxes) == 512:
    predictions = [predictions]
    token_boxes = [token_boxes]
    labels = [labels]
    input_ids = [input_ids]
    input_words = [input_words]

# Pair predictions on a word basis, flatten it and remove stride
true_predictions: list[str] = []
true_boxes: list[list[float]] = []
true_words: list[str] = []
for i, (pred, box, i_ids, mapped) in enumerate(zip(predictions, token_boxes, input_ids, offset_mapping)):
    # Marks every token that is a subword (its offset does not start at 0) -
    # used to merge tokens back into words.
    is_subword = np.array(mapped.squeeze().tolist())[:, 0] != 0
    if i == 0:
        true_predictions += [id2label[pred_] for idx, pred_ in enumerate(pred) if not is_subword[idx]]
        true_boxes += [unnormalize_box(box_, width, height) for idx, box_ in enumerate(box) if not is_subword[idx]]

        # Pairs words
        for idx, i_ids_ in enumerate(i_ids):
            if not is_subword[idx]:
                true_words.append(processor.tokenizer.decode(i_ids_))
            else:
                true_words[-1] = true_words[-1] + processor.tokenizer.decode(i_ids_)

    else:
        # Later windows: drop the leading words that fall in the <s> token plus the
        # stride overlap, since the previous window already emitted them.
        true_predictions += [id2label[pred_] for idx, pred_ in enumerate(pred) if not is_subword[idx]][1 + STRIDE_SIZE - sum(is_subword[:1 + STRIDE_SIZE]):]
        true_boxes += [unnormalize_box(box_, width, height) for idx, box_ in enumerate(box) if not is_subword[idx]][1 + STRIDE_SIZE - sum(is_subword[:1 + STRIDE_SIZE]):]

        temp: list[str] = []
        for idx, i_ids_ in enumerate(i_ids):
            if not is_subword[idx]:
                temp.append(processor.tokenizer.decode(i_ids_))
            else:
                temp[-1] = temp[-1] + processor.tokenizer.decode(i_ids_)
        true_words += temp[1 + STRIDE_SIZE - sum(is_subword[:1 + STRIDE_SIZE]):]

Note that I use a for loop and an if/else because I used stride, truncation and padding to handle the longer sequence lengths, so a prediction usually has the shape [N, 512] (N overlapping 512-token windows).

To then get the key-value pairs, I removed all 'O' labels and accumulated the predictions in a dict like this:

preds = []
l_words = []
bboxes = []
for idx, _pred in enumerate(true_predictions):
    if _pred != 'O':
        preds.append(true_predictions[idx])
        l_words.append(true_words[idx])
        bboxes.append(true_boxes[idx])

extractions: dict[str, list[str]] = {}
for idx, _preds in enumerate(preds):
    # iob_to_label splits an IOB tag into its prefix ('B'/'I') and the entity name.
    identifier, _preds = iob_to_label(_preds)
    _preds = _preds.lower()

    if _preds not in extractions:
        extractions[_preds] = [l_words[idx]]
    else:
        if identifier == 'B':
            # A new 'B' tag starts another occurrence of the same entity.
            extractions[_preds].append(l_words[idx])
        else:
            # An 'I' tag continues the previous occurrence, so merge the words.
            extractions[_preds] = [''.join(extractions[_preds]) + l_words[idx]]
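
For reference, iob_to_label is just a small helper that splits an IOB tag into its prefix and the entity name; a minimal sketch:

def iob_to_label(label: str) -> tuple[str, str]:
    # 'B-<entity>' -> ('B', '<entity>'); a plain 'O' stays as is.
    if label == 'O':
        return 'O', 'O'
    prefix, _, entity = label.partition('-')
    return prefix, entity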

The 'extractions' dict can be used directly for the VRDU evaluation script - you just need to pair the per-image results back into documents and then write them into the prediction files.
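
Pairing per-page results back to documents can be as simple as this (hypothetical sketch; page_results and the filenames are placeholders, and the exact prediction-file format is the one described in the benchmark repo):

from collections import defaultdict

# page_results: list of (doc_filename, page_extractions) pairs, where
# page_extractions is the 'extractions' dict produced above for one page.
doc_extractions: dict[str, dict[str, list[str]]] = defaultdict(lambda: defaultdict(list))

for doc_filename, page_extractions in page_results:
    for entity, values in page_extractions.items():
        doc_extractions[doc_filename][entity].extend(values)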

Hope that helps. I couldn't comment my code properly since I just copy-pasted it, so feel free to ask if you have any questions.

Kind regards, Thomas