microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI
MIT License

Need guidance for key value pair extraction. #791

Open Laxmi530 opened 2 years ago

Laxmi530 commented 2 years ago

Describe Model I am using (LayoutLM): Can someone please guide me on how to extract key-value pairs from a scanned invoice using LayoutLM?

wolfshow commented 2 years ago

You may follow the LayoutXLM paper and look at this: https://github.com/microsoft/unilm/tree/master/layoutxlm#fine-tuning-for-relation-extraction
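
For context, the relation-extraction head referenced above needs more than raw OCR output: besides input_ids, bbox, attention_mask and the page image, it consumes entity spans and candidate key-value links. Below is a rough sketch of what those extra inputs look like, assuming the field names used by the XFUND fine-tuning code (they may differ in other forks):

# Hypothetical illustration of the extra inputs the relation-extraction head
# consumes; field names follow the XFUND fine-tuning code and are an
# assumption, not something confirmed in this thread.
entities = [{
    "start": [3, 10],       # token index where each entity begins
    "end":   [7, 14],       # token index where each entity ends
    "label": [1, 2],        # e.g. 1 = question/key, 2 = answer/value
}]
relations = [{
    "head": [0],            # index into the entity list of the key
    "tail": [1],            # index into the entity list of the linked value
    "start_index": [3],     # first token covered by the pair
    "end_index":   [14],    # last token covered by the pair
}]
# During fine-tuning these come from labeled data such as XFUND; at inference
# time the spans are usually produced first by a token-classification model.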

Laxmi530 commented 2 years ago

@wolfshow Thank you so much for the reply. I need an implementation example or a procedure for how to do it. I went through the Hugging Face site and @NielsRogge's tutorials, but most people there only do fine-tuning. I followed some steps to process the document, but I am getting an error, which you can see below. Can you please help me?

[screenshot of the error]

Laxmi530 commented 2 years ago

@wolfshow Can you please share an example that I can follow, or some guidance on how to use LayoutXLM?

NurielWainstein commented 2 years ago

@Laxmi530 Hi, did you manage to get it working? I'm trying to do the same thing and I'm a bit lost. Can you share the steps you took to get where you are, or give any tips?

Laxmi530 commented 2 years ago

@nurielw05 Hi, I tried the code below but I am getting an error. I also saw your code; you did it quite well, but that is not exactly key-value pair extraction. You can see my code below; if you are able to fix the error, let me know.

import numpy as np
import pytesseract
import torch
from PIL import Image
from transformers import AutoTokenizer, LayoutLMv2FeatureExtractor, LayoutLMv2ForRelationExtraction
# LayoutLMv2ForRelationExtraction comes from the transformers build used here;
# it is not part of every release.

feature_extractor = LayoutLMv2FeatureExtractor(apply_ocr=False)
# `path` points to the LayoutXLM checkpoint directory;
# '<pad>' is the XLM-R / LayoutXLM pad token
tokenizer = AutoTokenizer.from_pretrained(path, pad_token='<pad>')
model = LayoutLMv2ForRelationExtraction.from_pretrained(path)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

image_file = 'image4.png'
image = Image.open(image_file).convert('RGB')
image  # display the image in the notebook

width, height = image.size
w_scale = 1000 / width
h_scale = 1000 / height

# Run Tesseract OCR and keep only rows that actually contain text
ocr_data = pytesseract.image_to_data(image, output_type='data.frame')
ocr_data = ocr_data.dropna()
ocr_data = ocr_data.assign(left_scaled=ocr_data.left * w_scale,
                           width_scaled=ocr_data.width * w_scale,
                           top_scaled=ocr_data.top * h_scale,
                           height_scaled=ocr_data.height * h_scale,
                           right_scaled=lambda x: x.left_scaled + x.width_scaled,
                           bottom_scaled=lambda x: x.top_scaled + x.height_scaled)
float_cols = ocr_data.select_dtypes('float').columns
ocr_data[float_cols] = ocr_data[float_cols].round(0).astype(int)
ocr_data = ocr_data.replace(r'^\s*$', np.nan, regex=True)
ocr_data = ocr_data.dropna().reset_index(drop=True)
ocr_datawords = list(ocr_data.text)

# Build word-level bounding boxes in (left, top, right, bottom) format
coordinates = ocr_data[['left', 'top', 'width', 'height']]
actual_boxes = []
for idx, row in coordinates.iterrows():
    x, y, w, h = tuple(row)            # the row comes in (left, top, width, height) format
    actual_box = [x, y, x + w, y + h]  # turn it into (left, top, left+width, top+height) to get the actual box
    actual_boxes.append(actual_box)

def normalize_box(box, width, height):
    # Scale the box to the 0-1000 coordinate range expected by LayoutLM models
    return [
        int(1000 * (box[0] / width)),
        int(1000 * (box[1] / height)),
        int(1000 * (box[2] / width)),
        int(1000 * (box[3] / height)),
    ]

boxes = []
for box in actual_boxes:
    boxes.append(normalize_box(box, width, height))

encoding = tokenizer.encode_plus(ocr_datawords, boxes=boxes, return_tensors='pt')
input_id = encoding['input_ids']
attention_masks = encoding['attention_mask']
boxes = encoding['bbox']
encoding.keys()  # inspect which tensors the tokenizer produced
outputs = model(**encoding)

This is the error:

AttributeError                            Traceback (most recent call last)
c:\Users\name\Parallel\Trans_LayoutXLM.ipynb Cell 9 in <cell line: 1>()
----> 1 outputs = model(**encoding)

File c:\Users\name\.conda\envs\layoutlmft\lib\site-packages\torch\nn\modules\module.py:1130, in Module._call_impl(self, *input, **kwargs)
   1126 # If we don't have any hooks, we want to skip the rest of the logic in
   1127 # this function, and just call forward.
   1128 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1129         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1130     return forward_call(*input, **kwargs)
   1131 # Do not call functions when jit is used
   1132 full_backward_hooks, non_full_backward_hooks = [], []

File c:\Users\name\.conda\envs\layoutlmft\lib\site-packages\transformers\models\layoutlmv2\modeling_layoutlmv2.py:1585, in LayoutLMv2ForRelationExtraction.forward(self, input_ids, bbox, labels, image, attention_mask, token_type_ids, position_ids, head_mask, entities, relations)
   1522 @add_start_docstrings_to_model_forward(LAYOUTLMV2_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
   1523 @replace_return_docstrings(output_type=RegionExtractionOutput, config_class=_CONFIG_FOR_DOC)
   1524 def forward(
   (...)
   1535     relations=None,
   1536 ):
   1537     r"""
   1538     entities (list of dicts of shape `(batch_size,)` where each dict contains:
   1539         {
   (...)
   1582     >>> relations = *****
   1583     ```"""
-> 1585     outputs = self.layoutlmv2(
   1586         input_ids=input_ids,
   1587         bbox=bbox,
   1588         image=image,
   1589         attention_mask=attention_mask,
   1590         token_type_ids=token_type_ids,
   1591         position_ids=position_ids,
   1592         head_mask=head_mask,
...
--> 590     images_input = ((images if torch.is_tensor(images) else images.tensor) - self.pixel_mean) / self.pixel_std
    591     features = self.backbone(images_input)
    592     features = features[self.out_feature_key]

AttributeError: 'NoneType' object has no attribute 'tensor'
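
The traceback ends inside the visual backbone: images is None when the code reaches images.tensor, which suggests the image input never made it into the forward call; tokenizer.encode_plus only returns the text-side tensors (input_ids, attention_mask, bbox, ...). Below is a rough sketch of the missing step, reusing the variables from the code above and assuming the feature extractor exposes its pixel tensor under pixel_values (the key name can vary between transformers builds):

# Rough sketch of the missing inputs, reusing the variables defined in the
# code above; none of this is a verified fix.
image_features = feature_extractor(image, return_tensors='pt')    # visual input the tokenizer cannot produce
encoding['image'] = image_features['pixel_values']                # key name assumed; check your transformers build
encoding = {k: v.to(device) for k, v in encoding.items()}

# The relation-extraction head also expects entity spans and candidate links
# (see the sketch earlier in the thread); empty lists only satisfy the
# signature and will not give meaningful relations.
entities = [{"start": [], "end": [], "label": []}]
relations = [{"head": [], "tail": [], "start_index": [], "end_index": []}]

outputs = model(**encoding, entities=entities, relations=relations)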