uakarsh / latr

Implementation of LaTr: Layout-aware transformer for scene-text VQA, a novel multimodal architecture for Scene Text Visual Question Answering (STVQA)
https://uakarsh.github.io/latr/
MIT License

An error in dataset.py : create_features function #15

Open HouTong-s opened 1 year ago

HouTong-s commented 1 year ago

I think it should be:

if use_ocr == True:
  entries = apply_ocr(img_path)
  bounding_box = entries["bbox"]
  words = entries["words"]
  bounding_box = list(map(lambda x: resize_align_bbox(x, width_old, height_old, width, height), bounding_box))

That is, the line

bounding_box = list(map(lambda x: resize_align_bbox(x, width_old, height_old, width, height), bounding_box))

must be inside the if use_ocr == True: block.

Because in the OCRDATASET, the boxes are already resized to the target size before create_features is called:

for i in sample_entry[1]['Blocks']:
  if i['BlockType']=='WORD' and i['Page']==1:
    words.append(i['Text'].lower())
    curr_box = i['Geometry']['BoundingBox']
    xmin, ymin, xmax, ymax = curr_box['Left'], curr_box['Top'], curr_box['Width']+ curr_box['Left'], curr_box['Height']+ curr_box['Top']
    curr_bbox =  resize_align_bbox([xmin, ymin, xmax, ymax], 1, 1, width, height)
    coordinates.append(curr_bbox)

## Similar to the docformer's create_features function, but with some changes
img, boxes, tokenized_words = create_features(image_path = tif_path,
                                              tokenizer = self.tokenizer,
                                              target_size = (1000, 1000),
                                              use_ocr = False,
                                              bounding_box = coordinates,
                                              words = words
                                              )

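As you can see, curr_bbox is already in the target size (1000, 1000), so resizing it again inside create_features when use_ocr is False would rescale the boxes a second time.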
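To make the double scaling concrete, here is a minimal sketch. It assumes that resize_align_bbox simply scales each coordinate by target/original and rounds; the stand-in below is my guess at that behaviour, not the repository's actual code. The helper names boxes_current and boxes_proposed, the 500x400 image size, and the sample box are hypothetical and only serve the illustration.

```python
def resize_align_bbox(bbox, orig_w, orig_h, target_w, target_h):
    # Stand-in (assumption): scale each coordinate from the original
    # size to the target size and round to the nearest integer.
    x_scale = target_w / orig_w
    y_scale = target_h / orig_h
    left, top, right, bottom = bbox
    return [round(left * x_scale), round(top * y_scale),
            round(right * x_scale), round(bottom * y_scale)]


def boxes_current(boxes, width_old, height_old, width, height, use_ocr):
    # Behaviour described in this issue: the resize runs regardless of use_ocr.
    return [resize_align_bbox(b, width_old, height_old, width, height) for b in boxes]


def boxes_proposed(boxes, width_old, height_old, width, height, use_ocr):
    # Proposed behaviour: resize only the boxes that came from apply_ocr.
    if use_ocr:
        return [resize_align_bbox(b, width_old, height_old, width, height) for b in boxes]
    return boxes


# A normalized box (0 to 1), pre-scaled by the dataset to 1000 x 1000.
normalized = [0.1, 0.2, 0.5, 0.6]
prescaled = resize_align_bbox(normalized, 1, 1, 1000, 1000)  # [100, 200, 500, 600]

# Hypothetical original image size of 500 x 400 pixels.
twice = boxes_current([prescaled], 500, 400, 1000, 1000, use_ocr=False)
once = boxes_proposed([prescaled], 500, 400, 1000, 1000, use_ocr=False)

print(twice)  # [[200, 500, 1000, 1500]] -> scaled twice, outside the 1000 range
print(once)   # [[100, 200, 500, 600]]   -> left as the dataset produced it
```

Under these assumptions, the unconditional resize pushes the box past the 1000 x 1000 target, while guarding it with use_ocr keeps the pre-scaled coordinates unchanged.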