ucaslcl / Fox

official code for "Fox: Focus Anywhere for Fine-grained Multi-page Document Understanding"

Question about the en_box_ocr? #3

Open sky-fly97 opened 2 months ago

sky-fly97 commented 2 months ago

Hello, I was looking at the coordinates in the en_box_ocr task and noticed that they don't seem to match the text very well. What is the format of the coordinates, and am I matching them to the wrong region?

"image": "en_45.png",
"conversations": [
{ "from": "human", "value": "<image>\nGive the OCR results of the box [113, 660, 886, 914]." },
{ "from": "gpt", "value": "10 This operation is sometimes called śabdātmikā bhāvanā (literally, verbal effectuation). See Kumārila’s Tantravārttika \non Mīmāṃsāsūtra 2.1.1 (TV II.344.14-16): “Of those, that operation, which belongs to linguistic expressions like \nexhortative verbal endings and which motivates a person towards object-directed effectuation (arthātmikā bhāvanā), is \nthe second kind of effectuation (bhāvanā), which is a property of linguistic expressions, which has the nature of \ndesignation (abhidhā), and which is said to be the vidhi.” (tatrārthātmikāyāṃ bhāvānāyāṃ liṅādiśabdānāṃ yaḥ puruṣaṃ prati \nprayojakavyāpāraḥ, sā dvitīyā śabdadharmo ’bhidhātmikā bhāvanā vidhir ucyate|) \n11 Throughout this essay, I will take the meaning (artha) of any linguistic expression (roughly, speaking) to be the object \nthat is conveyed to a linguistically competent hearer by that expression. On the Prābhākara view that Maṇḍana \npresents, the meaning of exhortative verbal endings is niyoga or injunction, also referred to as apūrva (VVS §7, or \nVVMG 36-78). When an agent hears an exhortative verbal ending in the context of an exhortation addressed to her, \nshe undergoes an awareness-event of the form: “I am enjoined’’ (niyukto ’smi). Such self-ascriptions are supposed to \ntrack an entity called injunction—according to one interpretation, something to be done or brought about—which is \nnot accessible by any means of knowing other than language, and, unlike other entities that are part of the natural \nfabric of reality, does not exist in the past, the present, or the future. For discussion of this view in its sources, see \nPrabhākara Miśra’s sub-commentary Bṛhatī on Mīmāṃsāsūtra 2.1.5 along with Śālikanātha Miśra’s Ṛjuvimalā (Bṛ 319-\n324) and the second chapter of Śālikanātha’s Vākyārthamātṛkā in Prakaraṇapañcikā (PP 417-450). For discussion of \nPrabhākara’s view, see Clooney (1990, pp. 245ff) and Yoshimizu (1997, pp. 96ff)." }
]

ucaslcl commented 2 months ago

The coordinates give the top-left and bottom-right corners of the box, and their values are normalized by the image width and height. You can resize the image to (1000, 1000) and draw the box again.
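
As a rough illustration of the reply above, the box can also be mapped back to pixel coordinates instead of resizing the image. This is only a sketch: it assumes the values lie on a 0 to 1000 scale (inferred from the suggestion to resize to (1000, 1000)) and are ordered [x1, y1, x2, y2]. The helper name `denormalize_box` and the page size used below are hypothetical and not part of the Fox repository.

```python
def denormalize_box(box, width, height, scale=1000):
    """Map [x1, y1, x2, y2] from a 0..scale grid to pixel coordinates.

    Assumption: the first pair is the top-left corner and the second pair
    is the bottom-right corner, both normalized by the image width/height.
    """
    x1, y1, x2, y2 = box
    return (
        round(x1 * width / scale),
        round(y1 * height / scale),
        round(x2 * width / scale),
        round(y2 * height / scale),
    )


# Hypothetical page size; the real size of en_45.png is not given in this thread.
pixel_box = denormalize_box([113, 660, 886, 914], width=2000, height=3000)
print(pixel_box)  # (226, 1980, 1772, 2742)
```

With Pillow, the resulting tuple could then be passed to `ImageDraw.Draw(image).rectangle(pixel_box, outline="red")` to check visually that the box covers the expected text.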

sky-fly97 commented 2 months ago

> The coordinates give the top-left and bottom-right corners of the box, and their values are normalized by the image width and height. You can resize the image to (1000, 1000) and draw the box again.

Well, I don't see this information in the README, and following the example shown there as written may lead to incorrect usage:

prompt = ann["conversations"][0]["value"]
image_file = ann["image"]
image_file_path = os.path.join(args.image_path, image_file)
image = load_image(image_file_path)
outputs = model.generate(image, prompt)

By the way, are there test results for other models in Table 4 that you could share?