ucd-library / wine-price-extraction

This repository relates to Template, and Machine Extraction of Wine Prices from Sherry Lehmann Catalogs.
MIT License

Comparison of entry_price, entry_name boxes and PNG files #21

Open qjhart opened 5 years ago

qjhart commented 5 years ago

Here are two examples of pages to look at: one with DSI truth data and one with PTV truth data. The bounding boxes in the entry_* tables seem to be based on a new coordinate system, the one shown in your skewed PNG files. Are these values not deskewed back?

Here are some examples of a non-skewed and a skewed page.

DSI Truth Table - d7301d-004 UCD_Lehmann_1470

This is an example of a page where all the text boxes move to the left of the page.

d7301d-004_name_boxes

Statistically, these are much more likely to happen, and most page reviews do not properly show where the wine names come from.

DSI data d74k53-011 or UCD_Lehmann_3661

This is an example of a fairly successful (96% hit rate) price extraction. We can't read the box location off the PNG files, since they don't match, but we can redraw the box in GIMP or a similar tool to test. So, looking at the PNG file, we can replicate the box, or

Downloading the JPG and hand drawing the box we get: UL=418,870 and LR=2326,928

The values from entry_name are:

postgres=# select l,t,r,b,name_trim from entry_name where file_id='d74k53-011' and text_raw like '%786%';
   l    |  t  |    r    |  b  |             name_trim             
--------+-----+---------+-----+-----------------------------------
 401.25 | 929 | 2540.75 | 866 | BORDEAUX BLANC Domaine de Marquis

And the bbox in wine_search is the longer one, since it extends to cover the price:

select bbox from wine_search where page_ark='d74k53-011' and name='BORDEAUX BLANC Domaine de Marquis';            
                                                  bbox                                                  
--------------------------------------------------------------------------------------------------------
 {"type":"Polygon","coordinates":[[[401.25,-929],[401.25,-869],[3355,-869],[3355,-929],[401.25,-929]]]}
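One way to put a number on how well the hand-drawn box agrees with the stored boxes is intersection over union (IoU). The sketch below is only illustrative: the helper names are hypothetical, it assumes the negative y values in the wine_search polygon are image rows with the sign flipped, and it assumes the hand-drawn JPG coordinates and the table coordinates share a frame, which is exactly what this issue is questioning.

```python
def normalize(l, t, r, b):
    """Return (xmin, ymin, xmax, ymax) regardless of which edge was stored first."""
    return (min(l, r), min(t, b), max(l, r), max(t, b))


def polygon_to_box(geojson):
    """Bounding box of a GeoJSON Polygon.

    ASSUMPTION: y values are stored negated (e.g. -929 for image row 929),
    so they are flipped back to positive image rows here.
    """
    ring = geojson["coordinates"][0]
    xs = [x for x, _ in ring]
    ys = [-y for _, y in ring]
    return (min(xs), min(ys), max(xs), max(ys))


def area(box):
    """Area of an (xmin, ymin, xmax, ymax) box."""
    return (box[2] - box[0]) * (box[3] - box[1])


def iou(a, b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


# Values copied from this issue for d74k53-011.
hand_box = normalize(418, 870, 2326, 928)          # hand-drawn UL / LR
name_box = normalize(401.25, 929, 2540.75, 866)    # entry_name l, t, r, b
search_box = polygon_to_box({
    "type": "Polygon",
    "coordinates": [[[401.25, -929], [401.25, -869], [3355, -869],
                     [3355, -929], [401.25, -929]]],
})

print(round(iou(hand_box, name_box), 3))
print(round(iou(hand_box, search_box), 3))
```

With these numbers the hand-drawn box sits entirely inside the entry_name box (IoU of about 0.82), while the wine_search box scores lower (about 0.62) because it runs on to the price column.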

PTV data d7pp4q-023

A really good PTV page is: d7pp4q-023

Downloading the JPG and hand drawing the box we get: UL=336,1416 LR=1488,1455

Screenshot from 2019-06-27 16-37-36

The entry_name bbox is:

select l,t,r,b,name_trim from entry_name where file_id='d7pp4q-023' and text_raw like '%120%';
   l   |  t   |   r    |  b   |    name_trim     
-------+------+--------+------+------------------
 353.5 | 1431 | 1523.5 | 1396 | MARCEL PERE BRUT

Or, from the wine_search...

This one doesn't look right, and the PNG image shows why: the image is not rotated back into the original coordinates:

select bbox from wine_search where page_ark='d7pp4q-023' and name='MARCEL PERE BRUT' and perprice=7.49;
                                                   bbox                                                   
----------------------------------------------------------------------------------------------------------
 {"type":"Polygon","coordinates":[[[353.5,-1431],[353.5,-1401],[1873,-1401],[1873,-1431],[353.5,-1431]]]}

Screenshot from 2019-06-27 16-37-54

jcarlen commented 5 years ago

A few things. Please let me know if this is the information you were looking for, and follow up with any remaining questions.

1) First, I discovered the likely cause for the boxes you were seeing that were far too short (as we talked about yesterday). I used the lag function from the dplyr package when figuring out the box sizes under certain conditions (e.g. an ID is missing), but this can conflict with the lag function from the stats package (depending on the order in which they're loaded). I changed my code to explicitly call the dplyr version in all instances. I committed the change to tablewine, so please update the package and let me know if that fix works for you.

In the example file (d7pp4q-023 = UCD_Lehmann_1031), there was one instance of a too-short box before I made this change, but now the boxes look good (There's a couple that err on the side of too tall, and the rest look just right.) I attached the output after making that fix.

In your example (d7301d-004 = UCD_Lehmann_1470) with the too-long name boxes (extending to the left of the page) the results also look pretty good now (attached). However, because of the order I was addressing issues, I didn't run this example right before making the fix so I can't be sure this is what caused the improvement. So please let me know if this fixes things on your end

2) To make sure we're on the same page, the "left, bottom, right, top" (or "l, b, r, t") fields in ENTRY_PRICE and ENTRY_NAME relate to the deskewed image. The "price_center_x_orig, price_center_y_orig" fields (in ENTRY_PRICE) and the "name_center_x_orig, name_center_y_orig" fields (in ENTRY_NAME) are transformed back to the original coordinate system (the original image before deskewing). The ENTRY_PAGE table contains the deskew angle, in case other conversions are needed.
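For anyone who needs the full box in original coordinates rather than just the centers, a conversion along these lines might work. This is a rough sketch, not the tablewine implementation: it assumes the deskew was a pure rotation about the page center, that the stored angle is in degrees, and that undoing it means rotating by the negative angle. The function names, the angle, and the page size in the example are all hypothetical, so the sign convention and pivot should be checked against a page with known boxes.

```python
import math


def rotate_point(x, y, angle_deg, cx, cy):
    """Rotate (x, y) about (cx, cy) by angle_deg.

    The rotation is counter-clockwise in a y-up frame; in image
    coordinates (y pointing down) it appears clockwise on screen.
    """
    a = math.radians(angle_deg)
    dx, dy = x - cx, y - cy
    return (cx + dx * math.cos(a) - dy * math.sin(a),
            cy + dx * math.sin(a) + dy * math.cos(a))


def box_to_original(l, t, r, b, deskew_angle_deg, width, height):
    """Map a deskewed-frame box to a 4-corner polygon in the original frame.

    ASSUMPTIONS (not confirmed in this thread): the page was deskewed by
    rotating it deskew_angle_deg about its center, so the inverse is a
    rotation by -deskew_angle_deg about the same center.
    """
    cx, cy = width / 2.0, height / 2.0
    corners = [(l, t), (r, t), (r, b), (l, b)]
    return [rotate_point(x, y, -deskew_angle_deg, cx, cy) for x, y in corners]


# entry_name box for d7pp4q-023 from this issue; the angle (1.5 degrees)
# and the page size (3000 x 4000) are made-up placeholders.
polygon = box_to_original(353.5, 1431, 1523.5, 1396, 1.5, 3000, 4000)
for x, y in polygon:
    print(round(x, 1), round(y, 1))
```

The result is a polygon rather than an l, t, r, b tuple because a rotated rectangle is generally no longer axis-aligned in the original frame.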

jcarlen commented 5 years ago

Attachments referenced in my last message (don't think they went through before). d7301d-004.zip d7pp4q-023.zip

qjhart commented 5 years ago

@jcarlen, thanks for the fix. Your fix seems to work, but strangely there are still some minor differences in the output.

d7pp4q-023

Mine seems to have one extra improvement over yours:

 who   | image
-------+-------------------------------------
 mine  | Screenshot from 2019-07-02 11-29-36
 yours | Screenshot from 2019-07-02 11-38-30

d7301d-004

There are considerably more differences in this image. Even though the code is supposed to be the same and the training data is the same, we get different results, although some of your later fixes might have affected this as well. Both the price and the name boxes are different.

Price Boxes

 who   | image
-------+-------------------------------------
 mine  | Screenshot from 2019-07-02 11-55-02
 yours | Screenshot from 2019-07-02 11-59-32

Name Boxes

 who   | image
-------+-------------------------------------
 mine  | Screenshot from 2019-07-02 12-01-03
 yours | Screenshot from 2019-07-02 12-01-58