ucd-library / wine-price-extraction

This repository relates to Template, and Machine Extraction of Wine Prices from Sherry Lehmann Catalogs.
MIT License

Properly Using the DSI truth tables. #15

Open qjhart opened 5 years ago

qjhart commented 5 years ago

I am having difficulty understanding the best way to use the DSI truth tables. There are about 1900 price tables from DSI, but I can't reliably align them with the outputs from the process, so I can't use very many of them.

In addition, looking at these leaves me confused about the difference between price_raw and price_new in the process. Here are some examples:

First, I tried to left join the truth tables (jane.dsi_truth) to the process output (jane.price_name) on page, cluster, and row.

with t as (
  select t.page_ark, t.cluster, t.row, t.text_true,
         p."text.true", "col.header", p.price_raw,
         p.type_new, p.price_new
  from jane.dsi_truth t
  left join jane.price_name p
    on (p.file_id = t.page_ark and p.cluster = t.cluster and p.row = t.row)
  order by page_ark, cluster, row
)
select * from t;
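
As a diagnostic for the join above, here is a minimal sketch that counts, per page, how many truth rows find no partner in price_name. It uses Python's standard sqlite3 module so it can run without the real database. The join keys (file_id/page_ark, cluster, row) come from the query; the unqualified table names, the function name unmatched_by_page, and the use of SQLite instead of the actual database are assumptions for illustration only.

```python
import sqlite3


def unmatched_by_page(conn: sqlite3.Connection):
    """Return (page_ark, truth_rows, unmatched) tuples, ordered by page_ark.

    A truth row counts as matched when at least one price_name row shares
    its page, cluster, and row. Using EXISTS (instead of a join) keeps the
    counts correct even if several price_name rows match one truth row.
    """
    sql = '''
        select page_ark,
               count(*)         as truth_rows,
               sum(1 - matched) as unmatched
        from (
            select t.page_ark,
                   case when exists (
                       select 1
                       from price_name p
                       where p.file_id = t.page_ark
                         and p.cluster = t.cluster
                         and p."row"   = t."row"
                   ) then 1 else 0 end as matched
            from dsi_truth t
        ) as m
        group by page_ark
        order by page_ark
    '''
    return conn.execute(sql).fetchall()
```

Pages where unmatched is close to truth_rows are probably the ones where the cluster or row numbering disagrees between the two sources, which would be worth inspecting before comparing prices.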

Spreadsheet

Text.true

For the entire run, I only get 48 non-null text.true in the price_name table. Why so few?

price_raw vs price_new

The prices from d761s3g-002 are confusing in that good price_raw values are replaced with FALSE in price_new. (More often, price_new is a copy of price_raw when price_raw is good.) How do we get the best price from the price_name table?
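
To make the question concrete, here is one possible fallback rule as a minimal sketch: use price_new unless it is missing or FALSE, and otherwise fall back to price_raw. This is only an assumption about how the two columns might be combined, not how the process defines them. The function name best_price and the set of values treated as missing (None, False, "", "FALSE", "NA") are hypothetical.

```python
def best_price(price_raw, price_new):
    """Prefer price_new; fall back to price_raw when price_new is unusable.

    Assumption: an unusable price_new shows up as None, a boolean False,
    or one of the strings "", "FALSE", "NA" (case-insensitive).
    """
    if price_new is None or price_new is False:
        return price_raw
    if isinstance(price_new, str) and price_new.strip().upper() in {"", "FALSE", "NA"}:
        return price_raw
    return price_new
```

A rule like this would not help in the cases where price_new is present but worse than price_raw; those would still need to be checked against the truth tables.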

Bad Rows

Jane had already identified that when the rows are confused the truth tables get jumbled. d71s3g-002 shows that problem as well.

d7259v-012

For some reason, I can't get the entries to align. We also have some cases where price_new is worse than price_raw (e.g., name_id=d7259v-012_1_2).

d7xw26-023

I wouldn't expect there to be so many clusters in most situations. This is the example with the most clusters identified in a truth table. I'm not sure why these are identified as having 8 clusters, and not 4 in two tables. I don't understand the ordering of the clusters either.

Comparing with the selected data is pretty hard. We can see the names are mangled pretty badly, so I'm not going to compare these outputs.

qjhart commented 5 years ago

@jcarlen Yesterday we identified this item: https://digital.ucdavis.edu/ark:/87287/d71s3g. In the runs I did, price_new became FALSE for many good items. For yours, that was not the case. Here is a zip file of the data as calculated in the cloud: d71s3g-002.zip. If you can spot any errors in the *.RDS files compared with yours, let me know.

I ran this again locally and got the same results. I will do two things: 1) send a script to run our version of the code for testing, and 2) try to review the price_new creation as well.

jcarlen commented 5 years ago

In this case, like a previous one, my .RDS output is different from yours: example2.zip

The randomness of the differences makes me think that the first thing to check is the image resolution. In this example my input image is a 4000 x 6000 .jpg.

qjhart commented 5 years ago

@jcarlen You are saying that the _data1.RDS files are different as well? These used the 4kx6k images.

jcarlen commented 5 years ago

Yes, the _data1.RDS files are different, and I calculated that mine has 55 potential prices, whereas yours has 29. Sorry, should have included it with the example: UCD_Lehmann_3392_data1.RDS.zip

qjhart commented 5 years ago

@jcarlen Can you run your process on this file: https://digital.ucdavis.edu/ark:/87287/d71s3g/media/images/d71s3g-002.jpg. Can you share your image? Are they the same? From what I have, these are the same file.
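
One way to settle whether we are both feeding the process the same input is to compare checksums of the two image files rather than their names or dimensions. A minimal sketch using only Python's standard library follows; the function names sha256_of and same_file are hypothetical, and the paths passed in would be wherever each of us keeps a local copy.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so large images need not fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_file(path_a, path_b):
    """True when both files have byte-identical content."""
    return sha256_of(path_a) == sha256_of(path_b)
```

If the digests match, the input is byte-identical and the differing .RDS output has to come from the code or its environment, not from the image.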

jcarlen commented 5 years ago

Yes, the images are the same. The output in example2.zip is for that file.