ucd-library / wine-price-extraction

This repository relates to Template, and Machine Extraction of Wine Prices from Sherry Lehmann Catalogs.
MIT License

Attempt Process on Wine Menus #14

Open qjhart opened 5 years ago

qjhart commented 5 years ago

It seems like this process should work on at least a few of the menus that exist in the digital collections. In particular, we tried these menus:

https://digital.ucdavis.edu/ark:/87287/d7rs4c
https://digital.ucdavis.edu/ark:/87287/d7764v

Both failed on the final run_wine_database_one_page.R component.

[1] "truth.dir=/io/dsiData"                                               
[2] "in=/io/sloan-ocr/items/d7764v/media/images/d7764v-0/parsed_items.RDS"
[1] 19 38
 [1] "text"                "text_raw"            "text_conf"          
 [4] "name"                "keywords"            "upper_text"         
 [7] "lower_text"          "brackets_text"       "dictionary_hits"    
[10] "id"                  "year"                "color"              
[13] "province"            "region"              "producer"           
[16] "designation"         "variety"             "country"            
[19] "id_conf"             "year_conf"           "color_conf"         
[22] "province_sim"        "region_sim"          "producer_sim"       
[25] "designation_sim"     "variety_sim"         "brackets_conf"      
[28] "dictionary_hits_sim" "upper_text_hit"      "lower_text_hit"     
[31] "brackets_text_hit"   "file_name"           "confidence"         
[34] "inspect"             "dictionary_hit"      "any_hit"            
[37] "table"               "file"               
[1] 0
[1] "text_conf"           "dictionary_hits"     "dictionary_hits_sim"
Error in UseMethod("group_by_") : 
  no applicable method for 'group_by_' applied to an object of class "NULL"
Calls: %>% ... <Anonymous> -> group_by -> group_by.default -> group_by_
Execution halted

The example (in the style of the cloud computing) is contained in this example.tar.gz: https://github.com/ucd-library/wine-price-extraction/files/3279123/example.tar.gz
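
The error in the log means that the object piped into group_by was NULL, so some earlier step returned nothing. A minimal sketch of a guard that would turn this into a more descriptive failure, assuming dplyr is installed; `check_table`, the step labels, and the sample data are hypothetical and not part of the repository:

```r
library(dplyr)

# Hypothetical helper (not from the repository): fail early, naming the step
# whose result is unusable, instead of letting group_by() dispatch on NULL.
check_table <- function(x, step) {
  if (is.null(x)) {
    stop(sprintf("step '%s' produced NULL", step), call. = FALSE)
  }
  if (!is.data.frame(x)) {
    stop(sprintf("step '%s' produced a %s, not a data frame", step, class(x)[1]),
         call. = FALSE)
  }
  x
}

# Made-up stand-in for the parsed items; only the `table` column name is
# taken from the log above.
items <- data.frame(table = c(1, 1, 2), price = c(3.5, 4.0, 12.0))

per_table <- check_table(items, "parse items") %>%
  group_by(table) %>%
  summarise(n_items = n())

# With NULL input, the guard reports the step instead of the generic S3 error.
msg <- tryCatch(
  {
    check_table(NULL, "name matching") %>% group_by(table)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)
```

Wrapping each intermediate result this way would not fix the underlying problem, but it would show which step first returned NULL.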

jcarlen commented 5 years ago

I ran the actual price_table_extraction function on those images (downloaded as .jpg) and it worked fine (actually better than what I saw in your example.tar.gz, possibly because of the resolution of my downloads?). It got basically all the prices and many of the names. So the problem must be somewhere down the line in converting that output to tables. (I think you already knew that.) Is there a way from your process to tell exactly where it failed? I use the group_by function a lot so I can’t tell from just that. Sorry I’m not super familiar with the different std* that you shared, so I might just not be looking in the right place.
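
One way to see where a script failed is to capture the call stack at the moment of the error. A sketch in base R under stated assumptions: `with_trace`, `make_table`, and `run_page` are hypothetical names used only for illustration, not functions from the repository:

```r
# Hypothetical wrapper (base R only): evaluate an expression and, if it fails,
# keep both the error and the call stack at the point of failure.
with_trace <- function(expr) {
  calls <- NULL
  result <- tryCatch(
    withCallingHandlers(
      expr,
      error = function(e) {
        # sys.calls() still holds the frames of the failing code here,
        # because calling handlers run before the stack is unwound.
        calls <<- sys.calls()
      }
    ),
    error = function(e) e
  )
  list(result = result, calls = calls)
}

# Made-up stand-ins for pipeline steps, used only to show the output.
make_table <- function(items) stop("no rows left after filtering")
run_page <- function(items) make_table(items)

out <- with_trace(run_page(NULL))
stack <- vapply(as.list(out$calls), function(cl) deparse(cl)[1], character(1))
print(conditionMessage(out$result))
print(stack)
```

The printed stack lists every call that was active when the error was raised, so the innermost user-defined function is visible even when the error message itself only mentions group_by.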

qjhart commented 5 years ago

@Jane Carlen (jacarlen@ucdavis.edu), can you send along your outputs? I'd like to compare them, in particular the boxes and the parsed_items. I have to say it is strange: I see small differences in the outputs depending on who is running the code. I think small differences in Tesseract may be to blame.

Regarding your other question, I'm not much of an R debugger, but I'll add in a bunch of print statements.
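
The print statements could take the form of a small logging helper wrapped around each step. A sketch in base R; `log_step`, the step names, and the sample data are hypothetical and not taken from the repository:

```r
# Hypothetical helper (base R only): report the step name and the size of its
# result on stderr, then return the result unchanged so it can wrap any step.
log_step <- function(name, x) {
  size <- if (is.null(x)) {
    "NULL"
  } else if (is.data.frame(x)) {
    sprintf("%d rows x %d cols", nrow(x), ncol(x))
  } else {
    sprintf("%s of length %d", class(x)[1], length(x))
  }
  message(sprintf("[step] %s: %s", name, size))
  invisible(x)
}

# Made-up data standing in for one page of parsed items.
items <- data.frame(text = c("item one", "item two"), price = c(2.49, 3.95))
items <- log_step("read parsed items", items)
hits <- log_step("filter by price", items[items$price > 10, ])
name_table <- log_step("name table", NULL)
```

Because message() writes to stderr, these lines would appear next to the error output, and the last step logged before the failure narrows down where the NULL first appears.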

jcarlen commented 5 years ago

Attached, along with the versions of the images I used as input: examples.zip
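
With two sets of outputs in hand, the parsed_items files can be compared directly. A sketch in base R, assuming each parsed_items.RDS holds a data frame (the log above suggests this, but it is an assumption); `compare_parsed` and the sample data are hypothetical:

```r
# Hypothetical helper (base R only): summarise how two parsed_items files
# differ. Assumes each file holds a data frame.
compare_parsed <- function(path_a, path_b) {
  a <- readRDS(path_a)
  b <- readRDS(path_b)
  list(
    dims = rbind(a = dim(a), b = dim(b)),
    only_in_a = setdiff(names(a), names(b)),
    only_in_b = setdiff(names(b), names(a)),
    identical_content = isTRUE(all.equal(a, b))
  )
}

# Made-up data written to temporary files so the sketch runs on its own.
run_a <- data.frame(text = c("item one", "item two"), text_conf = c(91, 78))
run_b <- run_a[1, , drop = FALSE]
file_a <- tempfile(fileext = ".RDS")
file_b <- tempfile(fileext = ".RDS")
saveRDS(run_a, file_a)
saveRDS(run_b, file_b)

diffs <- compare_parsed(file_a, file_b)
print(diffs$dims)
```

Differences in row counts or columns between two runs on the same image would point toward the OCR stage rather than the table-conversion stage.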

Yes, some print statements to know what step it’s on would be helpful. Thanks.