ucd-library / wine-price-extraction

This repository relates to Template, and Machine Extraction of Wine Prices from Sherry Lehmann Catalogs.
MIT License

verify that text_conf is included in wine_database #18

Open qjhart opened 5 years ago

qjhart commented 5 years ago

When running the code one file at a time, you occasionally see this problem. Earlier in the code, the exclude1 parameter is defined.

https://github.com/ucd-library/wine-price-extraction/blob/9c667ecb83c7c6fbdf790cb50c8f820ea4a0f068/dsi/scripts/run_wine_database_one_page.R#L96

Later in the code, this is used to get the text_conf column. https://github.com/ucd-library/wine-price-extraction/blob/9c667ecb83c7c6fbdf790cb50c8f820ea4a0f068/dsi/scripts/run_wine_database_one_page.R#L152

However, not all pages include text_conf in the exclude1 parameter. For example, some pages fail, e.g. d7js34-015, UCD_Lehmann_3372, where the code dies at:

Warning message:
In ENTRY_NAME$name_id != name_output$name_id :
  longer object length is not a multiple of shorter object length
[1] "dictionary_hits"     "dictionary_hits_sim"
Error in data.frame(text = "", confidence = 0, name_id = ENTRY_NAME$name_id[i],  :
  arguments imply differing number of rows: 1, 0
Calls: lapply -> lapply -> FUN -> data.frame
Execution halted

Note, if you add the line

exclude1=c("text_conf","dictionary_hits","dictionary_hits_sim")

before the NAME_MATCH step, these errors seem to go away. I'm not sure that's the best solution.
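
A variant of that workaround, as a rough sketch only: rather than overwriting exclude1, make sure "text_conf" is present whatever value was set earlier. The starting value below is taken from the printed output above; how exclude1 is consumed downstream is not shown here, so treat this as an assumption about intent and not as the script's actual logic.

```r
# Assumed starting value: what the failing run printed for exclude1.
exclude1 <- c("dictionary_hits", "dictionary_hits_sim")

# Add "text_conf" only if it is missing; union() drops duplicates,
# so this is safe to run even on pages where it is already present.
exclude1 <- union("text_conf", exclude1)

print(exclude1)
```

This yields the same three names as the hard-coded line, but would not discard any other entries a page might have added.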

jcarlen commented 5 years ago

Trying to figure this out and I have a question. The output contains:

"In ENTRY_NAME$name_id != name_output$name_id : longer object length is not a multiple of shorter object length"

I think this shouldn’t happen if you only have one file. ENTRY_NAME$name_id should be equal to name_output$name_id (and it is when I run the file), which means their lengths should definitely be equal. Can you inspect those and see the difference?
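
One quick way to inspect the difference, sketched with a few made-up stand-in vectors (entry_ids and output_ids are hypothetical names for ENTRY_NAME$name_id and name_output$name_id, not variables from the script):

```r
# Hypothetical stand-ins for ENTRY_NAME$name_id and name_output$name_id.
entry_ids  <- c("d7js34-015_1_1", "d7js34-015_2_44",
                "d7js34-015_data1_1_1", "d7js34-015_data1_0_1")
output_ids <- c("d7js34-015_1_1", "d7js34-015_2_44")

# Lengths should match if both come from the same single page.
same_length <- length(entry_ids) == length(output_ids)

# setdiff() lists the ids that appear on one side only.
only_in_entry  <- setdiff(entry_ids, output_ids)
only_in_output <- setdiff(output_ids, entry_ids)

print(same_length)
print(only_in_entry)
print(only_in_output)
```

Comparing with setdiff() avoids the element-wise `!=`, which recycles the shorter vector and is what triggers the warning when the lengths differ.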

I will try to get your test-one.sh working for me for future troubleshooting.

On Jun 18, 2019, at 3:46 PM, Quinn Hart notifications@github.com wrote:

In ENTRY_NAME$name_id != name_output$name_id : longer object length is not a multiple of shorter object length

qjhart commented 5 years ago

Here are the name_ids. ENTRY_NAME has some _data entries in there?

[1] "ENTRY_NAME"
 [1] "d7js34-015_1_1"       "d7js34-015_1_2"       "d7js34-015_1_3"                                              
 [4] "d7js34-015_1_4"       "d7js34-015_1_5"       "d7js34-015_1_6"                                              
 [7] "d7js34-015_1_7"       "d7js34-015_1_8"       "d7js34-015_1_9"                                              
[10] "d7js34-015_1_10"      "d7js34-015_1_11"      "d7js34-015_1_12"                                             
[13] "d7js34-015_1_13"      "d7js34-015_1_14"      "d7js34-015_1_15"                                             
[16] "d7js34-015_1_16"      "d7js34-015_1_17"      "d7js34-015_1_18"                                             
[19] "d7js34-015_1_19"      "d7js34-015_1_20"      "d7js34-015_1_21"                                             
[22] "d7js34-015_1_22"      "d7js34-015_1_23"      "d7js34-015_1_24"                                             
[25] "d7js34-015_1_25"      "d7js34-015_1_26"      "d7js34-015_1_27"                                             
[28] "d7js34-015_1_28"      "d7js34-015_1_29"      "d7js34-015_1_30"                                             
[31] "d7js34-015_1_31"      "d7js34-015_1_32"      "d7js34-015_1_33"                                             
[34] "d7js34-015_1_34"      "d7js34-015_1_35"      "d7js34-015_1_36"     
[37] "d7js34-015_2_1"       "d7js34-015_2_2"       "d7js34-015_2_3"      
[40] "d7js34-015_2_4"       "d7js34-015_2_5"       "d7js34-015_2_6"      
[43] "d7js34-015_2_7"       "d7js34-015_2_8"       "d7js34-015_2_9"      
[46] "d7js34-015_2_10"      "d7js34-015_2_11"      "d7js34-015_2_12"     
[49] "d7js34-015_2_13"      "d7js34-015_2_14"      "d7js34-015_2_15"     
[52] "d7js34-015_2_16"      "d7js34-015_2_17"      "d7js34-015_2_18"     
[55] "d7js34-015_2_19"      "d7js34-015_2_20"      "d7js34-015_2_21"     
[58] "d7js34-015_2_22"      "d7js34-015_2_23"      "d7js34-015_2_24"     
[61] "d7js34-015_2_25"      "d7js34-015_2_26"      "d7js34-015_2_27"     
[64] "d7js34-015_2_28"      "d7js34-015_2_29"      "d7js34-015_2_30"     
[67] "d7js34-015_2_31"      "d7js34-015_2_32"      "d7js34-015_2_33"     
[70] "d7js34-015_2_34"      "d7js34-015_2_35"      "d7js34-015_2_36"     
[73] "d7js34-015_2_37"      "d7js34-015_2_38"      "d7js34-015_2_39"     
[76] "d7js34-015_2_40"      "d7js34-015_2_41"      "d7js34-015_2_42"     
[79] "d7js34-015_2_43"      "d7js34-015_2_44"      "d7js34-015_data1_1_1"
[82] "d7js34-015_data1_1_2" "d7js34-015_data1_0_1" "d7js34-015_data1_0_2"
[1] "name_output"
 [1] "d7js34-015_1_1"  "d7js34-015_1_2"  "d7js34-015_1_3"  "d7js34-015_1_4" 
 [5] "d7js34-015_1_5"  "d7js34-015_1_6"  "d7js34-015_1_7"  "d7js34-015_1_8" 
 [9] "d7js34-015_1_9"  "d7js34-015_1_10" "d7js34-015_1_11" "d7js34-015_1_12"
[13] "d7js34-015_1_13" "d7js34-015_1_14" "d7js34-015_1_15" "d7js34-015_1_16"
[17] "d7js34-015_1_17" "d7js34-015_1_18" "d7js34-015_1_19" "d7js34-015_1_20"
[21] "d7js34-015_1_21" "d7js34-015_1_22" "d7js34-015_1_23" "d7js34-015_1_24"
[25] "d7js34-015_1_25" "d7js34-015_1_26" "d7js34-015_1_27" "d7js34-015_1_28"
[29] "d7js34-015_1_29" "d7js34-015_1_30" "d7js34-015_1_31" "d7js34-015_1_32"
[33] "d7js34-015_1_33" "d7js34-015_1_34" "d7js34-015_1_35" "d7js34-015_1_36"
[37] "d7js34-015_2_1"  "d7js34-015_2_2"  "d7js34-015_2_3"  "d7js34-015_2_4" 
[41] "d7js34-015_2_5"  "d7js34-015_2_6"  "d7js34-015_2_7"  "d7js34-015_2_8" 
[45] "d7js34-015_2_9"  "d7js34-015_2_10" "d7js34-015_2_11" "d7js34-015_2_12"
[49] "d7js34-015_2_13" "d7js34-015_2_14" "d7js34-015_2_15" "d7js34-015_2_16"
[53] "d7js34-015_2_17" "d7js34-015_2_18" "d7js34-015_2_19" "d7js34-015_2_20"
[57] "d7js34-015_2_21" "d7js34-015_2_22" "d7js34-015_2_23" "d7js34-015_2_24"
[61] "d7js34-015_2_25" "d7js34-015_2_26" "d7js34-015_2_27" "d7js34-015_2_28"
[65] "d7js34-015_2_29" "d7js34-015_2_30" "d7js34-015_2_31" "d7js34-015_2_32"
[69] "d7js34-015_2_33" "d7js34-015_2_34" "d7js34-015_2_35" "d7js34-015_2_36"
[73] "d7js34-015_2_37" "d7js34-015_2_38" "d7js34-015_2_39" "d7js34-015_2_40"
[77] "d7js34-015_2_41" "d7js34-015_2_42" "d7js34-015_2_43" "d7js34-015_2_44"
jcarlen commented 5 years ago

I think this comes from parseFolder not effectively filtering out data1 objects if they're in the same folder as the .RDS output. I just committed an adjustment to the regex for that, which should fix it, but let me know if the _data entries are still there. (https://github.com/ucd-library/wine-price-extraction/commit/4e67aba3b8a109e5ac9538b20ff3a168e62be005#diff-3c565c483b1f64ad72b8e506bf482b1d)
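
To check for leftovers after the change, something like the following could be run on the name_id vector. This is only a sketch: the ids are copied from the listing above, and the pattern is a guess at what a "_data" entry looks like, not the regex used in parseFolder.

```r
# A few ids copied from the ENTRY_NAME listing above.
ids <- c("d7js34-015_2_44", "d7js34-015_data1_1_1", "d7js34-015_data1_0_2")

# Assumed pattern: "_data" followed by digits and an underscore.
leftover <- grep("_data[0-9]+_", ids, value = TRUE)

print(leftover)
```

An empty result would suggest the data1 objects are no longer being picked up.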

qjhart commented 5 years ago

@jcarlen we discussed the fact that the above fix was not a complete fix. Here are the results of running this for the same item as above, d7js34-015. It's still the same error as before. I hope you were able to get the test-one.sh script working.

dsi/scripts/test-one.sh d7js34-015
....
Loading required package: ggplot2
[1] "truth.dir=/opt/dsi/Data"         "in=d7js34-015/parsed_folder.RDS"
[1] "ENTRY_NAME"
 [1] "d7js34-015_1_1"  "d7js34-015_1_2"  "d7js34-015_1_3"  "d7js34-015_1_4" 
 [5] "d7js34-015_1_5"  "d7js34-015_1_6"  "d7js34-015_1_7"  "d7js34-015_1_8" 
 [9] "d7js34-015_1_9"  "d7js34-015_1_10" "d7js34-015_1_11" "d7js34-015_1_12"
[13] "d7js34-015_1_13" "d7js34-015_1_14" "d7js34-015_1_15" "d7js34-015_1_16"
[17] "d7js34-015_1_17" "d7js34-015_1_18" "d7js34-015_1_19" "d7js34-015_1_20"
[21] "d7js34-015_1_21" "d7js34-015_1_22" "d7js34-015_1_23" "d7js34-015_1_24"
[25] "d7js34-015_1_25" "d7js34-015_1_26" "d7js34-015_1_27" "d7js34-015_1_28"
[29] "d7js34-015_1_29" "d7js34-015_1_30" "d7js34-015_1_31" "d7js34-015_1_32"
[33] "d7js34-015_1_33" "d7js34-015_1_34" "d7js34-015_1_35" "d7js34-015_1_36"
[37] "d7js34-015_2_1"  "d7js34-015_2_2"  "d7js34-015_2_3"  "d7js34-015_2_4" 
[41] "d7js34-015_2_5"  "d7js34-015_2_6"  "d7js34-015_2_7"  "d7js34-015_2_8" 
[45] "d7js34-015_2_9"  "d7js34-015_2_10" "d7js34-015_2_11" "d7js34-015_2_12"
[49] "d7js34-015_2_13" "d7js34-015_2_14" "d7js34-015_2_15" "d7js34-015_2_16"
[53] "d7js34-015_2_17" "d7js34-015_2_18" "d7js34-015_2_19" "d7js34-015_2_20"
[57] "d7js34-015_2_21" "d7js34-015_2_22" "d7js34-015_2_23" "d7js34-015_2_24"
[61] "d7js34-015_2_25" "d7js34-015_2_26" "d7js34-015_2_27" "d7js34-015_2_28"
[65] "d7js34-015_2_29" "d7js34-015_2_30" "d7js34-015_2_31" "d7js34-015_2_32"
[69] "d7js34-015_2_33" "d7js34-015_2_34" "d7js34-015_2_35" "d7js34-015_2_36"
[73] "d7js34-015_2_37" "d7js34-015_2_38" "d7js34-015_2_39" "d7js34-015_2_40"
[77] "d7js34-015_2_41" "d7js34-015_2_42" "d7js34-015_2_43" "d7js34-015_2_44"
[1] "name_output"
 [1] "d7js34-015_1_1"  "d7js34-015_1_2"  "d7js34-015_1_3"  "d7js34-015_1_4" 
 [5] "d7js34-015_1_5"  "d7js34-015_1_6"  "d7js34-015_1_7"  "d7js34-015_1_8" 
 [9] "d7js34-015_1_9"  "d7js34-015_1_10" "d7js34-015_1_11" "d7js34-015_1_12"
[13] "d7js34-015_1_13" "d7js34-015_1_14" "d7js34-015_1_15" "d7js34-015_1_16"
[17] "d7js34-015_1_17" "d7js34-015_1_18" "d7js34-015_1_19" "d7js34-015_1_20"
[21] "d7js34-015_1_21" "d7js34-015_1_22" "d7js34-015_1_23" "d7js34-015_1_24"
[25] "d7js34-015_1_25" "d7js34-015_1_26" "d7js34-015_1_27" "d7js34-015_1_28"
[29] "d7js34-015_1_29" "d7js34-015_1_30" "d7js34-015_1_31" "d7js34-015_1_32"
[33] "d7js34-015_1_33" "d7js34-015_1_34" "d7js34-015_1_35" "d7js34-015_1_36"
[37] "d7js34-015_2_1"  "d7js34-015_2_2"  "d7js34-015_2_3"  "d7js34-015_2_4" 
[41] "d7js34-015_2_5"  "d7js34-015_2_6"  "d7js34-015_2_7"  "d7js34-015_2_8" 
[45] "d7js34-015_2_9"  "d7js34-015_2_10" "d7js34-015_2_11" "d7js34-015_2_12"
[49] "d7js34-015_2_13" "d7js34-015_2_14" "d7js34-015_2_15" "d7js34-015_2_16"
[53] "d7js34-015_2_17" "d7js34-015_2_18" "d7js34-015_2_19" "d7js34-015_2_20"
[57] "d7js34-015_2_21" "d7js34-015_2_22" "d7js34-015_2_23" "d7js34-015_2_24"
[61] "d7js34-015_2_25" "d7js34-015_2_26" "d7js34-015_2_27" "d7js34-015_2_28"
[65] "d7js34-015_2_29" "d7js34-015_2_30" "d7js34-015_2_31" "d7js34-015_2_32"
[69] "d7js34-015_2_33" "d7js34-015_2_34" "d7js34-015_2_35" "d7js34-015_2_36"
[73] "d7js34-015_2_37" "d7js34-015_2_38" "d7js34-015_2_39" "d7js34-015_2_40"
[77] "d7js34-015_2_41" "d7js34-015_2_42" "d7js34-015_2_43" "d7js34-015_2_44"
[1] 0
[1] "dictionary_hits"     "dictionary_hits_sim"
Error in data.frame(text = "", confidence = 0, name_id = ENTRY_NAME$name_id[i],  : 
  arguments imply differing number of rows: 1, 0
Calls: lapply -> lapply -> FUN -> data.frame
Execution halted
jcarlen commented 5 years ago

I'm not able to replicate this problem locally. I think it's caused by ENTRY_NAME$name_id not having an ith entry in some case (where i is between 1 and length(NAME_MATCH)), but I'm not sure beyond that.
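
For reference, a small standalone sketch of how base R produces that error message. The vector ids is a hypothetical stand-in for ENTRY_NAME$name_id. Note that an out-of-range index yields NA (length 1), so the zero-length case would more likely come from a missing (NULL) column or a zero-length index; that is an inference from base R behavior, not something confirmed against the script.

```r
# Hypothetical stand-in for ENTRY_NAME$name_id.
ids <- c("a_1_1", "a_1_2")

len_out_of_range <- length(ids[5])           # out-of-range index gives NA
len_empty_index  <- length(ids[integer(0)])  # zero-length index gives character(0)
len_null_column  <- length(NULL[1])          # indexing NULL (e.g. a missing column) gives NULL

# Reproduce the error: two length-1 arguments and one length-0 argument.
msg <- tryCatch(
  data.frame(text = "", confidence = 0, name_id = ids[integer(0)]),
  error = function(e) conditionMessage(e)
)

print(c(len_out_of_range, len_empty_index, len_null_column))
print(msg)
```

If that is what is happening, checking `length(ENTRY_NAME$name_id[i])` just before the data.frame() call would show which i triggers it.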

I'm also not able to get test-one.sh to run, either locally or with docker. I'm new to docker, so would you or Justin be able to help me troubleshoot my setup?

qjhart commented 5 years ago

@jcarlen, hmm, for whatever reason, applying the updates for the boxes seems to have fixed this issue. At least for the example above :)

qjhart commented 5 years ago

@jrmerz, if you have a chance to touch base w/ @jcarlen re: getting her docker config set up and running the test-one.sh file, that would be great.