ucd-library / wine-price-extraction

This repository relates to Template, and Machine Extraction of Wine Prices from Sherry Lehmann Catalogs.
MIT License

Run parse_item on a per page basis #10

Closed qjhart closed 5 years ago

qjhart commented 5 years ago

The current setup has run_parse_items.R running all of the parse_items steps and consolidating the results. I think it's probably better to run this as a per-page process, so that it can run in the cloud. I've checked in a new branch, parse_item_cloud, that adds a library to the Dockerfile.

To do the least harm, I run this script like:

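# Visit every per-page file (e.g. *-001.RDS). Once a directory has its output file, later pages in it are skipped.
for i in $(find /io/sloan-ocr/catalogs -name '*-[0-9][0-9][0-9].RDS'); do
  d=$(dirname "$i")
  echo "$d"
  if [[ -f "$d/parse_folder_sample.RDS" ]]; then
    echo "$d/parse_folder_sample.RDS exists"
  else
    # Input and output share the same directory, so the result lands next to the page files.
    Rscript --vanilla run_parse_items.R name.input.dir="$d" dictionary.dir=/io/dsiData/dictionaries name.output.dir="$d"
  fi
done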

where /io/sloan-ocr matches the cloud directory. This puts the somewhat strangely named parse_folder_sample.RDS file in the directory. There are possibly better ways to run a single file, but the parseFolder function behaves differently.
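
One possible variation, as a rough sketch only: since the loop above walks every page file but only acts once per directory, the directories could be deduplicated up front. The helper names list_page_dirs and run_pending_dirs are hypothetical and not part of the repository, and the driver only prints the Rscript command rather than running it. It also assumes directory names contain no newlines.

```shell
#!/usr/bin/env bash
# Sketch: list each catalog directory once, instead of once per page file.
# list_page_dirs and run_pending_dirs are hypothetical helpers for illustration.

list_page_dirs() {
  local root="$1"
  # Match per-page files such as foo-001.RDS, print each parent directory,
  # then collapse duplicates so every directory appears exactly once.
  find "$root" -type f -name '*-[0-9][0-9][0-9].RDS' -exec dirname {} \; | sort -u
}

run_pending_dirs() {
  local root="$1" d
  list_page_dirs "$root" | while IFS= read -r d; do
    if [[ -f "$d/parse_folder_sample.RDS" ]]; then
      # Same guard as the original loop: skip directories already parsed.
      echo "$d/parse_folder_sample.RDS exists"
    else
      # Dry run: print the command from the issue instead of executing it.
      echo "would run: Rscript --vanilla run_parse_items.R name.input.dir=$d dictionary.dir=/io/dsiData/dictionaries name.output.dir=$d"
    fi
  done
}
```

Calling run_pending_dirs /io/sloan-ocr/catalogs would then print one line per directory; replacing the second echo with the actual Rscript call would make it equivalent to the loop above.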

qjhart commented 5 years ago

@jrmerz I'm fully committed to getting this operational, so let me know if you want any modifications to the current setup. The current setup draws the library data from CSV files, so there shouldn't be any db issues. I could fairly easily redo the script to a per-file setup, but I don't think that's a slow-down.

qjhart commented 5 years ago

@jcarlen, can you look this branch over and let me know if there is a better way to do this? I'm trying to produce a more complete per-page CSV output method. This looks like the next step after what Justin has coded in the cloud so far. I'm looking into the final script now.

qjhart commented 5 years ago

Modified original issue to eliminate absolute path.