ucd-library / wine-price-extraction

This repository relates to Template, and Machine Extraction of Wine Prices from Sherry Lehmann Catalogs.
MIT License

Proper Labeling #9

Closed qjhart closed 1 year ago

qjhart commented 6 years ago

In general, what is the preferred name of the wine? The whole line between id and price, e.g. "CHATEAU GREYSAC 1964 (Medoc)", or just the name, i.e. "CHATEAU GREYSAC" (assuming we are able to trim region / producer, year and other info)? What information is essential for the researchers who will be working with our output?

Generally, you should copy the entry faithfully as the name. The exceptions are the vintage, which has its own separate column, and phrases that are obviously not part of the title, e.g. "A great Treat!" in: 1993 Moet&Chandon, A great Treat! ...... 34.00

In addition, you should preserve the original capitalization and punctuation such as commas and parentheses.

There are two reasons for this. First, the rules for the names of wines, producers, etc. are very complicated, and not something that non-experts can easily determine. (This is why in the end we even removed the notion of a producer.)

The second reason is that this will help determine things like: what were the big selling points for a wine (St. Emilion?), were they based on the wine naming standards, and did they stay that way throughout time?
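The copy-faithfully rule above could be sketched roughly as follows. This is only an illustration, not the project's parser: the function name `split_name_and_vintage` and the assumption that a vintage is a standalone four-digit year between 1800 and 2099 are hypothetical. It keeps capitalization, commas and parentheses untouched, and it does not try to detect marketing phrases such as "A great Treat!".

```python
import re

# Assumption: a vintage is a standalone four-digit year from 1800 to 2099.
VINTAGE_RE = re.compile(r"\b(1[89]\d{2}|20\d{2})\b")


def split_name_and_vintage(entry):
    """Return (name, vintage) for the text between the id and the price.

    The name is the entry copied as written, minus the first year found.
    Capitalization, commas and parentheses are preserved.
    """
    match = VINTAGE_RE.search(entry)
    if match is None:
        return entry.strip(), None

    vintage = int(match.group(1))
    # Cut the year out, then tidy the whitespace left behind.
    name = entry[:match.start()] + entry[match.end():]
    name = re.sub(r"\s+", " ", name)      # collapse doubled spaces
    name = re.sub(r"\s+,", ",", name)     # no space before a comma
    return name.strip(), vintage
```

For example, `split_name_and_vintage("CHATEAU GREYSAC 1964 (Medoc)")` gives the name "CHATEAU GREYSAC (Medoc)" and the vintage 1964, while an entry without a year is returned unchanged with a vintage of `None`.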

qjhart commented 6 years ago

3971 - same problem. Wines at the top start with "Medoc" and "St. Emilion" (175, 185), then have "Chateau something", while other wines start with "Chateau ..." followed by "Medoc" and "St. Emilion" (173, 268)

image

image

qjhart commented 6 years ago

3408 - several wines at the bottom (286, 638, 261, 660, ...) have names consisting of two (or more) phrases written in capital letters separated by a comma, e.g. 286 MOREY, CLOS DE LA ROCHE, CUVEE VIELLES VIGNES. What's the name here? In most cases names are two to four words (upper-case), then a comma or year, then region / producer (lower-case).

image

qjhart commented 6 years ago

0006 - bottom right - should we include year in the name since all wines have the same name?

No. The wine name should be the same; the vintage changes in those examples.

image

qjhart commented 6 years ago

1459 - wines 350 and 342 start with "Saint Emilion" and "Graves", which are often used as regions in other wines (e.g. wine 1108 on the same page). Additionally, the section title says that Talleyrand is the region for these wines. What should be the name for wines like these two (350, 342)?

Still, just copy the name as written, including both the name and the region.

image

image

jcarlen commented 5 years ago

Thanks for all this information. Based on this, Stan will create the following fields based on the Name, recognizing that only the "Name" will always be filled in and other information may be missing:

Name - Includes everything in the name box in original fonts/styles, e.g. "SAINT EMILION 1975, Talleyrand"

The following fields, which are also in PTV: Country; Vintage (Year); Wine Type (Still, Sparkling, Fortified); Wine Color (Red, White, Rose)

Additional fields that may incorporate Kaggle wine data: Province/State; Region; Designation (e.g. Reserve, Estate); Variety (e.g. Pinot Noir, Chardonnay)

After Stan's parser does its thing, we will attempt to fill in missing fields above using other information on the page, e.g. the year may be above the price table and the wine color may be in the page header.
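A minimal sketch of that fill-in step, assuming the parser output and the page-level information are both available as plain dictionaries keyed by field name. The names `FIELDS` and `fill_missing`, and the dictionary layout, are hypothetical and not taken from the actual pipeline.

```python
# Field names follow the list in this comment; the layout is an assumption.
FIELDS = [
    "Name", "Country", "Vintage", "Wine Type", "Wine Color",
    "Province/State", "Region", "Designation", "Variety",
]


def fill_missing(parsed, page_context):
    """Fill fields the parser left empty using page-level information.

    Values found by the parser always win. "Name" is never filled from
    the page, since it must come from the entry itself.
    """
    record = {field: parsed.get(field) for field in FIELDS}
    for field in FIELDS:
        if field == "Name":
            continue
        if record[field] is None and page_context.get(field) is not None:
            record[field] = page_context[field]
    return record


parsed = {"Name": "SAINT EMILION, Talleyrand"}
page_context = {"Vintage": 1975, "Wine Color": "Red"}  # e.g. from a page header
record = fill_missing(parsed, page_context)
```

Fields that neither source provides stay `None`, which matches the expectation that only "Name" is always filled in.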

ssaganowski commented 5 years ago

Name - Includes everything in the name box in original fonts/styles, e.g. "SAINT EMILION 1975, Talleyrand"

I think the final decision is everything except the vintage, so in your example "SAINT EMILION, Talleyrand"

jcarlen commented 5 years ago

Ah yes, you're correct. Let's still retain a field in our output called something like Name_raw with the full name (including year) as it may be useful for quantifying OCR success.
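One possible way to use Name_raw for quantifying OCR success is to compare it against a hand-transcribed reference string. The sketch below uses `difflib.SequenceMatcher` from the Python standard library; the function name `ocr_similarity` and the choice of this particular similarity measure are assumptions for illustration, not a decision made in this thread.

```python
from difflib import SequenceMatcher


def ocr_similarity(name_raw, reference):
    """Return a similarity score between 0.0 and 1.0.

    1.0 means the OCR text in Name_raw matches the hand-transcribed
    reference exactly; lower values mean more character-level differences.
    """
    return SequenceMatcher(None, name_raw, reference).ratio()


reference = "SAINT EMILION 1975, Talleyrand"
perfect = ocr_similarity("SAINT EMILION 1975, Talleyrand", reference)
one_error = ocr_similarity("SAINT EMIL1ON 1975, Talleyrand", reference)  # "I" read as "1"
```

Comparing against Name_raw rather than Name keeps the year in the comparison, so OCR errors in the vintage digits are counted as well.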