Closed qjhart closed 1 year ago
3971 - same problem. Wines at the top start with "Medoc" and "St. Emilion" (175, 185), then have "Chateau something", while other wines start with "Chateau ..." followed by "Medoc" and "St. Emilion" (173, 268)
3408 - several wines at the bottom (286, 638, 261, 660, ...) have names consisting of two (or more) phrases written in capital letters separated by commas, e.g. 286 MOREY, CLOS DE LA ROCHE, CUVEE VIELLES VIGNES. What's the name here? In most cases a name is 2-4 upper-case words, then a comma or year, then the region / producer in lower case.
0006 - bottom right - should we include year in the name since all wines have the same name? No. The wine name should be the same, the vintage changes in those examples.
1459 - wines 350 and 342 start with "Saint Emilion" and "Graves", which are often used as regions for other wines (e.g. wine 1108 on the same page). Additionally, the section title says that Talleyrand is the region for these wines. What should be the name for wines like these two (350, 342)?
Still just copy the name as written, include the name and region.
Thanks for all this information. Based on this, Stan will create the following fields based on the Name, recognizing that only the "Name" will always be filled in and other information may be missing:
Name - Includes everything in the name box in original fonts/styles, e.g. "SAINT EMILION 1975, Talleyrand"
The following fields, which are also in PTV: Country; Vintage (Year); Wine Type (Still, Sparkling, Fortified); Wine Color (Red, White, Rose)
Additional fields that may incorporate Kaggle wine data: Province/State; Region; Designation (e.g. Reserve, Estate); Variety (e.g. Pinot Noir, Chardonnay)
After Stan's parser does its thing, we will attempt to fill in missing fields above using other information on the page, e.g. the year may be above the price table and the wine color may be in the page header.
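As a rough illustration of the field list and the fill-in step described above, here is a minimal Python sketch. It is not Stan's parser: the class name `WineRecord`, the function `fill_missing`, the snake_case field names, and the shape of `page_context` are all assumptions made for this example, and the page-level values are invented.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class WineRecord:
    # Only `name` is guaranteed; every other field may be missing after parsing.
    name: str
    country: Optional[str] = None
    vintage: Optional[int] = None
    wine_type: Optional[str] = None    # Still, Sparkling, Fortified
    color: Optional[str] = None        # Red, White, Rose
    province: Optional[str] = None     # Province/State
    region: Optional[str] = None
    designation: Optional[str] = None  # e.g. Reserve, Estate
    variety: Optional[str] = None      # e.g. Pinot Noir, Chardonnay


def fill_missing(record: WineRecord, page_context: dict) -> WineRecord:
    """Copy page-level values into fields the parser left empty.

    `page_context` is a hypothetical dict keyed by field name, e.g. a year
    found above the price table or a color found in the page header.
    Values already set on the record are never overwritten.
    """
    for f in fields(record):
        if getattr(record, f.name) is None and f.name in page_context:
            setattr(record, f.name, page_context[f.name])
    return record


# The parser found the vintage but not the color; the page header supplies it.
rec = WineRecord(name="SAINT EMILION 1975, Talleyrand", vintage=1975)
fill_missing(rec, {"color": "Red", "vintage": 1970})
```

Here the record's own vintage wins over the page-level one, on the assumption that information from the entry itself is more specific than information from the page.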
Name - Includes everything in the name box in original fonts/styles, e.g. "SAINT EMILION 1975, Talleyrand"
I think the final decision is everything except the vintage, so in your example "SAINT EMILION, Talleyrand"
Ah yes, you're correct. Let's still retain a field in our output called something like Name_raw with the full name (including year) as it may be useful for quantifying OCR success.
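One way to derive the vintage-free name while keeping the raw one could look like the sketch below. The helper `split_vintage` and its year pattern (a standalone four-digit number from 1800 to 2099) are assumptions for illustration only; the sketch keeps capitalization, commas and parentheses as written and does not try to remove anything other than the year.

```python
import re

# Assumed vintage pattern: a standalone four-digit year, 1800 to 2099.
YEAR = re.compile(r"\b(1[89]\d{2}|20\d{2})\b")


def split_vintage(name_raw):
    """Return (name, vintage) where name is name_raw with the first year removed."""
    m = YEAR.search(name_raw)
    if m is None:
        return name_raw.strip(), None
    name = name_raw[:m.start()] + name_raw[m.end():]
    name = re.sub(r"\s+([,)])", r"\1", name)     # drop the space left before "," or ")"
    name = re.sub(r"\s{2,}", " ", name).strip()  # collapse doubled spaces
    return name, int(m.group(1))


name_raw = "SAINT EMILION 1975, Talleyrand"
name, vintage = split_vintage(name_raw)
# name == "SAINT EMILION, Talleyrand", vintage == 1975; name_raw is kept as-is
```

Keeping `name_raw` untouched next to the derived `name` makes it possible to compare OCR output against the page later without re-running the parser.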
In general, what is the preferred name of the wine? The whole line between id and price, e.g. "CHATEAU GREYSAC 1964 (Medoc)", or just the name, i.e. "CHATEAU GREYSAC" (assuming we are able to trim region / producer, year and other info)? What information is essential for the researchers who will be working with our output?
Generally, you should copy the entry faithfully as the name, with two exceptions: the vintage, which has its own separate column, and phrases that are obviously not part of the title, e.g.
1993 Moet&Chandon, A great Treat! ...... 34.00
In addition, you should maintain the use of capitalization, and items like commas and parens.
There are two reasons for this. First, the rules for the names of wines, producers, etc. are very complicated, and not something that non-experts can easily determine. (This is why in the end we even removed the notion of a producer.)
The second reason is that this will help determine things like: what were the big selling points for a wine (St. Emilion?), were they based on the wine naming standards, and did they stay that way throughout time?