hhursev / recipe-scrapers

Python package for scraping recipes data
MIT License

Errors parsing strings with fractional characters #1004

Open espencer127 opened 7 months ago

espencer127 commented 7 months ago

The URL of the recipe(s) that are not being scraped correctly

The results you expect to see

In the result I posted below, the fractions ½ and ⅔ in the ingredients block are causing problems for me. I would like to see fractional characters represented like so -

1/2 2/3
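
To make the requested output concrete, here is a minimal sketch of one way to turn single-character fractions into ASCII text. It is not the fix from the commit linked in this issue and not part of the recipe-scrapers API; the function name `ascii_fractions` is hypothetical. It relies on the standard library's `unicodedata.decomposition`, which reports a `<fraction>` compatibility decomposition for characters such as ½ and ⅔.

```python
import unicodedata


def ascii_fractions(text: str) -> str:
    """Replace single-character fractions (e.g. U+00BD, U+2154) with ASCII like 1/2.

    Hypothetical helper for illustration; not part of recipe-scrapers.
    """
    out = []
    for ch in text:
        decomp = unicodedata.decomposition(ch)
        if decomp.startswith("<fraction>"):
            # The decomposition looks like "<fraction> 0032 2044 0033".
            codepoints = decomp.split()[1:]
            expanded = "".join(chr(int(cp, 16)) for cp in codepoints)
            # U+2044 is the FRACTION SLASH; swap it for a plain ASCII slash.
            out.append(expanded.replace("\u2044", "/"))
        else:
            out.append(ch)
    return "".join(out)


print(ascii_fractions("½ cup ketchup"))
print(ascii_fractions("2 ⅔ tablespoons honey"))
```

One caveat with this approach: a mixed number written without a space, such as "1½", would come out as "11/2", so a real implementation would probably need to insert a space in that case.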

The results (including any Python error messages) that you are seeing

The scraper is outputting an accurate object, but it contains characters that cannot be ingested into my Java application. For instance, for the recipe whose canonical_url appears in the output, I get the following response -

{
    "author": "John Chandler",
    "canonical_url": "https://www.allrecipes.com/recipe/235158/worlds-best-honey-garlic-pork-chops/",
    "category": "Entree,Dinner",
    "cook_time": 20,
    "cuisine": "American",
    "description": "These glazed honey garlic pork chops are quick and simple to cook on the grill in less than 30 minutes for a mouth-watering meal.",
    "host": "allrecipes.com",
    "image": "https://www.allrecipes.com/thmb/xijdHGCdDvaDbX0cZioCuboPPX4=/1500x0/filters:no_upscale():max_bytes(150000):strip_icc()/235158-worlds-best-honey-garlic-pork-chops-DDMFS-4x3-6d16c2884cdd407eb8e1e2f494791542.jpg",
    "ingredient_groups": [
        {
            "ingredients": [
                "½ cup ketchup",
                "2 ⅔ tablespoons honey",
                "2 tablespoons low-sodium soy sauce",
                "2 cloves garlic, crushed",
                "6 (4 ounce) (1-inch thick) pork chops"
            ],
            "purpose": None
        }
    ],
    "ingredients": [
        "½ cup ketchup",
        "2 ⅔ tablespoons honey",
        "2 tablespoons low-sodium soy sauce",
        "2 cloves garlic, crushed",
        "6 (4 ounce) (1-inch thick) pork chops"
    ],
    "instructions": "Preheat grill for medium heat and lightly oil the grate. Gather ingredients.\nWhisk ketchup, honey, soy sauce, and garlic together in a bowl to make a glaze.\nSear the pork chops on both sides on the preheated grill. Lightly brush glaze onto each side of the chops as they cook; grill until no longer pink in the center, about 7 to 9 minutes per side. An instant-read thermometer inserted into the center should read 145 degrees F (63 degrees C).\nServe hot and enjoy!",
    "instructions_list": [
        "Preheat grill for medium heat and lightly oil the grate. Gather ingredients.",
        "Whisk ketchup, honey, soy sauce, and garlic together in a bowl to make a glaze.",
        "Sear the pork chops on both sides on the preheated grill. Lightly brush glaze onto each side of the chops as they cook; grill until no longer pink in the center, about 7 to 9 minutes per side. An instant-read thermometer inserted into the center should read 145 degrees F (63 degrees C).",
        "Serve hot and enjoy!"
    ],
    "language": "en",
    "nutrients": {
        "calories": "290 kcal",
        "carbohydrateContent": "14 g",
        "cholesterolContent": "95 mg",
        "fiberContent": "0 g",
        "proteinContent": "30 g",
        "saturatedFatContent": "4 g",
        "sodiumContent": "419 mg",
        "sugarContent": "12 g",
        "fatContent": "13 g",
        "unsaturatedFatContent": "0 g"
    },
    "prep_time": 10,
    "ratings": 4.5,
    "site_name": "Allrecipes",
    "title": "World's Best Honey Garlic Pork Chops",
    "total_time": 30,
    "yields": "6 servings"
}

The fractional characters in this block cause the following error in my setup (the traceback comes from the Python script that prints the scraper output, not from Java itself) -

Traceback (most recent call last):
  File "C:\.......\scraper.py", line 18, in <module>
    print(scraper.to_json())
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.12_3.12.752.0_x64__qbz5n2kfra8p0\Lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'charmap' codec can't encode character '\u2154' in position 649: character maps to <undefined>
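
The traceback shows `print` failing while encoding to cp1252, and only for '\u2154' (⅔). The sketch below, which uses only the standard library, illustrates why: ½ has a cp1252 byte, ⅔ does not. It also shows one possible workaround that leaves the scraper untouched, namely serializing with `json.dumps`, whose default `ensure_ascii=True` escapes every non-ASCII character. The `data` dict is a hypothetical stand-in for the scraper output, and `can_encode` is an illustrative helper.

```python
import json

# Hypothetical stand-in for the scraper output; only these strings matter here.
data = {"ingredients": ["½ cup ketchup", "2 ⅔ tablespoons honey"]}


def can_encode(text: str, encoding: str) -> bool:
    """Return True if text can be encoded with the given codec."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


half_ok = can_encode("½", "cp1252")        # U+00BD exists in cp1252
two_thirds_ok = can_encode("⅔", "cp1252")  # U+2154 does not

# ensure_ascii defaults to True, so non-ASCII characters become \uXXXX escapes
# and the resulting string can be printed on any console encoding.
safe = json.dumps(data)
print(half_ok, two_thirds_ok)
print(safe)
```

Any JSON parser on the receiving side should decode the `\uXXXX` escapes back to the original characters. Another option, assuming the consumer reads UTF-8, would be to call `sys.stdout.reconfigure(encoding="utf-8")` before printing, or to set the `PYTHONIOENCODING` environment variable to `utf-8`.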

I've already identified a working fix; it can be seen in this commit on my fork -

https://github.com/espencer127/recipe-scrapers/commit/a07a1d59d9251ce87af2e8154ac178381773576e