Open slevin48 opened 1 year ago
https://jamesturk.net/posts/scraping-with-gpt-4/
import openai
html = requests.get(url)
completion = openai.ChatCompletion.create(
engine="gpt-4",
# this controls how long the JSON output can be,
# 2048 tokens is about 8,000 characters
# which should be more than enough
# (note: this impacts the cost of the request)
max_tokens=2048,
# temperature controls how random the output is
# 0 is completely deterministic
# which is what we want for scraping
temperature=0,
# at the time of writing I only had GPT-4
# access via the chat interface
messages=[
{
"text": 'Convert the given HTML to JSON with the schema'
'{"name": "string", "age": "number"}',
"user": "system",
},
{
"text": html.text,
"user": "user",
},
],
)
# extract JSON
data = json.loads(completion.choices[0]["message"]["content"])
https://community.openai.com/t/turn-any-website-into-an-api-with-gpt-4/145689 https://www.kadoa.com/playground
Example: Financial Data (Yahoo Finance)
DOM selectors
Extracted data
generated-scraper.py