incubrain / astrotribe

A global network of astronomers helping to inspire and educate the next generation.
https://astrotribe.vercel.app

feat: newsfeed from scraped data with chatGPT summaries #21

Open · Drew-Macgibbon opened this issue 1 year ago

Drew-Macgibbon commented 1 year ago

As I said, this needs to be something super simple to start; we can expand it later.

Working Files

High-Level Overview

  1. Manually scrape articles from ONE website in local dev environment
  2. Process into the required structure
  3. Store data in the database
  4. Create summaries via OpenAI
  5. Store summaries with the article
  6. Manually assign tags and category
  7. Display to users as a simple card
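
A rough sketch of how these steps could chain together (every helper name here is a hypothetical placeholder for the utils discussed later in this issue, not existing code):

```ts
// Hypothetical pipeline sketch; all helpers below are placeholders for the
// utils tracked in this issue (#91, #93, #97), not functions that exist yet.
declare function scrapeArticle(url: string): Promise<{ title: string; body: string }>
declare function processPost(raw: { title: string; body: string }): Record<string, unknown>
declare function generateSummary(body: string): Promise<Record<string, string>>
declare function upsertPost(post: Record<string, unknown>): Promise<void>

async function runPipeline(url: string) {
  const raw = await scrapeArticle(url)            // 1. scrape ONE site in local dev
  const post = processPost(raw)                   // 2. process into the required structure
  const summary = await generateSummary(raw.body) // 4. summaries via OpenAI
  await upsertPost({ ...post, summary })          // 3 + 5. store article + summaries in DB
  // 6. tags/category are assigned manually; 7. the newsfeed renders a simple card
}
```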

Essential Data

• Scraped Data: title / author name / author url / published date / updated date / original url / post body
• Automated Data: beginner / intermediate / expert summaries
• Manual Data: post category / tags
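
As a sketch, the unified post could be typed like this (field names are illustrative assumptions, not an agreed schema):

```ts
// Illustrative shape for a single unified post; names are assumptions, not final.
interface NewsfeedPost {
  // Scraped
  title: string
  authorName: string
  authorUrl: string
  publishedAt: string // ISO timestamp
  updatedAt: string   // ISO timestamp
  originalUrl: string
  body: string
  // Automated
  summary: { beginner: string; intermediate: string; expert: string }
  // Manual
  category: string
  tags: string[]
}
```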

Tasks To Consider

| Feature | Purpose | Difficulty | Approved | Completed |
| --- | --- | --- | --- | --- |
| Which Blog to Scrape | space.com | Medium | yes | yes |
| Use Browser Proxy to Avoid IP Blacklist #92 | Use Browser Proxy | Medium | yes | yes |
| What Data to Scrape/Store | | Low | yes | yes |
| DB Table Structure #94 | | Medium | yes | partial |
| Create OpenAI Summary | | Medium | yes | partial |
| Page: newsfeed page #75 | Newsfeed Page | Low | no | partial |
| UI: PostCard #76 | Post Component in UI | Low | no | partial |

Define Functionality

| Functionality | Purpose | Difficulty | Approved | Completed |
| --- | --- | --- | --- | --- |
| api: upsert post route #93 | Upsert Post Route in API | Medium | no | no |
| util: unify post data structure #91 | Unify Post Data Structure | Medium | no | no |
| util: summarise post function #97 | Summarise Post Function | Medium | no | no |
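
For the upsert route (#93), a minimal sketch of what a Nuxt 3 server route could look like, assuming supabase-js v2, Nitro's auto-imported defineEventHandler/readBody/createError, and a unique constraint on articles.link (the file path and logic are illustrative, not the final route):

```ts
// server/api/posts/upsert.post.ts (hypothetical path; a sketch, not the final route)
import { createClient } from '@supabase/supabase-js'

const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_KEY!)

export default defineEventHandler(async (event) => {
  const post = await readBody(event)
  // Upsert on the article link so re-scraping the same URL updates the existing row.
  const { data, error } = await supabase
    .from('articles')
    .upsert(post, { onConflict: 'link' })
    .select()
  if (error) throw createError({ statusCode: 500, statusMessage: error.message })
  return data
})
```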

DB Structure:

  1. Articles Table

    • id: Bigint, Primary Key
    • created_at: Timestamp
    • updated_at: Timestamp
    • title: Text
    • link: Text, URL
    • category_id: Integer, Foreign Key to Categories table
    • original: JSONB (to store the 'original' object with title and body)
    • summary: JSONB (to store summaries with 'beginner', 'intermediate', 'expert' levels)
    • author: JSONB (to store author details as a JSON object)
  2. Categories Table

    • id: Integer, Primary Key
    • name: Text (categories are defined in server/utils/openai/categories.json)
  3. Tags Table

    • id: Integer, Primary Key
    • name: Text (tags are defined in server/utils/openai/tags.json)
  4. ArticleTags Table (many-to-many relationship between Articles and Tags)

    • article_id: Bigint, Foreign Key to Articles table
    • tag_id: Integer, Foreign Key to Tags table.
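
With these four tables, the newsfeed could fetch articles plus their category and tags in one query. A sketch, assuming supabase-js v2 and that PostgREST resolves the many-to-many through the ArticleTags join table:

```ts
// Sketch: fetch newsfeed articles with embedded category and tags.
import { createClient } from '@supabase/supabase-js'

const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_KEY!)

const { data: articles, error } = await supabase
  .from('articles')
  .select('id, title, link, summary, author, categories(name), tags(name)')
  .order('created_at', { ascending: false })
  .limit(20)
```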
aayu5hgit commented 9 months ago

@Drew-Macgibbon the DB structure will have 4 tables, each with their own attributes: Articles ... Categories ... Tags ... ArticleTags ...

Should we start creating them in Supabase, or initially work with a static JSON format until we develop a basic prototype?

aayu5hgit commented 9 months ago

@Drew-Macgibbon @JapneetRajput below is an explanation of my understanding of the working files in the server/utils/openai directory. Review it and correct me if I've missed something or if any point is invalid.

Understanding server/utils/openai

  1. /openaiClient.ts

    • Responsible for setting up and exporting an instance of the OpenAI API client.
    • It creates a configuration object with the API key obtained from the environment variables.
    • Exports an instance of the OpenAI API client configured with the API key.
  2. /callOpenAI.ts

    • Defines the callOpenAI() function, which is responsible for making asynchronous calls to the OpenAI API.
    • It takes a user prompt, a schema defining the function and its parameters, a system message, and optional configuration parameters.
    • The function constructs a set of messages and configuration options and then makes a call to OpenAI's chat completion endpoint using the OpenAI API client.
    • Returns the data (response) received from the OpenAI API.
  3. /generateSummary.ts

    • Focuses on generating summaries using the OpenAI API by utilizing the callOpenAI function.
    • It defines a schema for the summary generation function, including expected input parameters.
    • Uses the zod library for data validation, defining a validation schema for the expected output.
    • The generateSummary() function takes an input string, constructs a prompt for the OpenAI API, and calls the callOpenAI function.
    • It then parses and validates the response from the OpenAI API.
    • Returns the validated data, representing a summary of the input string tailored for different levels of understanding (beginner / intermediate / expert).
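
A minimal sketch of the flow described above, collapsing the three files into one for brevity (the model, prompt, and function-calling schema name are illustrative assumptions, not the repo's actual code):

```ts
import OpenAI from 'openai'
import { z } from 'zod'

// openaiClient.ts equivalent: client configured from environment variables.
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY })

// zod schema validating the expected output, as described above.
const SummarySchema = z.object({
  beginner: z.string(),
  intermediate: z.string(),
  expert: z.string(),
})

// Function-calling schema passed to the chat completion endpoint (name is hypothetical).
const summaryFunction = {
  name: 'save_summary',
  description: 'Summarise an article at three levels of understanding.',
  parameters: {
    type: 'object',
    properties: {
      beginner: { type: 'string' },
      intermediate: { type: 'string' },
      expert: { type: 'string' },
    },
    required: ['beginner', 'intermediate', 'expert'],
  },
}

export async function generateSummary(input: string) {
  const response = await openai.chat.completions.create({
    model: 'gpt-3.5-turbo', // illustrative choice
    messages: [
      { role: 'system', content: 'You summarise astronomy articles.' },
      { role: 'user', content: input },
    ],
    functions: [summaryFunction],
    function_call: { name: 'save_summary' },
  })
  const args = response.choices[0].message.function_call?.arguments ?? '{}'
  // Parse and validate the response before returning it.
  return SummarySchema.parse(JSON.parse(args))
}
```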
Drew-Macgibbon commented 9 months ago

> @Drew-Macgibbon the DB structure will have 4 tables, each with their own attributes: Articles ... Categories ... Tags ... ArticleTags ...
>
> Should we start creating them in Supabase, or initially work with a static JSON format until we develop a basic prototype?

@aayu5hgit I would do JSON first: focus on getting each step working, then proceed to storing in the DB.

Drew-Macgibbon commented 9 months ago

@aayu5hgit in regards to the OpenAI stuff, yes, that's accurate. They have significantly improved the ability to respond with JSON, so check the docs and make the appropriate changes to the summary function.
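
For reference, a sketch of what that JSON-mode change could look like, reusing the openai client and SummarySchema names from the sketch above (the model choice is illustrative; response_format requires a model that supports JSON mode and a prompt that mentions JSON):

```ts
// Sketch: the same summary call using OpenAI's JSON mode instead of function calling.
async function generateSummaryJsonMode(input: string) {
  const response = await openai.chat.completions.create({
    model: 'gpt-3.5-turbo-1106', // illustrative; JSON mode needs a model that supports it
    response_format: { type: 'json_object' },
    messages: [
      {
        role: 'system',
        // JSON mode requires the word "JSON" to appear somewhere in the prompt.
        content: 'Summarise the article as JSON with keys "beginner", "intermediate" and "expert".',
      },
      { role: 'user', content: input },
    ],
  })
  return SummarySchema.parse(JSON.parse(response.choices[0].message.content ?? '{}'))
}
```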

JapneetRajput commented 9 months ago

@Drew-Macgibbon @aayu5hgit

Some changes that we might have to make to incorporate the articles data.

We're getting the data in the format:

Articles Table

Here is the previous schema for reference:

• id: Bigint, Primary Key
• created_at: Timestamp
• updated_at: Timestamp
• title: Text
• link: Text, URL
• category_id: Integer, Foreign Key to Categories table
• original: JSONB (to store the 'original' object with title and body)
• summary: JSONB (to store summaries with 'beginner', 'intermediate', 'expert' levels)
• author: JSONB (to store author details as a JSON object)

Proposed changes:

This is based on the data we're scraping from space.com.

Drew-Macgibbon commented 9 months ago

@JapneetRajput sure, we will probably have to download and store the images in Supabase to stop our website from getting blacklisted by the origin sites for hitting their cache too often.

Scrape all image URLs into the JSON; we will just download/store the featured images.
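
A sketch of downloading a featured image and storing it in Supabase Storage, assuming supabase-js v2 and a hypothetical article-images bucket:

```ts
import { createClient } from '@supabase/supabase-js'

const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_SERVICE_KEY!)

// Hypothetical helper: fetch a featured image from the origin site once and
// store it in our own bucket so we serve it ourselves from then on.
async function storeFeaturedImage(imageUrl: string, articleId: number) {
  const res = await fetch(imageUrl)
  const blob = await res.blob()
  const path = `featured/${articleId}.jpg`
  const { error } = await supabase.storage
    .from('article-images')
    .upload(path, blob, { contentType: blob.type, upsert: true })
  if (error) throw error
  // Return the public URL to save alongside the article row.
  return supabase.storage.from('article-images').getPublicUrl(path).data.publicUrl
}
```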