incubrain / astrotribe

A global network of astronomers helping to inspire and educate the next generation.
https://astrotribe.vercel.app

feat: newsfeed from scraped data with chatGPT summaries #21

Open · Drew-Macgibbon opened this issue 1 year ago

Drew-Macgibbon commented 1 year ago

As I said, this needs to be something super simple to start; we can expand it later.

Working Files

High-Level Overview

  1. Manually scrape articles from ONE website in local dev environment
  2. Process into the required structure
  3. Store data in the database
  4. Create summaries via OpenAI
  5. Store summaries with the article
  6. Manually assign tags and category
  7. Display to users as a simple card
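
A rough sketch of how these steps could chain together (every helper name here is a hypothetical placeholder for the utils discussed later in this issue, not existing code):

```ts
// Hypothetical pipeline sketch; all helpers below are placeholders for the
// utils tracked in this issue (#91, #93, #97), not functions that exist yet.
declare function scrapeArticle(url: string): Promise<{ title: string; body: string }>
declare function processPost(raw: { title: string; body: string }): Record<string, unknown>
declare function generateSummary(body: string): Promise<Record<string, string>>
declare function upsertPost(post: Record<string, unknown>): Promise<void>

async function runPipeline(url: string) {
  const raw = await scrapeArticle(url)            // 1. scrape ONE site in local dev
  const post = processPost(raw)                   // 2. process into the required structure
  const summary = await generateSummary(raw.body) // 4. summaries via OpenAI
  await upsertPost({ ...post, summary })          // 3 + 5. store article + summaries in DB
  // 6. tags/category are assigned manually; 7. the newsfeed renders a simple card
}
```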

Essential Data

• Scraped Data: title / author name / author url / published date / updated date / original url / post body
• Automated Data: beginner / intermediate / expert summaries
• Manual Data: post category / tags
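
As a sketch, the unified post could be typed like this (field names are illustrative assumptions, not an agreed schema):

```ts
// Illustrative shape for a single unified post; names are assumptions, not final.
interface NewsfeedPost {
  // Scraped
  title: string
  authorName: string
  authorUrl: string
  publishedAt: string // ISO timestamp
  updatedAt: string   // ISO timestamp
  originalUrl: string
  body: string
  // Automated
  summary: { beginner: string; intermediate: string; expert: string }
  // Manual
  category: string
  tags: string[]
}
```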

Tasks To Consider

| Feature | Purpose | Difficulty | Approved | Completed |
| --- | --- | --- | --- | --- |
| Which Blog to Scrape | space.com | Medium | yes | yes |
| Use Browser Proxy to Avoid IP Blacklist #92 | Use Browser Proxy | Medium | yes | yes |
| What Data to Scrape/Store | | Low | yes | yes |
| DB Table Structure #94 | | Medium | yes | partial |
| Create OpenAI Summary | | Medium | yes | partial |
| Page: newsfeed page #75 | Newsfeed Page | Low | no | partial |
| UI: PostCard #76 | Post Component in UI | Low | no | partial |

Define Functionality

| Functionality | Purpose | Difficulty | Approved | Completed |
| --- | --- | --- | --- | --- |
| api: upsert post route #93 | Upsert Post Route in API | Medium | no | no |
| util: unify post data structure #91 | Unify Post Data Structure | Medium | no | no |
| util: summarise post function #97 | Summarise Post Function | Medium | no | no |
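
For the upsert route (#93), a minimal sketch of what a Nuxt 3 server route could look like, assuming supabase-js v2, Nitro's auto-imported defineEventHandler/readBody/createError, and a unique constraint on articles.link (the file path and logic are illustrative, not the final route):

```ts
// server/api/posts/upsert.post.ts (hypothetical path; a sketch, not the final route)
import { createClient } from '@supabase/supabase-js'

const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_KEY!)

export default defineEventHandler(async (event) => {
  const post = await readBody(event)
  // Upsert on the article link so re-scraping the same URL updates the existing row.
  const { data, error } = await supabase
    .from('articles')
    .upsert(post, { onConflict: 'link' })
    .select()
  if (error) throw createError({ statusCode: 500, statusMessage: error.message })
  return data
})
```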

DB Structure:

  1. Articles Table

    • id: Bigint, Primary Key
    • created_at: Timestamp
    • updated_at: Timestamp
    • title: Text
    • link: Text, URL
    • category_id: Integer, Foreign Key to Categories table
    • original: JSONB (to store the 'original' object with title and body)
    • summary: JSONB (to store summaries with 'beginner', 'intermediate', 'expert' levels)
    • author: JSONB (to store author details as a JSON object)
  2. Categories Table

    • id: Integer, Primary Key
    • name: Text (categories are defined in server/utils/openai/categories.json)
  3. Tags Table

    • id: Integer, Primary Key
    • name: Text (tags are defined in server/utils/openai/tags.json)
  4. ArticleTags Table (many-to-many relationship between Articles and Tags)

    • article_id: Bigint, Foreign Key to Articles table
    • tag_id: Integer, Foreign Key to Tags table.
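
With these four tables, the newsfeed could fetch articles plus their category and tags in one query. A sketch, assuming supabase-js v2 and that PostgREST resolves the many-to-many through the ArticleTags join table:

```ts
// Sketch: fetch newsfeed articles with embedded category and tags.
import { createClient } from '@supabase/supabase-js'

const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_KEY!)

const { data: articles, error } = await supabase
  .from('articles')
  .select('id, title, link, summary, author, categories(name), tags(name)')
  .order('created_at', { ascending: false })
  .limit(20)
```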
aayu5hgit commented 9 months ago

@Drew-Macgibbon the DB structure will have 4 tables, each with their own attributes: Articles ... Categories ... Tags ... ArticleTags ...

Should we start creating them in Supabase, or initially work with a static JSON format until we develop a basic prototype?

aayu5hgit commented 9 months ago

@Drew-Macgibbon @JapneetRajput below is an explanation of my understanding of the working files in the server/utils/openai directory. Review it and correct me if I've missed something or if any point is invalid.

Understanding server/utils/openai

  1. /openaiClient.ts

    • Responsible for setting up and exporting an instance of the OpenAI API client.
    • It creates a configuration object with the API key obtained from the environment variables.
    • Exports an instance of the OpenAI API client configured with the API key.
  2. /callOpenAI.ts

    • Defines the callOpenAI() function, which is responsible for making asynchronous calls to the OpenAI API.
    • It takes a user prompt, a schema defining the function and its parameters, a system message, and optional configuration parameters.
    • The function constructs a set of messages and configuration options and then makes a call to OpenAI's chat completion endpoint using the OpenAI API client.
    • Returns the data (response) received from the OpenAI API.
  3. /generateSummary.ts

    • Focuses on generating summaries using the OpenAI API by utilizing the callOpenAI function.
    • It defines a schema for the summary generation function, including expected input parameters.
    • Uses the zod library for data validation, defining a validation schema for the expected output.
    • The generateSummary() function takes an input string, constructs a prompt for the OpenAI API, and calls the callOpenAI function.
    • It then parses and validates the response from the OpenAI API.
    • Returns the validated data, representing a summary of the input string tailored for different levels of understanding (beginner / intermediate / expert).
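
A minimal sketch of the flow described above, collapsing the three files into one for brevity (the model, prompt, and function-calling schema name are illustrative assumptions, not the repo's actual code):

```ts
import OpenAI from 'openai'
import { z } from 'zod'

// openaiClient.ts equivalent: client configured from environment variables.
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY })

// zod schema validating the expected output, as described above.
const SummarySchema = z.object({
  beginner: z.string(),
  intermediate: z.string(),
  expert: z.string(),
})

// Function-calling schema passed to the chat completion endpoint (name is hypothetical).
const summaryFunction = {
  name: 'save_summary',
  description: 'Summarise an article at three levels of understanding.',
  parameters: {
    type: 'object',
    properties: {
      beginner: { type: 'string' },
      intermediate: { type: 'string' },
      expert: { type: 'string' },
    },
    required: ['beginner', 'intermediate', 'expert'],
  },
}

export async function generateSummary(input: string) {
  const response = await openai.chat.completions.create({
    model: 'gpt-3.5-turbo', // illustrative choice
    messages: [
      { role: 'system', content: 'You summarise astronomy articles.' },
      { role: 'user', content: input },
    ],
    functions: [summaryFunction],
    function_call: { name: 'save_summary' },
  })
  const args = response.choices[0].message.function_call?.arguments ?? '{}'
  // Parse and validate the response before returning it.
  return SummarySchema.parse(JSON.parse(args))
}
```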
Drew-Macgibbon commented 9 months ago

> @Drew-Macgibbon the DB structure will have 4 tables, each with their own attributes: Articles ... Categories ... Tags ... ArticleTags ...
>
> Should we start creating them in Supabase, or initially work with a static JSON format until we develop a basic prototype?

@aayu5hgit I would do JSON first: focus on getting each step working, then proceed to storing in the DB.

Drew-Macgibbon commented 9 months ago

@aayu5hgit in regards to the OpenAI stuff, yes, that's accurate. They have significantly improved the ability to respond with JSON, so check the docs and make the appropriate changes to the summary function.
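
For reference, a sketch of what that JSON-mode change could look like, reusing the openai client and SummarySchema names from the sketch above (the model choice is illustrative; response_format requires a model that supports JSON mode and a prompt that mentions JSON):

```ts
// Sketch: the same summary call using OpenAI's JSON mode instead of function calling.
async function generateSummaryJsonMode(input: string) {
  const response = await openai.chat.completions.create({
    model: 'gpt-3.5-turbo-1106', // illustrative; JSON mode needs a model that supports it
    response_format: { type: 'json_object' },
    messages: [
      {
        role: 'system',
        // JSON mode requires the word "JSON" to appear somewhere in the prompt.
        content: 'Summarise the article as JSON with keys "beginner", "intermediate" and "expert".',
      },
      { role: 'user', content: input },
    ],
  })
  return SummarySchema.parse(JSON.parse(response.choices[0].message.content ?? '{}'))
}
```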

JapneetRajput commented 9 months ago

@Drew-Macgibbon @aayu5hgit

Some changes that we might have to make to incorporate the articles data.

We're getting the data in the format:

Articles Table

Here is the previous schema for reference:

• id: Bigint, Primary Key
• created_at: Timestamp
• updated_at: Timestamp
• title: Text
• link: Text, URL
• category_id: Integer, Foreign Key to Categories table
• original: JSONB (to store the 'original' object with title and body)
• summary: JSONB (to store summaries with 'beginner', 'intermediate', 'expert' levels)
• author: JSONB (to store author details as a JSON object)

Proposed changes:

This is based on the data we're scraping from space.com.

Drew-Macgibbon commented 9 months ago

@JapneetRajput sure, we will probably have to download and store the images in Supabase to stop our website from getting blacklisted by the origin sites for hitting their cache too often.

Scrape all image URLs into the JSON; we will just download/store the featured images.
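
A sketch of downloading a featured image and storing it in Supabase Storage, assuming supabase-js v2 and a hypothetical article-images bucket:

```ts
import { createClient } from '@supabase/supabase-js'

const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_SERVICE_KEY!)

// Hypothetical helper: fetch a featured image from the origin site once and
// store it in our own bucket so we serve it ourselves from then on.
async function storeFeaturedImage(imageUrl: string, articleId: number) {
  const res = await fetch(imageUrl)
  const blob = await res.blob()
  const path = `featured/${articleId}.jpg`
  const { error } = await supabase.storage
    .from('article-images')
    .upload(path, blob, { contentType: blob.type, upsert: true })
  if (error) throw error
  // Return the public URL to save alongside the article row.
  return supabase.storage.from('article-images').getPublicUrl(path).data.publicUrl
}
```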