nileshtrivedi opened this issue 1 year ago
Here is how it could work:
This is a realistic approach until somebody invents a "GitHub for Datasets".
Hello, I am working on a university project and would like to try to solve this issue. Could you please assign it to me?
@Maria-Aidarus Done. DM me if you'd like to get familiar with the codebase. Can give you a walkthrough.
For potential contributors:
This requires an API to be created on the server. To keep the infrastructure minimal, we can implement it as a Netlify Function, which lets us use the full Node.js runtime. Cloudflare Workers is another option, but it is more complex since it is not a standard Node.js environment.
This will be implemented as an API that takes two parameters: a URL and an OpenAI API key.
First, it obtains the contents of the webpage. This can be done using a web-scraping service such as ScrapeNinja or Browserless.
These contents are simplified and sent to GPT to infer two values: the media type (e.g. whether the webpage represents a book, a video, a course, etc.) and the topics.
Another potential approach is to take a screenshot of the page and send it to the GPT-4 Vision model.
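The text-based path could be sketched as below. The `chat.completions.create` call is the official `openai` npm package's API, but the model name, prompt wording, and the shortened format/topic lists are illustrative assumptions (the real lists live in the repo):

```javascript
// Shortened stand-ins for src/formats.js and db/topics.json.
const FORMATS = ["book", "video", "course", "article"];
const TOPICS = ["Mathematics", "Programming", "Music"];

// Build a classification prompt from the simplified page text.
function buildPrompt(pageText) {
  return [
    "Classify the learning resource described by the text below.",
    `Reply as JSON: {"format": one of [${FORMATS.join(", ")}], "topics": a subset of [${TOPICS.join(", ")}]}`,
    "---",
    pageText.slice(0, 4000), // keep the prompt within the context window
  ].join("\n");
}

// Hedged sketch of the GPT call; not wired to a specific model choice.
async function classify(pageText, apiKey) {
  const OpenAI = require("openai");
  const client = new OpenAI({ apiKey });
  const res = await client.chat.completions.create({
    model: "gpt-4o-mini", // illustrative; any chat-capable model works
    messages: [{ role: "user", content: buildPrompt(pageText) }],
  });
  return JSON.parse(res.choices[0].message.content);
}
```

Constraining the reply to the known format/topic lists inside the prompt is what makes the output usable without free-text cleanup.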
The format must be one of these: https://github.com/learn-awesome/learndb/blob/main/src/formats.js
Topics can be one of these: https://github.com/learn-awesome/learndb/blob/main/db/topics.json
You can skip the other attributes for now. Just extracting these two attributes with high quality will be a good contribution.
There is some complexity involved in keeping the topic taxonomy clean. This may be achievable by some prompt engineering.
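Beyond prompt engineering, one cheap guard against taxonomy drift is to validate GPT's answer against the canonical lists server-side and drop anything unrecognized. A minimal sketch, with shortened stand-in lists (the real ones come from src/formats.js and db/topics.json):

```javascript
// Stand-ins for the canonical lists; load the real ones from the repo.
const KNOWN_FORMATS = ["book", "video", "course", "article"];
const KNOWN_TOPICS = ["Mathematics", "Programming", "Music"];

// Discard any format or topic GPT invented that is not in the taxonomy.
function sanitize(result) {
  const format = KNOWN_FORMATS.includes(result.format) ? result.format : null;
  const topics = (result.topics || []).filter((t) => KNOWN_TOPICS.includes(t));
  return { format, topics };
}
```

This keeps the taxonomy clean even when the model hallucinates a new topic, at the cost of silently dropping near-misses (a fuzzy-match step could be added later).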
Hello, I am working with Maria. Could you please assign it to me too?
Hello, could you assign this to me, please?
Hi, I'm working with Maria too. Can you assign me to this? Thanks.
Hello, can you please assign me as well?
Starting from nothing but a URL, we need tooling to automatically determine: