nileshtrivedi opened this issue 1 year ago
Here is how it could work:
This is a realistic approach until somebody invents a "GitHub for Datasets".
Hello, I am working on a university project and would like to try to solve this issue. Could you please assign it to me?
@Maria-Aidarus Done. DM me if you'd like to get familiar with the codebase. Can give you a walkthrough.
For potential contributors:
This requires an API to be created on the server. To keep the infrastructure minimal, we can implement it as a Netlify Function, which lets us use the full Node.js runtime. Cloudflare Workers is another option, but it is more complex since it is not a standard Node.js environment.
This will be implemented as an API that takes two parameters: a URL and an OpenAI API key.
First, it obtains the contents of the webpage. This can be done using a web-scraping service such as ScrapeNinja or Browserless.
These contents are simplified and sent to GPT to infer two values: the media type (e.g. whether the webpage represents a book, a video, a course, etc.) and the topics.
Another potential approach is to take a screenshot of the page and send it to the GPT-4 Vision model.
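The text-based path could be sketched as below. The `chat.completions.create` call is the official `openai` npm package's API, but the model name, prompt wording, and the shortened format/topic lists are illustrative assumptions (the real lists live in the repo):

```javascript
// Shortened stand-ins for src/formats.js and db/topics.json.
const FORMATS = ["book", "video", "course", "article"];
const TOPICS = ["Mathematics", "Programming", "Music"];

// Build a classification prompt from the simplified page text.
function buildPrompt(pageText) {
  return [
    "Classify the learning resource described by the text below.",
    `Reply as JSON: {"format": one of [${FORMATS.join(", ")}], "topics": a subset of [${TOPICS.join(", ")}]}`,
    "---",
    pageText.slice(0, 4000), // keep the prompt within the context window
  ].join("\n");
}

// Hedged sketch of the GPT call; not wired to a specific model choice.
async function classify(pageText, apiKey) {
  const OpenAI = require("openai");
  const client = new OpenAI({ apiKey });
  const res = await client.chat.completions.create({
    model: "gpt-4o-mini", // illustrative; any chat-capable model works
    messages: [{ role: "user", content: buildPrompt(pageText) }],
  });
  return JSON.parse(res.choices[0].message.content);
}
```

Constraining the reply to the known format/topic lists inside the prompt is what makes the output usable without free-text cleanup.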
The format must be one of these: https://github.com/learn-awesome/learndb/blob/main/src/formats.js
Topics can be one of these: https://github.com/learn-awesome/learndb/blob/main/db/topics.json
You can skip the other attributes for now. Just extracting these two attributes with high quality will be a good contribution.
There is some complexity involved in keeping the topic taxonomy clean. This may be achievable by some prompt engineering.
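Beyond prompt engineering, one cheap guard against taxonomy drift is to validate GPT's answer against the canonical lists server-side and drop anything unrecognized. A minimal sketch, with shortened stand-in lists (the real ones come from src/formats.js and db/topics.json):

```javascript
// Stand-ins for the canonical lists; load the real ones from the repo.
const KNOWN_FORMATS = ["book", "video", "course", "article"];
const KNOWN_TOPICS = ["Mathematics", "Programming", "Music"];

// Discard any format or topic GPT invented that is not in the taxonomy.
function sanitize(result) {
  const format = KNOWN_FORMATS.includes(result.format) ? result.format : null;
  const topics = (result.topics || []).filter((t) => KNOWN_TOPICS.includes(t));
  return { format, topics };
}
```

This keeps the taxonomy clean even when the model hallucinates a new topic, at the cost of silently dropping near-misses (a fuzzy-match step could be added later).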
Hello, I am working with Maria. Could you please assign it to me too?
Hello, could you assign this to me, please?
Hi, I'm working with Maria too. Can you assign me to this? Thanks.
Hello, can you please assign me as well?
Starting from nothing but a URL, we need tooling to automatically determine: