Open ediazgallego opened 1 month ago
I think the taxonomy guys are now accepting knowledge sources that are not from Wikipedia https://github.com/instructlab/taxonomy/blob/main/CONTRIBUTING.md.
We probably need to think of a more generic system in the long term, i.e. how can we extract title, revision... from any source of information?
@vishnoianil @aevo98765 After looking at @mairin's utility scripts, it's clear we need to use Wikipedia's APIs to retrieve summary data. There are wrapper packages that simplify the use of the Wikipedia API, but one thing I think we need to consider is how to handle API calls while a user is entering a URL.
I envision a behavior similar to a search field:
This approach could potentially create significant load on the UI. To mitigate this, we should consider implementing one of the following:
I believe these measures would help us control the frequency of data fetches for auto-populating fields.
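As an illustration of controlling fetch frequency while a user types, here is a minimal debounce sketch. It is not necessarily one of the options discussed in this thread; the names `debounce`, `Scheduler`, and `timeoutScheduler` are hypothetical, and the scheduler is injectable only so the behavior can be checked without real timers.

```typescript
// Hypothetical helper types: a scheduler runs `fn` after `ms` and returns a cancel function.
type Cancel = () => void;
type Scheduler = (fn: () => void, ms: number) => Cancel;

// Default scheduler backed by setTimeout/clearTimeout.
const timeoutScheduler: Scheduler = (fn, ms) => {
  const id = setTimeout(fn, ms);
  return () => clearTimeout(id);
};

// Returns a wrapper that delays `fn` until `waitMs` has passed without another call.
// Each new call cancels the previously scheduled one, so only the last call runs.
function debounce<A extends unknown[]>(
  fn: (...args: A) => void,
  waitMs: number,
  schedule: Scheduler = timeoutScheduler,
): (...args: A) => void {
  let cancel: Cancel | null = null;
  return (...args: A) => {
    if (cancel) cancel();
    cancel = schedule(() => {
      cancel = null;
      fn(...args);
    }, waitMs);
  };
}
```

With something like this, the URL input's change handler could call a debounced fetch function, so a burst of keystrokes results in a single request once typing pauses.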
Your thoughts and expertise on these would be greatly appreciated.
taxonomy guys are now accepting knowledge sources that are not from Wikipedia
I think apart from Wikipedia, any knowledge that has an associated markdown file and its source is an acceptable contribution at this point in time. @juliadenham @jjasghar please correct me if I am wrong.
+1 on the generic system, although at this point we don't know all the sources (and we're not sure whether the internet in general can be a source). So let's start with Wikipedia and evolve it into a general system as we learn more about sources.
- button that will make the user to click on it after entering the URL,
@ediazgallego Thanks for sharing your thoughts. Discussions like these will be helpful for other contributors as well. Appreciate it.
How about we hook the data fetch to the "onBlur" event? This event triggers only when the input loses focus, for example when the user clicks out of the input box. The fetch would be preceded by URL validation (it should be a valid URL). I'm not sure whether Wikipedia requires an API key to access a page's summary data; if not, it would be good to implement the fetch on the client side rather than the server side, so it runs in the client's browser and we avoid any possible scaling issue on the server side.
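The validate-then-fetch-on-blur idea could be sketched roughly as follows. This is a sketch under assumptions, not a settled design: `parseWikipediaUrl`, `summaryEndpoint`, `handleUrlBlur`, and `FetchJson` are hypothetical names, the `/api/rest_v1/page/summary/{title}` path is Wikipedia's public REST summary endpoint as I understand it, and whether it can be called from the browser without a key should be confirmed before relying on it.

```typescript
// Hypothetical shape for a validated Wikipedia article URL.
interface WikipediaPageRef {
  host: string;  // e.g. "en.wikipedia.org"
  title: string; // e.g. "Albert_Einstein"
}

// Returns a page reference if `raw` is a well-formed Wikipedia article URL, else null.
// Assumption: only "<lang>.wikipedia.org/wiki/<title>" URLs are accepted (no mobile hosts).
function parseWikipediaUrl(raw: string): WikipediaPageRef | null {
  let url: URL;
  try {
    url = new URL(raw.trim());
  } catch {
    return null; // not a valid URL at all
  }
  if (url.protocol !== "https:" && url.protocol !== "http:") return null;
  if (!/^[a-z0-9-]+\.wikipedia\.org$/.test(url.hostname)) return null;
  const encodedTitle = url.pathname.match(/^\/wiki\/(.+)$/)?.[1];
  if (!encodedTitle) return null;
  try {
    return { host: url.hostname, title: decodeURIComponent(encodedTitle) };
  } catch {
    return null; // malformed percent-encoding in the title
  }
}

// Builds the summary endpoint URL for a page (endpoint path is an assumption to verify).
function summaryEndpoint(ref: WikipediaPageRef): string {
  return `https://${ref.host}/api/rest_v1/page/summary/${encodeURIComponent(ref.title)}`;
}

// The fetcher is injected so this logic stays independent of the runtime.
type FetchJson = (url: string) => Promise<unknown>;

// Intended to be called from the input's blur handler: validates first,
// and only performs the network call when the URL looks like a Wikipedia article.
async function handleUrlBlur(rawValue: string, fetchJson: FetchJson): Promise<unknown> {
  const ref = parseWikipediaUrl(rawValue);
  if (ref === null) return null; // invalid or non-Wikipedia URL: skip the fetch
  return fetchJson(summaryEndpoint(ref));
}
```

In the browser, `fetchJson` could be something like `(u) => fetch(u).then((r) => r.json())`, with `handleUrlBlur` called from the URL field's blur handler. Because validation runs before any request, malformed input never reaches the network.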
Why
UX recommendations
Contextual Information
From @vishnoianil: Knowledge submissions can be sourced from various targets. A simple example would be Wikipedia. If the user adds the URL of a Wikipedia page, we should automatically populate the other fields (title, revision, license, author) from that page.
At this point we can target Wikipedia, because the upstream taxonomy repo is only accepting knowledge contributions based on Wikipedia. In the future we will add support for more sources of knowledge contribution, and the extraction process for attribution information can be very specific to each target as well.
You can follow these scripts https://github.com/mairin/instructlab-knowledge-utils?tab=readme-ov-file#1-%EF%B8%8F-wikipedia-attribution-genpy to determine how we can populate this information from Wikipedia. Big thanks to @mairin for writing these utilities.
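Since extraction may differ per source, one possible shape is a small registry of per-source extractors, starting with Wikipedia only. The sketch below is illustrative and not taken from @mairin's scripts: `Attribution`, `SourceExtractor`, `wikipediaExtractor`, and `extractAttribution` are hypothetical names, the payload is assumed to look like a Wikipedia summary response with `title` and `revision` string fields, and the license and author values are assumptions to confirm against the taxonomy contribution guidelines.

```typescript
// Hypothetical attribution record matching the fields named above.
interface Attribution {
  title: string;
  link: string;
  revision: string;
  license: string;
  author: string;
}

// One extractor per knowledge source; each knows which URLs it handles
// and how to map that source's payload to attribution fields.
interface SourceExtractor {
  matches(url: URL): boolean;
  fromPayload(url: URL, payload: Record<string, unknown>): Attribution;
}

// Reads a string field from an untyped payload, defaulting to "".
function stringField(payload: Record<string, unknown>, key: string): string {
  const value = payload[key];
  return typeof value === "string" ? value : "";
}

const wikipediaExtractor: SourceExtractor = {
  matches: (url) => url.hostname.endsWith(".wikipedia.org"),
  fromPayload: (url, payload) => ({
    title: stringField(payload, "title"),
    link: url.toString(),
    revision: stringField(payload, "revision"),
    license: "CC-BY-SA-4.0",     // assumption: Wikipedia's text license, to be confirmed
    author: "Wikipedia Authors", // assumption: placeholder author string
  }),
};

// New sources would be added here as they become accepted upstream.
const extractors: SourceExtractor[] = [wikipediaExtractor];

function toUrl(raw: string): URL | null {
  try {
    return new URL(raw);
  } catch {
    return null;
  }
}

// Returns attribution for a supported source, or null when the URL is
// invalid or no extractor handles it (the form fields stay manual then).
function extractAttribution(
  rawUrl: string,
  payload: Record<string, unknown>,
): Attribution | null {
  const url = toUrl(rawUrl);
  if (url === null) return null;
  const extractor = extractors.find((e) => e.matches(url));
  return extractor ? extractor.fromPayload(url, payload) : null;
}
```

Keeping the source-specific mapping behind one interface means the form only needs to call `extractAttribution` and fall back to manual entry when it returns null.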