Auto-Populate remaining fields in Attribution Information form.

ediazgallego commented 1 month ago

Why

After typing the URL in the attribution information form, we want to auto-populate the remaining fields.

UX recommendations

Implement logic to validate if the URL can be used to populate the remaining fields.

Contextual Information

From @vishnoianil: Knowledge submissions can be sourced from various targets. A simple example is Wikipedia: if the user adds the URL of a Wikipedia page, we should automatically populate the other fields (title, revision, license, author) from that page.

At this point we can target Wikipedia, because the upstream taxonomy repo only accepts knowledge contributions based on Wikipedia. In the future we will add support for more sources of knowledge contributions, and the extraction process for attribution information may be very specific to each target.

You can follow this script https://github.com/mairin/instructlab-knowledge-utils?tab=readme-ov-file#1-%EF%B8%8F-wikipedia-attribution-genpy to see how we can populate this information from Wikipedia. Big thanks to @mairin for writing these utilities.
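
As a starting point for the Wikipedia-only case, the first step is deciding whether a typed URL points at a Wikipedia article and, if so, which page to look up. The sketch below is one possible way to do that. The names `WikipediaTarget`, `parseWikipediaUrl`, and `summaryEndpoint` are hypothetical and not taken from the linked scripts, and it assumes the lookup would go through the Wikimedia REST page summary endpoint rather than whatever the scripts use.

```typescript
// Hypothetical helpers: names and return shape are illustrative only.
interface WikipediaTarget {
  lang: string; // language subdomain, e.g. "en"
  title: string; // decoded page title, e.g. "Phoenix_(constellation)"
}

// Returns null for anything that is not an http(s) Wikipedia article URL.
function parseWikipediaUrl(input: string): WikipediaTarget | null {
  let url: URL;
  try {
    url = new URL(input.trim());
  } catch {
    return null; // not a syntactically valid absolute URL
  }
  if (url.protocol !== "https:" && url.protocol !== "http:") return null;

  // Accept "<lang>.wikipedia.org" and the mobile "<lang>.m.wikipedia.org".
  const host = /^([a-z-]+)\.(?:m\.)?wikipedia\.org$/.exec(url.hostname);
  if (!host || host[1] === "www") return null;

  const prefix = "/wiki/";
  if (!url.pathname.startsWith(prefix)) return null;
  const rawTitle = url.pathname.slice(prefix.length);
  if (rawTitle === "") return null;

  try {
    return { lang: host[1], title: decodeURIComponent(rawTitle) };
  } catch {
    return null; // malformed percent-encoding in the title
  }
}

// Assumption: the REST summary endpoint is the lookup we want to call.
function summaryEndpoint(target: WikipediaTarget): string {
  const title = encodeURIComponent(target.title);
  return `https://${target.lang}.wikipedia.org/api/rest_v1/page/summary/${title}`;
}
```

Returning `null` for non-Wikipedia URLs keeps the door open for other sources later: each future source could get its own parser with the same shape.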

aevo98765 commented 1 month ago

I think the taxonomy guys are now accepting knowledge sources that are not from Wikipedia https://github.com/instructlab/taxonomy/blob/main/CONTRIBUTING.md.

We probably need to think of a more generic system in the long term, i.e. how can we extract title, revision... from any source information?

ediazgallego commented 1 month ago

@vishnoianil @aevo98765 After looking at @mairin's utility scripts, it's clear we need to use Wikipedia's APIs to retrieve summary data. There are wrapper packages that simplify the use of the Wikipedia API, but one thing I think we need to consider is how to handle API calls while a user is entering a URL.

First Assumption Approach

I envision a behavior similar to a search field:

The request is triggered as the user types or when the input reaches a certain length.
Results are fetched either from cache or the API.
The UI is re-rendered with the results.
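
A minimal sketch of the search-field behavior above, assuming a debounce on the input plus an in-memory cache keyed by URL. `debounce` and `createCachedLookup` are hypothetical helper names, and the fetcher is passed in so the sketch does not commit to any particular API client.

```typescript
// Delay calls until the user has stopped typing for `waitMs` milliseconds.
function debounce<A extends unknown[]>(
  fn: (...args: A) => void,
  waitMs: number,
): (...args: A) => void {
  let timer: ReturnType<typeof setTimeout> | undefined;
  return (...args: A) => {
    if (timer !== undefined) clearTimeout(timer);
    timer = setTimeout(() => fn(...args), waitMs);
  };
}

// Wrap a fetcher so repeated lookups of the same key reuse one request.
function createCachedLookup<T>(
  fetcher: (key: string) => Promise<T>,
): (key: string) => Promise<T> {
  const cache = new Map<string, Promise<T>>();
  return (key) => {
    const hit = cache.get(key);
    if (hit) return hit;
    // Drop failed requests from the cache so a later attempt can retry.
    const pending = fetcher(key).catch((err) => {
      cache.delete(key);
      throw err;
    });
    cache.set(key, pending);
    return pending;
  };
}
```

Caching the promise rather than the resolved value means two lookups issued while the first is still in flight also share a single request.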

Potential Issues

This approach could potentially create significant load on the UI. To mitigate this, we should consider implementing one of the following:

A validation button that the user clicks after entering the URL. The click event validates the URL and, if everything looks correct, we fetch the data and populate the remaining fields with whatever data is available.
URL verification logic. Instead of a button, this would behave more like form validation running in the background: once validation completes and the URL is verified, it also fetches the data and populates the remaining fields.
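
Both options end with the same step: populating the remaining fields with whatever data is available. A minimal sketch of that step follows, assuming the form holds the four fields named earlier in this thread as plain strings and that text the user has already typed should never be overwritten. `AttributionFields` and `fillEmptyFields` are hypothetical names, not the actual form state in the UI.

```typescript
// Hypothetical form shape based on the fields named in this thread.
interface AttributionFields {
  title: string;
  revision: string;
  license: string;
  author: string;
}

// Fill only the fields the user has left blank; never overwrite typed input.
function fillEmptyFields(
  current: AttributionFields,
  fetched: Partial<AttributionFields>,
): AttributionFields {
  const result: AttributionFields = { ...current };
  for (const key of Object.keys(result) as (keyof AttributionFields)[]) {
    const incoming = fetched[key];
    const isBlank = result[key].trim() === "";
    if (isBlank && incoming !== undefined && incoming.trim() !== "") {
      result[key] = incoming;
    }
  }
  return result;
}
```

Because the function returns a new object, it can be used the same way from a button click handler or from background validation.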

I believe these measures would help us control the frequency of data fetches for auto-populating fields.

Questions for Discussion

Which approach do you think would be more user-friendly?
Are there any performance concerns we should address?

Your thoughts and expertise on these would be greatly appreciated.

vishnoianil commented 1 month ago

taxonomy guys are now accepting knowledge sources that are not from Wikipedia

I think that, apart from Wikipedia, any knowledge that has an associated markdown file and its source is an acceptable contribution at this point. @juliadenham @jjasghar please correct me if I am wrong.

+1 on the generic system, although at this point we don't know all the sources (and we're not sure whether the internet in general can be a source). So let's start with Wikipedia and evolve it into a general system as we learn more about sources.

vishnoianil commented 1 month ago

button that will make the user to click on it after entering the URL,

@ediazgallego Thanks for sharing your thoughts. Discussions like these will be helpful for other contributors as well. Appreciate it.

How about we hook the data fetch to the "OnBlur" event? This event triggers only when the input box loses focus, for example when the user clicks out of it. It would be followed by URL validation (it should be a valid URL). I'm not sure whether Wikipedia requires an api_key to access the summary data of a page. If not, it would be good to implement the fetch on the client side rather than on the server side, so it runs in the client's browser and we avoid any possible scale issues on the server side.
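
A rough sketch of that blur-driven flow, under several assumptions that should be checked before relying on it: that the Wikimedia REST page summary endpoint can be read anonymously from the browser without an API key, and that its JSON response carries `title` and `revision` fields. `createBlurHandler`, `FetchLike`, `Prefill`, and `BlurResult` are hypothetical names. The fetch function is injected so the same code can run in the browser or against a stub.

```typescript
// Minimal structural type so a stub can stand in for the browser's fetch.
type FetchLike = (url: string) => Promise<{
  ok: boolean;
  json(): Promise<unknown>;
}>;

interface Prefill {
  title?: string;
  revision?: string;
}

type BlurResult =
  | { status: "skipped" } // empty, or unchanged since the last blur
  | { status: "invalid" } // not recognised as a Wikipedia article URL
  | { status: "error" } // request failed or returned a non-2xx status
  | { status: "ok"; prefill: Prefill };

function createBlurHandler(
  fetchImpl: FetchLike,
): (value: string) => Promise<BlurResult> {
  let lastValue = "";
  return async (value) => {
    const trimmed = value.trim();
    if (trimmed === "" || trimmed === lastValue) return { status: "skipped" };
    lastValue = trimmed;

    const match = /^https?:\/\/([a-z-]+)\.wikipedia\.org\/wiki\/([^?#]+)/i.exec(trimmed);
    if (!match) return { status: "invalid" };
    let title: string;
    try {
      title = decodeURIComponent(match[2]);
    } catch {
      return { status: "invalid" }; // malformed percent-encoding
    }
    // Assumed endpoint shape; confirm against the Wikipedia API docs.
    const endpoint =
      `https://${match[1].toLowerCase()}.wikipedia.org` +
      `/api/rest_v1/page/summary/${encodeURIComponent(title)}`;

    try {
      const response = await fetchImpl(endpoint);
      if (!response.ok) {
        lastValue = ""; // allow a retry on the next blur
        return { status: "error" };
      }
      const body = (await response.json()) as Record<string, unknown>;
      const rev = body["revision"];
      const pageTitle = body["title"];
      return {
        status: "ok",
        prefill: {
          title: typeof pageTitle === "string" ? pageTitle : undefined,
          revision:
            typeof rev === "string" || typeof rev === "number" ? String(rev) : undefined,
        },
      };
    } catch {
      lastValue = ""; // allow a retry on the next blur
      return { status: "error" };
    }
  };
}
```

In the browser this could be wired up as `const onUrlBlur = createBlurHandler((u) => fetch(u));` and called from the input's blur handler. Skipping unchanged values keeps the request count at one per distinct URL rather than one per blur.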

instructlab / ui