Design backend process to retrieve meta data

yoavartzi commented 2 months ago

We need to automatically retrieve metadata from sources that expose an API for it. The design should be general, so that we can add such services relatively easily.

The service to call should be identified by the URL of the recommended paper. Given a URL that is not in our database, we check each registered service in registration order (e.g., url in service) and query the first one that returns true. The service might not return anything (e.g., if the paper ID is invalid); in that case, we continue to the remaining services. If we do get a response, we use this data, and it is not editable. If no service returns a hit, we just let the user fill in the data.
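A minimal sketch of this lookup order, under stated assumptions: `PaperRecord` is reduced to a single hypothetical `title` field, `lookup` and `FakeService` are illustrative names that do not exist in the codebase, and the "not in our database" check is assumed to happen before this function is called.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Protocol


@dataclass
class PaperRecord:
    # Hypothetical minimal record; the real fields are not specified in this issue.
    title: str


class PaperService(Protocol):
    def get_meta_data(self, url: str) -> Optional[PaperRecord]: ...
    def __contains__(self, url: str) -> bool: ...


def lookup(url: str, services: Iterable[PaperService]) -> Optional[PaperRecord]:
    """Return metadata from the first matching service that has a hit, else None."""
    for service in services:  # iterate in registration order
        if url in service:  # does this service claim the URL?
            record = service.get_meta_data(url)
            if record is not None:
                return record  # caller treats this data as non-editable
            # The service matched but returned nothing: try the remaining ones.
    return None  # no hit: the user fills in the data manually


class FakeService:
    """Stand-in service for illustration only (not a real integration)."""

    def __init__(self, domain: str, papers: dict):
        self.domain = domain
        self.papers = papers

    def __contains__(self, url: str) -> bool:
        return self.domain in url

    def get_meta_data(self, url: str) -> Optional[PaperRecord]:
        return self.papers.get(url)


services = [
    FakeService("example.org", {}),  # claims the URL but has no record
    FakeService("example.org", {"https://example.org/p/1": PaperRecord("A paper")}),
]
found = lookup("https://example.org/p/1", services)
missing = lookup("https://other.example/p/9", services)
```

Here `found` comes from the second service, because the first one claims the URL but returns nothing, and `missing` is `None`, which is the case where the user enters the data by hand.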

Each service also defines whether it is "verified" or not (see #33 and #252).

Once an existing service is identified, we use its service-specific code to get the paper information, and add it to our database.

This can be done by defining an abstract class. Whenever we add a service, we just need to implement this class, and register it with the backend.

Very rough design:

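from __future__ import annotations  # PaperRecord is defined elsewhere
from abc import ABC, abstractmethod

class PaperService(ABC):
    @abstractmethod
    def get_meta_data(self, url: str) -> PaperRecord | None:
        """Return the paper's metadata, or None if the service has no match."""

    @abstractmethod
    def __contains__(self, url: str) -> bool:
        """Return True if this service handles the given URL."""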

yoavartzi commented 2 months ago

The paper record can also carry a new URL, because the service might normalize URLs in some way (see how we plan to normalize arXiv URLs in #256).

joannechen1223 commented 1 month ago

Design doc for digital library integration: https://www.notion.so/Digital-Libraries-Integration-arXiv-26e1673576a84541bb4d789fea2ff518

URL normalization can be done with regular-expression matching, at least for arXiv.
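A rough sketch of what regex-based normalization could look like for arXiv. The canonical form chosen here (`https://arxiv.org/abs/<id>` with the version suffix dropped) is an assumption for illustration, not necessarily what #256 or the design doc specifies, and `normalize_arxiv_url` is a hypothetical name. Only new-style IDs such as `2301.12345` are handled; old-style IDs and query strings are not.

```python
import re
from typing import Optional

# Matches abs/ and pdf/ links with a new-style arXiv ID.
_ARXIV_RE = re.compile(
    r"^https?://(?:www\.)?arxiv\.org/(?:abs|pdf)/"
    r"(?P<id>\d{4}\.\d{4,5})"  # new-style ID, e.g. 2301.12345
    r"(?:v\d+)?"               # optional version suffix, e.g. v2
    r"(?:\.pdf)?/?$"           # optional .pdf extension and trailing slash
)


def normalize_arxiv_url(url: str) -> Optional[str]:
    """Map an arXiv abs/pdf URL to one canonical form, or None if it does not match."""
    match = _ARXIV_RE.match(url.strip())
    if match is None:
        return None
    # Assumed canonical form: abstract page, no version suffix.
    return f"https://arxiv.org/abs/{match.group('id')}"
```

With this, `https://arxiv.org/pdf/2301.12345v2.pdf` and `http://arxiv.org/abs/2301.12345` both map to `https://arxiv.org/abs/2301.12345`, so the same paper recommended through different links resolves to one database entry.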

lil-lab / recnet

Design backend process to retrieve meta data #254