openeduhub / metalookup

Provide metadata about domains w.r.t. accessibility, licensing, ads, etc.
GNU General Public License v3.0

Provide clear Input structure for `/extract_meta` endpoint #84

Closed MRuecklCC closed 2 years ago

MRuecklCC commented 2 years ago

Currently, the input data model looks as follows:


```python
from typing import Optional

from pydantic import BaseModel, Field

# ListTags is defined elsewhere in the project.


class Input(BaseModel):
    url: str = Field(..., description="The base url of the scraped website.")
    html: Optional[str] = Field(
        default=None,
        description="Everything scraped from the website as text.",
    )
    headers: Optional[str] = Field(
        default=None,
        description="The response headers originally received together with the content.",
    )
    har: Optional[str] = Field(
        default=None, description="The har object, interpretable as JSON."
    )
    allow_list: Optional[ListTags] = Field(
        default=ListTags(),
        description="A list of key:bool pairs. "
        "Any metadata key == True will be extracted. "
        "If this list is not given, all values will be extracted.",
    )
    debug: Optional[bool] = Field(
        default=True,
        description="Developer flag to receive more information through the API.",
    )
    bypass_cache: Optional[bool] = Field(
        default=False,
        description="Bypass the cache (true) or not during evaluation.",
    )
```

From this it is not clear what is supposed to happen for the different combinations of provided values. For example, what should happen if both har and html are provided? Should we use the HTML stored inside the har, or the html value?

To resolve this, I propose to change the input model to accept either only the URL, or the URL plus a har. If the har is provided, the extraction would not issue a request to Splash to generate its own har.
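A reduced model along those lines might look as follows. This is only a sketch of the proposal, not a final design; the remaining flags (allow_list, debug, bypass_cache) are unaffected and omitted here for brevity:

```python
from typing import Optional

from pydantic import BaseModel, Field


class Input(BaseModel):
    url: str = Field(..., description="The base url of the scraped website.")
    har: Optional[str] = Field(
        default=None,
        description="A pre-recorded har object as JSON. If omitted, the service "
        "generates its own har by issuing a request to Splash.",
    )
    # allow_list, debug, and bypass_cache would stay as they are today.
```

With only these two fields there is exactly one source of truth for the page content: the har, whether supplied by the caller or generated internally.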

That is, I propose to drop the html and headers fields: their values will either be taken from the har delivered as part of the input, or from the response received from Splash if no har was provided.
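For illustration, recovering the HTML from a delivered har could look roughly like this. `html_from_har` is a hypothetical helper, not part of the current codebase; it assumes the standard HAR layout of `log -> entries[] -> request/response`:

```python
import json
from typing import Optional


def html_from_har(har: str, url: str) -> Optional[str]:
    """Return the response body recorded for `url`, or None if absent.

    Hypothetical sketch assuming the standard HAR structure.
    """
    entries = json.loads(har)["log"]["entries"]
    for entry in entries:
        if entry["request"]["url"] == url:
            # HAR stores the decoded body under response.content.text;
            # response.headers would yield the header fields analogously.
            return entry["response"]["content"].get("text")
    return None
```

The headers could be recovered the same way from `entry["response"]["headers"]`, which is why carrying them as separate input fields is redundant once a har is present.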