Currently, the input data model looks as follows:

class Input(BaseModel):
    url: str = Field(..., description="The base url of the scraped website.")
    html: Optional[str] = Field(
        default=None,
        description="Everything scraped from the website as text.",
    )
    headers: Optional[str] = Field(
        default=None,
        description="The response header originally received together with the content.",
    )
    har: Optional[str] = Field(
        default=None, description="The har object interpretable as json."
    )
    allow_list: Optional[ListTags] = Field(
        default=ListTags(),
        description="A list of key:bool pairs. "
        "Any metadata key == True will be extracted. "
        "If this list is not given, all values will be extracted.",
    )
    debug: Optional[bool] = Field(
        default=True,
        description="Developer flag to receive more information through API",
    )
    bypass_cache: Optional[bool] = Field(
        default=False,
        description="Bypass the cache (true) or not in evaluation.",
    )
From this, it is not clear what is supposed to happen for the different combinations of provided values. For example, what should happen if both har and html are provided? Should we use the HTML block from the har or the html value?
To resolve this, I propose to change the input model to accept either only the URL, or the URL plus har. If the har is provided, the extraction would not issue a request to Splash to generate its own har.
I.e., I propose to drop the html and headers fields, as they will either be taken from the har that is delivered as part of the input, or from the response to the request issued to Splash if no har was provided.
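
To make the proposal concrete, here is a minimal sketch of how the reduced input could be resolved. It uses only the standard library so that it runs on its own; in the service itself the input would presumably stay a pydantic model with the same two fields. The names ProposedInput, resolve_har, html_and_headers and the fetch_har callable (a stand-in for the request to Splash) are hypothetical and not part of the current code. The sketch also assumes that the first HAR entry is the main document, which may not hold when redirects are involved.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass
class ProposedInput:
    """Hypothetical reduced input: the URL, optionally with a HAR as a JSON string."""

    url: str
    har: Optional[str] = None


def resolve_har(data: ProposedInput, fetch_har: Callable[[str], str]) -> dict:
    """Return the parsed HAR, requesting one only when none was supplied.

    `fetch_har` is a placeholder for the request to Splash; its name and
    signature are assumptions made for this sketch.
    """
    raw = data.har if data.har is not None else fetch_har(data.url)
    return json.loads(raw)


def html_and_headers(har: dict) -> Tuple[str, Dict[str, str]]:
    """Read the HTML body and response headers from the first HAR entry.

    Assumption: the first entry is the main document.
    """
    response = har["log"]["entries"][0]["response"]
    html = response["content"].get("text", "")
    headers = {h["name"]: h["value"] for h in response["headers"]}
    return html, headers
```

With this shape there is only one source of truth for html and headers: the har, whether it came with the input or was generated by the request to Splash.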