scrapinghub / web-poet

Web scraping Page Objects core library
https://web-poet.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License

Add base_url to Selector and implement HttpRequest.from_form #205

Closed Gallaecio closed 5 months ago

Gallaecio commented 5 months ago

HttpRequest.from_form is based on Scrapy’s FormRequest.from_response(), but with a minimal API and without support for the click feature, which I don’t fully understand.
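
For context, here is a minimal usage sketch. The exact arguments of `HttpRequest.from_form()` are not spelled out in this thread, so the call below (a selected form element plus a mapping of extra field values) is an assumption, not the confirmed signature:

```python
# Hypothetical usage sketch only: the from_form() arguments shown here
# are assumptions, not the confirmed signature from this PR.
from web_poet import HttpRequest, HttpResponse


def build_search_request(response: HttpResponse) -> HttpRequest:
    # With base_url available on Selector, a relative form "action" can be
    # resolved against the response URL.
    form = response.xpath("//form")[0]
    # Mirrors Scrapy's FormRequest.from_response(), minus "click" handling.
    return HttpRequest.from_form(form, {"q": "web-poet"})
```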

codecov[bot] commented 5 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 98.69%. Comparing base (6bfb58f) to head (94edaed). Report is 16 commits behind head on master.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master     #205      +/-   ##
==========================================
+ Coverage   98.51%   98.69%   +0.18%
==========================================
  Files          32       32
  Lines        1479     1532      +53
==========================================
+ Hits         1457     1512      +55
+ Misses         22       20       -2
```

| [Files](https://app.codecov.io/gh/scrapinghub/web-poet/pull/205?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=scrapinghub) | Coverage Δ | |
|---|---|---|
| [web\_poet/mixins.py](https://app.codecov.io/gh/scrapinghub/web-poet/pull/205?src=pr&el=tree&filepath=web_poet%2Fmixins.py&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=scrapinghub#diff-d2ViX3BvZXQvbWl4aW5zLnB5) | `98.27% <100.00%> (-1.73%)` | :arrow_down: |
| [web\_poet/page\_inputs/http.py](https://app.codecov.io/gh/scrapinghub/web-poet/pull/205?src=pr&el=tree&filepath=web_poet%2Fpage_inputs%2Fhttp.py&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=scrapinghub#diff-d2ViX3BvZXQvcGFnZV9pbnB1dHMvaHR0cC5weQ==) | `100.00% <100.00%> (ø)` | |

... and [3 files with indirect coverage changes](https://app.codecov.io/gh/scrapinghub/web-poet/pull/205/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=scrapinghub)
kmike commented 5 months ago

Hey! I think that's definitely a feature we should have in web-poet 👍

How hard would it be to move the form submission code to a separate library, and then use it both in Scrapy and in web-poet, to avoid double-maintenance? Is it the "click" support which makes it hard?

Gallaecio commented 5 months ago

I did not even implement click support here, but I don’t imagine it would make moving the code to a separate library hard.

The main reason why I did not move this to w3lib is that the function builds a request. If we move it to w3lib, what should we return? A named tuple? What should the format of the headers be: a dict of values, a dict of lists, or a tuple of tuples?

kmike commented 5 months ago

> The main reason why I did not move this to w3lib

I was thinking about a separate library like itemloaders or itemadapter, though w3lib also works.

> If we move it to w3lib, what should we return? A named tuple? What should the format of the headers be: a dict of values, a dict of lists, or a tuple of tuples?

That's a good point. It should probably return a dataclass in this case. The headers should be either a list of tuples or a dict of lists; it probably doesn't matter much (as dicts preserve insertion order now). It doesn't have to be fancy like in web-poet: no need for case-insensitive header handling, etc.
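
A minimal sketch of such a framework-agnostic return type, with headers as a plain list of tuples; the class and field names here are made up for illustration:

```python
# Illustrative only: the name and fields are hypothetical. The idea is a
# plain dataclass that a shared form-submission helper could return, which
# Scrapy and web-poet would then convert into their own request classes.
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class FormSubmission:
    url: str
    method: str = "GET"
    headers: List[Tuple[str, str]] = field(default_factory=list)
    body: bytes = b""
```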

kmike commented 5 months ago

About click: as I understand it, it's needed when there are multiple submit (or other?) buttons on a form and the right one is not the first one (which Scrapy picks by itself). For example, if you use Formasaurus to submit a search form, you may want to use the detected "submit button", not the "reset/clear button" (see https://formasaurus.readthedocs.io/en/latest/usage.html#field-types), though this needs experiments.
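
For reference, this is how Scrapy's existing `clickdata` argument handles that today; the form field and attribute values below are illustrative:

```python
# clickdata is matched against the attributes of the form's clickable
# elements, so the named submit button is "clicked" (its name/value pair
# is included) instead of the first clickable element on the form.
from scrapy.http import FormRequest, TextResponse


def submit_search(response: TextResponse) -> FormRequest:
    return FormRequest.from_response(
        response,
        formdata={"q": "laptops"},
        clickdata={"name": "search"},  # illustrative attribute value
    )
```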

wRAR commented 5 months ago

It's too short for a separate new library IMO.

kmike commented 5 months ago

> It's too short for a separate new library IMO.

My 2c: it's not big here, but

  1. the code is rather complex,
  2. the full implementation is larger (https://github.com/scrapy/scrapy/blob/master/scrapy/http/request/form.py),
  3. we need to count the tests as well.

So, overall, it seems it'd get us to 500+ lines of non-obvious code, which should be enough to justify a library with a well-defined purpose: given an HTML form, return a request to submit it.
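
As a rough sketch of that contract (the name and parameter set are hypothetical, and the return type echoes the dataclass sketched above):

```python
# Hypothetical single entry point for such a library; the name and the
# parameters (extra data, click selection) are assumptions. "FormSubmission"
# refers to the dataclass sketched earlier in this thread.
from typing import Dict, Optional


def form_to_request(
    form_html: str,
    base_url: str,
    data: Optional[Dict[str, str]] = None,
    click: Optional[Dict[str, str]] = None,
) -> "FormSubmission":
    """Given an HTML <form>, return everything needed to submit it
    (method, URL, headers, body), resolving a relative "action"
    against base_url."""
    raise NotImplementedError  # sketch only
```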

Gallaecio commented 5 months ago

So, between a new w3lib.form module and a new (form2request?) package, any preference?

kmike commented 5 months ago

My preference (+0.5) is a new library in scrapy org.