scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License
51.16k stars 10.35k forks source link

[MRG+1] Support auth credentials in netloc for HTTP and FTP handlers #1466

Open umrashrf opened 8 years ago

umrashrf commented 8 years ago

Fixes #669

Solution provided by @dangra https://github.com/scrapy/scrapy/pull/670#issuecomment-38921769

I can add tests but should I add this dependency for FTP tests? https://github.com/giampaolo/pyftpdlib

Current FTP tests uses Twisted so I'm afraid they won't work for PY3 https://github.com/scrapy/scrapy/blob/master/tests/test_downloader_handlers.py#L523-L543

codecov-io commented 8 years ago

Codecov Report

Merging #1466 into master will decrease coverage by 1.29%. The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master    #1466     +/-   ##
=========================================
- Coverage   84.73%   83.43%   -1.3%     
=========================================
  Files         163      162      -1     
  Lines        9164     8760    -404     
  Branches     1362     1292     -70     
=========================================
- Hits         7765     7309    -456     
- Misses       1147     1201     +54     
+ Partials      252      250      -2
Impacted Files Coverage Δ
scrapy/downloadermiddlewares/httpauth.py 100% <100%> (ø) :arrow_up:
scrapy/downloadermiddlewares/auth.py 100% <100%> (ø)
scrapy/extensions/memusage.py 25% <0%> (-23.36%) :arrow_down:
scrapy/core/downloader/handlers/http.py 81.81% <0%> (-18.19%) :arrow_down:
scrapy/pipelines/images.py 79.68% <0%> (-10.54%) :arrow_down:
scrapy/resolver.py 80% <0%> (-9.48%) :arrow_down:
scrapy/pipelines/media.py 88.04% <0%> (-9.18%) :arrow_down:
scrapy/spidermiddlewares/referer.py 84.61% <0%> (-9.1%) :arrow_down:
scrapy/utils/log.py 81.01% <0%> (-7.5%) :arrow_down:
scrapy/mail.py 69.86% <0%> (-6.46%) :arrow_down:
... and 61 more

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update fad6b70...702cd57. Read the comment docs.

umrashrf commented 8 years ago

thanks for reviewing @dangra, I'll do the changes soon.

dangra commented 8 years ago

hey, thank you @umrashrf for taking time to push your changes after so long.

redapple commented 8 years ago

@umrashrf , did you get a chance to continue on this?

umrashrf commented 8 years ago

@redapple sorry for late reply, I just saw your comment. Looks like I'll be able to work on this.

umrashrf commented 8 years ago

@dangra @redapple do you like it?

nyov commented 7 years ago

MRG+2 anyone? :D

redapple commented 7 years ago

I think docstrings and docs can still be improved aas it says "HTTP basic auth" everywhere, while it also handles FTP credentials.

nyov commented 7 years ago

If it takes another year to get the feature merged with updated docs, I'd rather see it merged now and the docs fixed in another patch. Too many rotting PRs around here :( That assumes the code is good, of course.

redapple commented 7 years ago

I have other comments on the implementation. I am working on a PR based on this one, which I'll submit for review to @umrashrf and others

redapple commented 7 years ago

@umrashrf , @nyov , @dangra , @eliasdorneles here are my suggestions: https://github.com/umrashrf/scrapy/pull/1

redapple commented 7 years ago

@nyov , I also updated the docs in https://github.com/umrashrf/scrapy/pull/1

nyov commented 7 years ago

@redapple, That almost looks like something completely different to the previous patch ;D You're amazing. Now that I finally got my local testsuite working again, I can hopefully leave a review on this in a bit.

nyov commented 7 years ago

@umrashrf, will you merge this into your PR? It would be great to see this through, if possible.

A generic Auth Middleware improvement I can think of, would be to cache all auth's. Not like the current spider attributes (http_*) but as a dict, {_realm_: _auth_}. This way spider can remember multiple logins for multiple domains. If HTTP realm + domain are the same, the same credentials-set can be sent back. For FTP, it would be domain/IP since FTP doesn't usually have different user auth's for different paths/folders.

But that's just a thought for a later patch perhaps, I don't mean to delay here with feature requests :)

umrashrf commented 7 years ago

@nyov @redapple I am aware of the new developments :) I will go through your comments and work on it over this weekend.

nyov commented 7 years ago

Okay, @redapple 's changes got merged into this PR.

Does anyone have a chance to test this "in the wild"?

kmike commented 7 years ago

It'd be nice to unify username/password extraction and building of HTTP header; https://github.com/scrapy/scrapy/pull/2530 also does this, but using a different code path.

kmike commented 7 years ago

I'm late to the party, but what about renaming it to UrlAuthMiddleware?

  1. This middleware is not handling other HTTP Basic and FTP auth sources like spider attributes, request.meta keys, settings or env variables - hm, it handles spider attributes for http...;
  2. it also dosn't handle other auth types like form submitting or OAuth.

Update: I can see now why it isn't named UrlAuthMiddleware

kmike commented 7 years ago

+1 to merge once https://github.com/scrapy/scrapy/pull/1466#issuecomment-281724763 is addressed

umrashrf commented 7 years ago
  1. it also doesn't handle other auth types like form submitting or OAuth.

By form submitting, did you mean FormRequest? And what is OAuth?

kmike commented 7 years ago

@umrashrf yeah, I meant FormRequest or something similar to https://github.com/TeamHG-Memex/autologin-middleware, and https://github.com/scrapy/scrapy/issues/40. But don't worry about this comment, I don't have a better name anyways :)

umrashrf commented 7 years ago

@kmike I don't see a reason why FormRequest won't work. Can you please give an example where you think it fails to work?

Also if you can give an example of the autologin-middleware and OAuth as how you would expect it work with AuthMiddleware, that'd be great.

kmike commented 7 years ago

@umrashrf my comment was only about names. autologin-middleware will work with AuthMiddleware.

AuthMiddleware handles auth, but it doesn't handle all kinds of auth, only a couple of specific auth cases. AuthMiddleware name is too broad and generic to my taste. autologin-middleware is an example of middleware which handles a different kind of auth, which is even more common than HTTP Basic Auth - cookie-based auth (submit login form, keep cookies, login again in case of logouts).

redapple commented 7 years ago

So @kmike, do you recommend keep HttpAuth middleware with the changes for credentials in uri, maybe rename it to HttpBasicAuth, and have a seperate one for FtpAuth perhaps?

Le 8 mars 2017 15:03, "Mikhail Korobov" notifications@github.com a écrit :

@umrashrf https://github.com/umrashrf my comment was only about names. autologin-middleware will work with AuthMiddleware.

AuthMiddleware handles auth, but it doesn't handle all kinds of auth, only a couple of specific auth cases. AuthMiddleware name is too broad and generic to my taste. autologin-middleware is an example of middleware which handles a different kind of auth, which is even more common than HTTP Basic Auth - cookie-based auth (submit login form, keep cookies, login again in case of logouts).

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/scrapy/scrapy/pull/1466#issuecomment-285048082, or mute the thread https://github.com/notifications/unsubscribe-auth/AA2GGJLQ3bh4CUgFTcR-oF9EdFp_zeohks5rjrT8gaJpZM4F0iZU .

kmike commented 7 years ago

@redapple I haven't though about this option (use two middlewares), but now as you wrote it this sounds like a good idea, I like it.

umrashrf commented 7 years ago

How about BasicAuthMiddleware?

kmike commented 7 years ago

@umrashrf this name makes sense, but I also like @redapple's suggestion to create two middlewares: they have almost nothing in common.

redapple commented 7 years ago

For the record, the single middleware approach comes from @dangra 's comment here.

redapple commented 7 years ago

@umrashrf , @kmike, what do think of these options:

  1. create an UrlAuth middleware that 1) extracts user/pass from request urls, and 2) populate request.meta keys appropriately depending on scheme, 3) strip credentials from the urls
  2. leave HttpAuth middleware as-is (current master branch state)
  3. at HTTP downloader level, add Authorization header for HTTP (if missing), just like FTP downloader does

or,

  1. create an UrlAuth middleware (make it run before HttpAuth for outgoing requests)
  2. change HttpAuth to also handle http_user/pass from request meta

(/cc @dangra )

redapple commented 7 years ago

This apparently needs more discussion, so I'm moving this to milestone 1.5