
[GSoC 2019] Support for Different robots.txt Parsers #3656

Closed: whalebot-helmsman closed this issue 4 years ago

whalebot-helmsman commented 5 years ago

This issue is a single place for all students and mentors to discuss ideas and proposals for the "Support for Different robots.txt Parsers" GSoC project. First of all, every student involved should make a contribution to https://github.com/scrapy/scrapy. It does not have to be big, just enough to get your hands dirty and get accustomed to the processes and tools. The contribution should take the form of an open pull request that solves a problem not related to the robots.txt project. You can read the open issues or open PRs and choose one for yourself, or you can ask here and mentors and contributors will give some recommendations.

Problems with the current robots.txt implementation can be traced in the relevant issues:

Previous attempts to fix these issues can be seen in the relevant PRs:

Ask for more details in the comments.

arvand commented 5 years ago

Is there a channel where we can talk about the issues and ideas, or should we just comment here?

whalebot-helmsman commented 5 years ago

we just comment here?

I think it is great, because everyone can participate and the same questions don't get asked twice.

anubhavp28 commented 5 years ago

Hi @whalebot-helmsman, I would love to work on Scrapy under GSoC this year. I have made a couple of pull requests and am looking to contribute more to Scrapy before I start working on a proposal. Can you suggest some issues or enhancements that you would like to see fixed/implemented?

Also, can you suggest a few fully compliant robots.txt parsers to look at?

whalebot-helmsman commented 5 years ago

@anubhavp28 The current redirect middleware https://github.com/scrapy/scrapy/blob/master/scrapy/downloadermiddlewares/redirect.py does not preserve the redirect path. Sometimes there are several redirects for a single source URL; it would be great to preserve the redirect history somewhere in the request's meta. The most compliant robots.txt parser known to me is https://github.com/seomoz/reppy .

anubhavp28 commented 5 years ago

@whalebot-helmsman I guess it has already been implemented: https://github.com/scrapy/scrapy/blob/c72ab1d4ba5dad3c68b12c473fa55b7f1f144834/scrapy/downloadermiddlewares/redirect.py#L35
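
For reference, the linked line accumulates the intermediate URLs under the redirect_urls request meta key, so a spider can read the history along these lines (a minimal sketch; the spider itself is hypothetical):

    import scrapy

    class RedirectHistorySpider(scrapy.Spider):
        # Hypothetical spider, only to show reading the redirect history that
        # RedirectMiddleware stores under request.meta['redirect_urls'].
        name = 'redirect_history'
        start_urls = ['https://example.com/']

        def parse(self, response):
            history = response.request.meta.get('redirect_urls', [])
            self.logger.info('Reached %s via %s', response.url, history)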

arvand commented 5 years ago

@whalebot-helmsman I want to start working on an issue from next week too. Could you suggest a bug/feature I can begin working on? Thanks.

whalebot-helmsman commented 5 years ago

https://github.com/scrapy/scrapy/labels/help%20wanted and https://github.com/scrapy/scrapy/labels/good%20first%20issue are good labels for finding a first issue to work on.

whalebot-helmsman commented 5 years ago

Sorry @anubhavp28, for some unknown reason I decided this feature was not implemented. I will be more mindful next time.

maramsumanth commented 5 years ago

@whalebot-helmsman According to the ideas page: "An interface for robots.txt parsers that abstracts the existing parser, and permits a user to substitute a different parser."

So we need to code many parser implementations for Scrapy, so that the user can substitute one according to their need? Do we need many robots.txt middlewares to go with the parsers? I don't think there'd be room for several different built-in robots.txt implementations in Scrapy.

whalebot-helmsman commented 5 years ago

So we need to code many parser implementations for Scrapy, so that the user can substitute one according to their need?

I was thinking about providing a common interface and implementations for 2 existing parsers:

Writing a custom parser is a stretch goal. We will write a custom parser after we finish the common interface and the implementations for existing parsers.

maramsumanth commented 5 years ago

I have gone through seomoz/reppy and realised that it is better to write our own custom parser instead of depending on reppy; urllib.robotparser also has some issues.

In this context:-

I was thinking about providing a common interface and implementations for 2 existing parsers:

@whalebot-helmsman As you were saying about providing implementations for 2 existing parsers, are these 2 parsers: 1) the stdlib Python parser and the reppy parser, or 2) the stdlib Python parser and another, not yet specified, one?

Since we aren't relying on reppy, we can't implement support for it!

whalebot-helmsman commented 5 years ago

I have gone through seomoz/reppy and realised that it is better to write our own custom parser instead of depending on reppy

What problems did you find with the reppy parser?

maramsumanth commented 5 years ago

The only problem I found with reppy was its CPython dependency; things like wildcards and other features are handled well by reppy.

maramsumanth commented 5 years ago

@whalebot-helmsman, suppose we have the following directives in https://example.com/robots.txt:

User-agent: *
Allow: /page
Disallow: /*.html

Will https://example.com/page.html be crawled? What is the order of precedence for rules with wildcards? Should we follow the standard "first matching directive" rule here? Thanks :)
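
To make the question concrete: once the wildcard is expanded, both rules match the URL, so the answer depends entirely on the precedence policy. The snippet below only illustrates the matching itself; it is not any particular parser's implementation.

    import re

    def pattern_matches(pattern, path):
        # Translate robots.txt wildcards: '*' matches any sequence of
        # characters, '$' anchors the end of the path.
        regex = re.escape(pattern).replace(r'\*', '.*').replace(r'\$', '$')
        return re.match(regex, path) is not None

    print(pattern_matches('/page', '/page.html'))    # True
    print(pattern_matches('/*.html', '/page.html'))  # True, so both rules apply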

whalebot-helmsman commented 5 years ago

is better to write our own custom parser

There isn't enough time in a GSoC project to write a full robots.txt parser, I think. An interface and implementations for existing parsers are a safer bet.

maramsumanth commented 5 years ago

But you said:-

Writing a custom parser is a stretch goal

So, is improving robotparser OK? I am finding many bugs in the current parser and I want to fix them.

whalebot-helmsman commented 5 years ago

stretch goal

It is an additional goal, not the first one. We can improve robotparser after we finish the interface part, if there is time left.

maramsumanth commented 5 years ago

@whalebot-helmsman, what exactly does the interface do here, and what methods need to be implemented in the interface class? What are the inputs and outputs of this interface class? Can you give a detailed description? Thanks :)

Gallaecio commented 5 years ago

By ‘interface part’ I believe @whalebot-helmsman means ‘adding robotparser support to Scrapy’ in general; I don’t think he is speaking about a specific interface class.

whalebot-helmsman commented 5 years ago

We use the robotparser methods directly (https://github.com/scrapy/scrapy/blob/820adb69c0b5ff0080826a9c7af1151973307b45/scrapy/downloadermiddlewares/robotstxt.py#L92). It would be great to add an abstract interface here, provide interface implementations for the existing parsers, and use the most conformant of the supported parsers at runtime.
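
For illustration, such an interface could look roughly like the sketch below; the class and method names are placeholders, not an agreed design.

    from abc import ABCMeta, abstractmethod
    from urllib.robotparser import RobotFileParser

    class RobotParser(metaclass=ABCMeta):
        # Placeholder abstract interface; concrete subclasses would wrap
        # urllib.robotparser, reppy or any other supported parser.

        @abstractmethod
        def allowed(self, url, user_agent):
            """Return True if the given user agent may fetch the given URL."""

    class PythonRobotParser(RobotParser):
        # Adapter over the stdlib parser; robotstxt_body is assumed to be
        # already-decoded text.
        def __init__(self, robotstxt_body, robotstxt_url):
            self._parser = RobotFileParser(robotstxt_url)
            self._parser.parse(robotstxt_body.splitlines())

        def allowed(self, url, user_agent):
            return self._parser.can_fetch(user_agent, url)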

maramsumanth commented 5 years ago

@whalebot-helmsman So at runtime, should the user be prompted with a choice on the command line, so that they can select from the list of parsers? Or, another idea: we could add ROBOTSTXT_PARSER1_OBEY and ROBOTSTXT_PARSER2_OBEY variables in settings (defaulting to False), which the user can switch according to their need?

The most conformant parser I know of so far is reppy. Can we use the reppy parser in this project?

Gallaecio commented 5 years ago

Wouldn’t a ROBOTSTXT_PARSER setting suffice? One that lets you choose which parser to use, and uses the current parser by default.
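
As a rough illustration of that idea (the setting name and the default path are placeholders only), the robots.txt middleware could resolve the parser class with scrapy.utils.misc.load_object:

    from scrapy.utils.misc import load_object

    def get_parser_cls(settings):
        # Hypothetical helper inside the robots.txt middleware: read the
        # dotted path from the settings and import the class it points to.
        parser_path = settings.get(
            'ROBOTSTXT_PARSER', 'myproject.robotstxt.PythonRobotParser')
        return load_object(parser_path)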

anubhavp28 commented 5 years ago

Hey @whalebot-helmsman @Gallaecio, I have done some work on the interface part of my GSoC proposal (https://docs.google.com/document/d/1l740K3MXcJ7SJN2oN0MNitM_Udzz42_8uHLAgWbhXsg/edit?usp=sharing). I am looking for feedback on the approach I have used for creating the interface. I am also facing a couple of problems, listed below:

  1. I am having difficulty finding another popular Python-based robots.txt parser (other than reppy). Can you suggest some Python-based robots.txt parsers other than reppy? I have found a few parsers implemented in Go, C++ and Node, but I am finding it difficult to call those parsers from Python. For example, take the robots parser (golang) that I mentioned in the proposal along with reppy: I tried compiling it to a C shared library, hoping I would be able to call it from Python using ctypes. I don't have previous experience with this and I might be wrong, but I think that since the parser uses Go-specific types it cannot be compiled to a C shared library, and hence cannot be called from Python unless I rewrite parts of the parser with C-compatible types. Is there any other way of calling Go functions from Python?

  2. I have seen that a lot of directives have been added since the original 1994 spec and the 1997 draft of robots.txt. Hence, it can easily be the case that a parser does not support all of them. How should we handle the case of a parser not supporting a certain directive? For example, the Host directive may not be supported by all parsers; say the interface spec recommends having a preferred_host() function, then should calling preferred_host() raise some custom exception notifying the caller that the parser doesn't support it, or should it just act as if no preferred domain is specified with the Host directive in robots.txt? (A small sketch of the second option follows below.)
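
A minimal sketch of the second option (preferred_host() is the hypothetical method from the question above, not an existing API):

    class BaseRobotParser:
        # Sketch of option 2: directives that a concrete parser does not
        # support behave as if robots.txt did not specify them at all.
        def preferred_host(self):
            # Returning None means "no preferred host specified".
            return None

    class StdlibRobotParser(BaseRobotParser):
        # urllib.robotparser has no Host support, so it simply inherits the
        # default behaviour instead of raising an exception.
        pass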

Gallaecio commented 5 years ago

  1. I’ve only found http://nikitathespider.com/python/rerp/

anubhavp28 commented 5 years ago

Quoting its documentation:

The module has two classes: RobotExclusionRulesParser and RobotFileParserLookalike. The latter offers all the features of the former and also bolts on an API that makes it a drop-in replacement for the standard library's robotparser.RobotFileParser.

I have already looked into it. I thought that since I am already implementing the interface on top of Python's built-in robotparser.RobotFileParser (which Scrapy uses currently), implementing the interface on top of rerp requires little effort (as both have almost the same API) and can be done by the user. Instead, shouldn't we focus on providing support for parsers with a different API? Maybe we can work on providing support for rerp too towards the end of the GSoC period, if we have time remaining.

maramsumanth commented 5 years ago

https://docs.google.com/document/d/1ZhM4YN-OwOjQngEXVs1pCqhNyzgT2sLYjpnD-FH-MMg/edit (docs)

This is my final draft proposal for this project. Can you please review it, and if there are any suggestions/modifications, please let me know @whalebot-helmsman @Gallaecio :) :+1: Thanks :)

maramsumanth commented 5 years ago

I have seen that both rerp and the stdlib parser have problems with wildcards. We should find a pure-Python parser that has a different API from the current urllib.robotparser.

whalebot-helmsman commented 5 years ago

Thanks @maramsumanth, I read your draft quickly. There are no show-stoppers. @anubhavp28, I want to remind you about the formal part of the proposal. Apart from that, there are no problems. Let me take more time to think about your drafts more carefully; I will communicate final feedback on Monday.

maramsumanth commented 5 years ago

Thanks @whalebot-helmsman :) :+1:

maramsumanth commented 5 years ago

@whalebot-helmsman can I share my draft proposal with PSF on the GSoC website now? I have also improved my project proposal by making the necessary changes. Thanks :)

killdary commented 5 years ago

Hi @whalebot-helmsman, @lopuhin and @Gallaecio, I have followed Scrapy's work for some time, and an opportunity to contribute to the project has finally opened up with GSoC this year. Could you point me to some improvement or bug that I could implement? Some examples of compliant robots.txt parsers would also help a lot.

Gallaecio commented 5 years ago

@killdary Reading this thread, including the issue links in the description, should give you a good starting point.

maramsumanth commented 5 years ago

@whalebot-helmsman, is my proposal sufficient or should I make any changes? I am waiting for your feedback. Thanks for reviewing :) :+1:

maramsumanth commented 5 years ago

@whalebot-helmsman @lopuhin, there are hardly 6 days left before the deadline. Can you please look at my proposal and give feedback, so that I can improve it based on your suggestions? Thanks :)

whalebot-helmsman commented 5 years ago

@maramsumanth As I stated earlier, your proposal is great. I reread it and didn't find any new problems. Relax and wait for the accepted-proposals announcement.

maramsumanth commented 5 years ago

Thank you @whalebot-helmsman! :) :+1:

anubhavp28 commented 5 years ago

Hey @whalebot-helmsman @Gallaecio, can you review my proposal again? @whalebot-helmsman I have modified my proposal to follow the template. I have also tried to come up with pseudocode for the parser. Do you have any preference between improving CPython's robotparser and writing a new parser?

Is there any topic I should be clearer about or think more about in my proposal?

Proposal : https://docs.google.com/document/d/1jAOwcyFHJL-I--0RQ6JMuRgfTYBIz5QyAP2oBVdz3eg/edit?usp=sharing

whalebot-helmsman commented 5 years ago

@anubhavp28 I checked your proposal and suggested some grammatical edits. In general, it looks good to me.

Do you have any preference between improving CPython's robotparser and writing a new parser?

If you want to contribute your improvements upstream to CPython, it is better to improve robotparser. Beyond that, there is no advantage, and contributing upstream is a time-consuming task in itself. I don't have any preference here.

maramsumanth commented 5 years ago

@whalebot-helmsman, I have a doubt regarding the order of Allow/Disallow directives (with wildcards) under a particular user agent. What is the order of precedence for rules with wildcards? Google's implementation says it is undefined. I want to add this improvement to CPython's robotparser as part of the project; what should we consider? Is there any standard rule for directives with wildcards?

Allow: /page
Disallow: /*.htm

Is https://example.com/page.htm allowed? Thanks :)

anubhavp28 commented 5 years ago

Thanks for reviewing my proposal @whalebot-helmsman :)

whalebot-helmsman commented 5 years ago

I want to add this improvement to CPython's robotparser as part of the project; what should we consider?

The robots.txt protocol was created as a contract between crawlers and site owners. By obeying robots.txt, Scrapy declares itself a polite crawler. To be polite in all respects, in a case of confusion URLs should be disallowed.
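
A rough sketch of that policy (illustrative only; the rule representation is hypothetical and not how any existing parser stores its rules):

    import re

    def _matches(pattern, path):
        # robots.txt wildcards: '*' matches any sequence, '$' anchors the end.
        regex = re.escape(pattern).replace(r'\*', '.*').replace(r'\$', '$')
        return re.match(regex, path) is not None

    def polite_allowed(rules, path):
        # rules is a hypothetical list of (directive, pattern) pairs, e.g.
        # [('allow', '/page'), ('disallow', '/*.htm')].
        matched = {directive for directive, pattern in rules if _matches(pattern, path)}
        # If any matching rule disallows the path, be polite and skip it.
        return 'disallow' not in matched

    print(polite_allowed([('allow', '/page'), ('disallow', '/*.htm')], '/page.htm'))  # False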

maramsumanth commented 5 years ago

Thanks @whalebot-helmsman , I will include it in my proposal :)

anubhavp28 commented 5 years ago

Hey @whalebot-helmsman @Gallaecio, I'm super excited to work on Scrapy this summer. What should we use for communication now? Should we use a separate issue or email?

whalebot-helmsman commented 5 years ago

Congratulations @anubhavp28. I am on vacation now. Can we get back to this question next Tuesday or Wednesday?

anubhavp28 commented 5 years ago

Sure :)

Gallaecio commented 5 years ago

@anubhavp28 In the meantime, though, feel free to contact us by email or through the #scrapy channel in IRC. And enjoy the community bonding period!

Gallaecio commented 4 years ago

@whalebot-helmsman Shall we close this?

whalebot-helmsman commented 4 years ago

Yes, definitely @Gallaecio