Closed by whalebot-helmsman 4 years ago
Is there a channel so we can talk about the issues and ideas or should we just comment here?
> we just comment here?

I think it's great, because then everyone can participate and nobody has to ask the same questions twice.
Hi @whalebot-helmsman, I would love to work on Scrapy under GSoC this year. I have made a couple of pull requests, and I am looking to contribute more to Scrapy before I start working on a proposal. Can you suggest some issues or enhancements that you would like to see fixed/implemented?
Also, can you suggest a few fully compliant robots.txt parsers to look at?
@anubhavp28 The current redirect middleware https://github.com/scrapy/scrapy/blob/master/scrapy/downloadermiddlewares/redirect.py does not preserve the redirect path. Sometimes there are several redirects for a single source URL. It would be great to preserve the redirect history somewhere in the request's meta.
The most compliant robots.txt parser known to me is https://github.com/seomoz/reppy .
@whalebot-helmsman I guess it has already been implemented: https://github.com/scrapy/scrapy/blob/c72ab1d4ba5dad3c68b12c473fa55b7f1f144834/scrapy/downloadermiddlewares/redirect.py#L35
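For reference, the behaviour being discussed, keeping the whole redirect chain under the `redirect_urls` meta key, can be sketched roughly like this. This is a simplified stand-alone imitation of what the linked middleware does, not the actual Scrapy code:

```python
# Simplified sketch of how a redirect middleware can preserve the
# redirect chain in the request's meta dict (modelled on the
# `redirect_urls` key used by scrapy/downloadermiddlewares/redirect.py;
# illustration only, not the real implementation).

def record_redirect(meta, old_url):
    """Append the pre-redirect URL to the redirect history in meta."""
    history = meta.get('redirect_urls', [])
    meta['redirect_urls'] = history + [old_url]
    return meta

meta = {}
# First redirect: http://a.example -> http://b.example
record_redirect(meta, 'http://a.example')
# Second redirect: http://b.example -> http://c.example
record_redirect(meta, 'http://b.example')

print(meta['redirect_urls'])  # full redirect path of the final request
```

In real Scrapy spiders the same information is read back with `response.meta.get('redirect_urls')`.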
@whalebot-helmsman I want to work on an issue starting next week too. Can you suggest a bug/feature I can begin working on? Thanks
https://github.com/scrapy/scrapy/labels/help%20wanted and https://github.com/scrapy/scrapy/labels/good%20first%20issue are good labels for finding a first project to work on.
Sorry @anubhavp28, for some unknown reason I decided this feature was not implemented. I will be more mindful next time.
@whalebot-helmsman According to the ideas page:

> An interface for robots.txt parsers that abstracts the existing parser, and permits a user to substitute a different parser.
So we need to code many implementations (i.e. parsers) for Scrapy, so that users can substitute one according to their needs? Do we need many robots.txt middlewares to go with the parsers? I don't think there'd be room for several different built-in robots.txt implementations in Scrapy.
> So we need to code many implementations (i.e. parsers) for Scrapy, so that users can substitute one according to their needs?

I was thinking about providing a common interface and implementations for 2 existing parsers. Writing a custom parser is a stretch goal; we will write a custom parser after we finish the common interface and the implementations for the existing parsers.
I have gone through seomoz/reppy and realised that it is better to write our own custom parser instead of depending on reppy; also, urllib.robotparser has some issues.
In this context:

> I was thinking about providing a common interface and implementations for 2 existing parsers
@whalebot-helmsman As you were saying about providing implementations for 2 existing parsers, are these 2 parsers:

1) the stdlib Python parser and the reppy parser, or
2) the stdlib Python parser and another one which is unknown?
Since we aren't relying on reppy, we can't implement it!
> I have gone through seomoz/reppy and realised that it is better to write our own custom parser instead of depending on reppy

What problems did you find with the reppy parser?
The only problem I found with reppy was its CPython dependency; other things, like wildcards, are handled well in reppy.
@whalebot-helmsman, suppose we have the following directives in https://example.com/robots.txt:

User-agent: *
Allow: /page
Disallow: /*.html

Will https://example.com/page.html be crawled? What is the order of precedence for rules with wildcards? Should we follow the same standard first-matching-directive rule here? Thanks :)
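For illustration, the longest-match rule that was later standardised in RFC 9309 (the matching rule with the longest path pattern wins, and Allow wins ties) would resolve the example above as disallowed, since `/*.html` is longer than `/page`. The following is my own simplified sketch of that rule, not any particular parser's code:

```python
import re

def _matches(pattern, path):
    # Translate robots.txt wildcards: '*' matches any characters,
    # '$' anchors the end of the URL path.
    regex = '^' + re.escape(pattern).replace(r'\*', '.*').replace(r'\$', '$')
    return re.match(regex, path) is not None

def allowed(rules, path):
    """rules: list of ('allow'|'disallow', pattern) pairs.
    The longest matching pattern wins; on a tie, 'allow' wins;
    if nothing matches, the URL is allowed."""
    best = None  # (pattern_length, is_allow)
    for kind, pattern in rules:
        if _matches(pattern, path):
            candidate = (len(pattern), kind == 'allow')
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]

rules = [('allow', '/page'), ('disallow', '/*.html')]
print(allowed(rules, '/page.html'))  # the longer Disallow pattern wins
print(allowed(rules, '/page'))       # only the Allow rule matches
```

Note that at the time of this thread no formal standard existed; Google documented that conflicts between wildcard rules fall back to the least restrictive rule.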
> is better to write our own custom parser

There isn't enough time in a GSoC project to write a full robots.txt parser, I think. An interface and implementations for existing parsers are a safer bet.
But you said:

> Writing a custom parser is a stretch goal
So, is improving robotparser OK? Because I am finding many bugs in the current parser and I want to fix them.
> stretch goal

It is an additional goal, not the first one. We can improve robotparser after we finish the interface part, if there is time.
@whalebot-helmsman, what exactly does the interface do here, and what methods need to be implemented in the interface class? What are the inputs and outputs of this interface class? Can you give a detailed description? Thanks :)
By ‘interface part’ I believe @whalebot-helmsman means ‘adding robotparser support to Scrapy’ in general, I don’t think he is speaking about a specific interface class.
We use robotparser methods directly: https://github.com/scrapy/scrapy/blob/820adb69c0b5ff0080826a9c7af1151973307b45/scrapy/downloadermiddlewares/robotstxt.py#L92. It would be great to add an abstract interface here, provide interface implementations for the existing parsers, and use the most conformant parser from the list of supported parsers at runtime.
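One possible shape for such an abstraction, sketched here as a hypothetical example (the class names and the `allowed()` method are illustrative, not an agreed design), with an adapter over the stdlib parser that Scrapy already uses:

```python
from abc import ABC, abstractmethod
from urllib import robotparser

class RobotParser(ABC):
    """Hypothetical common interface for robots.txt parser back-ends."""

    @abstractmethod
    def allowed(self, url, user_agent):
        """Return True if `user_agent` may fetch `url`."""

class PythonRobotParser(RobotParser):
    """Adapter wrapping the stdlib urllib.robotparser implementation."""

    def __init__(self, robotstxt_body):
        self._parser = robotparser.RobotFileParser()
        self._parser.parse(robotstxt_body.splitlines())

    def allowed(self, url, user_agent):
        return self._parser.can_fetch(user_agent, url)

rp = PythonRobotParser("User-agent: *\nDisallow: /private")
print(rp.allowed('/public/index.html', 'mybot'))   # True
print(rp.allowed('/private/data.html', 'mybot'))   # False
```

An adapter for reppy or any other back-end would then only need to implement the same `allowed()` signature, and the middleware could stay parser-agnostic.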
@whalebot-helmsman So at runtime, should the user be prompted with a choice on the command line, so that they can select from the list of parsers? Or, another idea: we could add ROBOTSTXT_PARSER1_OBEY and ROBOTSTXT_PARSER2_OBEY variables in settings (defaulting to False), which the user can switch according to their needs. Since the most conformant parser I know of so far is reppy, can we use the reppy parser in this project?
Wouldn’t a ROBOTSTXT_PARSER setting suffice? One that lets you choose which parser to use, and uses the current parser by default.
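Something along these lines, as a hypothetical settings fragment (the setting name follows the suggestion above, and the import path is purely illustrative):

```python
# settings.py (hypothetical): select the robots.txt parser back-end
# by import path; when unset, the stdlib-based default parser is used.
ROBOTSTXT_PARSER = 'myproject.robotstxt.ReppyRobotParser'
```

A single string setting like this keeps configuration declarative and lets third parties ship their own parser adapters without any Scrapy changes.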
Hey @whalebot-helmsman @Gallaecio, I have done some work on the interface part of my GSoC proposal (https://docs.google.com/document/d/1l740K3MXcJ7SJN2oN0MNitM_Udzz42_8uHLAgWbhXsg/edit?usp=sharing ). I am looking for feedback on the approach I have used for creating the interface. I am also facing a couple of problems; I have listed them below:
I am facing difficulties in finding another popular Python-based robots.txt parser (other than reppy). Can you suggest some Python-based robots.txt parsers other than reppy? I have found a few parsers implemented in Go, C++ and Node, but I am finding it difficult to call them from Python. For example, the robots parser written in Go (the parser I mentioned in the proposal along with reppy): I tried compiling it to a C shared library, hoping I would be able to call it from Python using ctypes. I don't have previous experience with this and I might be wrong, but I think that since the parser uses Go-specific types it cannot be compiled to a C shared library, and hence cannot be called from Python unless I rewrite parts of it with C-compatible types. Is there any other way of calling Go functions from Python?
I have seen that a lot of directives have been added since the original 1994 spec and the 1997 draft of robots.txt. Hence, it can easily be the case that a parser does not support all of them. How should we handle a parser not supporting a certain directive? For example, the Host directive may not be supported by all parsers; say the interface spec recommends a preferred_host() function, then should calling preferred_host() raise some custom exception notifying that the parser doesn't support it, or should it just act as if no preferred domain is specified with the Host directive in robots.txt?
Quoting its documentation:

> The module has two classes: RobotExclusionRulesParser and RobotFileParserLookalike. The latter offers all the features of the former and also bolts on an API that makes it a drop-in replacement for the standard library's robotparser.RobotFileParser.
I have already looked into it. I thought that since I am already implementing the interface on top of Python's built-in robotparser.RobotFileParser (which Scrapy currently uses), implementing the interface on top of rerp requires little effort (as both have almost the same API) and can be done by the user. Instead, shouldn't we focus on providing support for parsers with a different API? Maybe we can work on providing support for rerp too towards the end of the GSoC period, if we have time remaining.
https://docs.google.com/document/d/1ZhM4YN-OwOjQngEXVs1pCqhNyzgT2sLYjpnD-FH-MMg/edit (docs)
This is my final draft proposal for this project. Can you please review it, and if there are any suggestions/modifications, please let me know @whalebot-helmsman @Gallaecio :) :+1: Thanks :)
I have seen that both rerp and the stdlib parser have problems with wildcards. We should find a pure Python parser which has a different API from the current urllib.robotparser.
Thanks @maramsumanth, I read your draft quickly. There are no show-stoppers. @anubhavp28, I want to remind you about the formal part of the proposal. Apart from this, there are no problems. Let me take more time to think about your drafts more carefully; I will communicate final feedback on Monday.
Thanks @whalebot-helmsman :) :+1:
@whalebot-helmsman Can I share my draft proposal with PSF on the GSoC website now? Also, I have improved my project proposal by making the necessary changes. Thanks :)
Hi @whalebot-helmsman, @lopuhin and @Gallaecio, I have followed Scrapy's work for some time, and an opportunity to contribute to the project finally opened up with GSoC this year. Could you point me to some improvement or bug that I could implement? Some examples of compliant robots.txt parsers would also help a lot.
@killdary Reading this thread, including the issue links on the description, should give you a good starting point.
@whalebot-helmsman, is my proposal sufficient or should I make any changes? Waiting for your feedback. Thanks for reviewing :) :+1:
@whalebot-helmsman @lopuhin, there are hardly 6 days left until the deadline. Can you please look into my proposal and give feedback, so that I can improve it based on your suggestions? Thanks :)
@maramsumanth As I stated earlier, your proposal is great. I reread it and didn't find any new problems. Relax and wait for the accepted proposals announcement.
Thank you @whalebot-helmsman! :) :+1:
Hey @whalebot-helmsman @Gallaecio, can you review my proposal again? @whalebot-helmsman I have modified my proposal to follow the template. I have tried to come up with pseudocode for the parser as well. Do you have any preference between improving CPython robotparser and writing a parser?
Is there any topic I should be clearer about, or think more about, in my proposal?
Proposal : https://docs.google.com/document/d/1jAOwcyFHJL-I--0RQ6JMuRgfTYBIz5QyAP2oBVdz3eg/edit?usp=sharing
@anubhavp28 I checked your proposal and suggested some grammatical edits. In general it looks good to me.
> Do you have any preference between improving CPython robotparser and writing a parser?

If you want to contribute your improvements upstream to CPython, it is better to improve robotparser. Besides that, there are no advantages, and contributing upstream is a time-consuming task in itself. I don't have any preference here.
@whalebot-helmsman, I have a doubt regarding the order of Allow/Disallow directives (with wildcards) under a particular user-agent. What is the order of precedence for rules with wildcards? Google's implementation says it is undefined. I want to add this improvement to CPython's robotparser as part of the project; what should we consider? Is there any standard rule for directives with wildcards?
Allow: /page
Disallow: /*.htm

Is https://example.com/page.htm allowed? Thanks :)
Thanks for reviewing my proposal @whalebot-helmsman :)
> I want to add this improvement to CPython's robotparser as part of the project; what should we consider?

The robots.txt protocol was created as a contract between crawlers and site owners. By obeying robots.txt, Scrapy declares itself a polite crawler. To be polite in all respects, in a case of confusion URLs should be disallowed.
Thanks @whalebot-helmsman , I will include it in my proposal :)
Hey @whalebot-helmsman @Gallaecio, I'm super excited to work on Scrapy this summer. What should we use for communication now? Should we use a separate issue or email?
Congratulations @anubhavp28. I am on vacation now. Can we get back to this question next Tuesday or Wednesday?
Sure :)
@anubhavp28 In the meantime, though, feel free to contact us by email or through the #scrapy channel in IRC. And enjoy the community bonding period!
@whalebot-helmsman Shall we close this?
Yes, definitely @Gallaecio
This issue is a single place for all students and mentors to discuss ideas and proposals for the "Support for Different robots.txt Parsers" GSoC project. First of all, every student involved should have a contribution to https://github.com/scrapy/scrapy . It does not need to be very big, just enough to get your hands dirty and get accustomed to the processes and tools. The contribution should be made in the form of an open pull request solving a problem not related to the robots.txt project. You can read the open issues or open PRs and choose one for yourself, or you can ask here and mentors and contributors will make some recommendations.
Problems with the current robots.txt implementation are tracked in the relevant issues:

#892
Previous attempts to fix issues can be seen in the relevant PRs:
Ask for more details in the comments.