
Make it easier to use Scrapy in Jupyter Notebook #4299

Open · Gallaecio opened this issue 4 years ago

Gallaecio commented 4 years ago

See http://gsoc2015.scrapinghub.com/ideas/#iphyton-ide for more information.

joybh98 commented 4 years ago

I'd like to take this, and since it's going to take some time, I'm open to pairing with anyone else who wants to work on it.

BisariaUtkarsh commented 4 years ago

> I'd like to take this, and since it's going to take some time, I'm open to pairing with anyone else who wants to work on it.

Hey, I would like to work on this issue. Is there a guide for beginners to get started?

Gallaecio commented 4 years ago

I guess the first steps, in addition to getting familiar with Scrapy, would be to learn how to extend Jupyter Notebook, so that the proof-of-concept code from http://nbviewer.ipython.org/gist/kmike/9001574 makes some sense to you.
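
The core of that gist is simply driving a crawl from ordinary Python code instead of the scrapy CLI, and then wrapping that in notebook-friendly helpers. As a rough, self-contained sketch of that idea against current Scrapy (the spider and URL here are placeholders, not the gist's actual code):

```python
# A minimal sketch of running a crawl from plain Python code, which is
# the kind of integration the proof-of-concept notebook builds on.
import scrapy
from scrapy.crawler import CrawlerProcess

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # CSS/XPath selection is what the gist wraps in display helpers.
        for quote in response.css("div.quote"):
            yield {"text": quote.css("span.text::text").get()}

process = CrawlerProcess(settings={"LOG_LEVEL": "WARNING"})
process.crawl(QuotesSpider)
process.start()  # blocks until the crawl finishes; see the reactor caveat discussed later in this thread
```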

BisariaUtkarsh commented 4 years ago

> I guess the first steps, in addition to getting familiar with Scrapy, would be to learn how to extend Jupyter Notebook, so that the proof-of-concept code from http://nbviewer.ipython.org/gist/kmike/9001574 makes some sense to you.

Hey, I have been trying to run this notebook on Binder as well as on my system, but I can't get past some errors.

On Binder, the error is as follows: [screenshot]

On my system, the error is as follows: [screenshot]. On my system I am unable to import any of the modules from scrapy. Any help would be much appreciated; thanks in advance.

joybh98 commented 4 years ago

@BisariaUtkarsh would you like to work on this together?

Gallaecio commented 4 years ago

@BisariaUtkarsh On Binder you seem to be missing lxml. I am not familiar with Binder, so I cannot tell you how, but you need to install lxml there. lxml requires some C libraries (libxml2 and libxslt), so it may not be trivial to install. See https://lxml.de/installation.html
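
That said, on a typical mybinder.org setup you could try installing it from the first notebook cell, or listing the packages in the repository's requirements.txt that Binder uses when building the image (untested sketch):

```python
# First notebook cell (IPython %pip magic; installs into the running
# kernel's environment). Alternatively, add lxml and scrapy to a
# requirements.txt at the repository root for Binder to install at build time.
%pip install lxml scrapy
```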

On your system, you are simply suffering the effects of Scrapy having evolved since that proof-of-concept code was initially written. scrapy.project was removed in Scrapy 1.6.0 (see https://docs.scrapy.org/en/latest/news.html). In the case of project, I don’t see it being used in that code, so you can probably just remove project from the imports. But if you run into similar issues with code that is actually used, you might need to check the release notes I’ve just linked and other parts of the Scrapy documentation to find a replacement.
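
Concretely, the kind of updates needed look roughly like this (an illustrative summary based on the release notes, not the gist's full code):

```python
# Old imports from the proof-of-concept era:
#   from scrapy import project                     # removed in Scrapy 1.6.0; unused in the gist, just delete
#   from scrapy.spider import BaseSpider           # module and class have since been renamed
#   from scrapy.selector import HtmlXPathSelector  # replaced by the unified Selector
#   from scrapy.xlib.pydispatch import dispatcher  # scrapy.xlib was removed

# Current equivalents:
from scrapy.spiders import Spider
from scrapy.selector import Selector
from pydispatch import dispatcher  # from the external PyDispatcher package
```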

BisariaUtkarsh commented 4 years ago

> @BisariaUtkarsh would you like to work on this together?

@joybhallaa I'm not sure if there can be two applicants for a single GSoC idea. We may have to confirm this with the mentor concerned. @Gallaecio, can we do this?

BisariaUtkarsh commented 4 years ago

> @BisariaUtkarsh On Binder you seem to be missing lxml. I am not familiar with Binder, so I cannot tell you how, but you need to install lxml there. lxml requires some C libraries (libxml2 and libxslt), so it may not be trivial to install. See https://lxml.de/installation.html
>
> On your system, you are simply suffering the effects of Scrapy having evolved since that proof-of-concept code was initially written. scrapy.project was removed in Scrapy 1.6.0 (see https://docs.scrapy.org/en/latest/news.html). In the case of project, I don’t see it being used in that code, so you can probably just remove project from the imports. But if you run into similar issues with code that is actually used, you might need to check the release notes I’ve just linked and other parts of the Scrapy documentation to find a replacement.

Hey @Gallaecio, thanks for sharing the documentation; it helped a lot in getting past some other issues. However, I'm now stuck on a BrokenPipeError and couldn't find a workaround on Stack Overflow either. Any suggestions on how to tackle this?

[screenshot]

Gallaecio commented 4 years ago

I don’t think 2 students can work on the same idea for GSoC. I don’t know if @joybhallaa is planning to join GSoC this year, though.

@BisariaUtkarsh regarding your current error, it is hard to tell where it comes from, since your screenshot does not contain the whole traceback. Could you share the whole traceback as text?

Moreover, unless you share the changes you made to https://nbviewer.jupyter.org/gist/kmike/9001574 to make it work with the latest version of Scrapy, it could take me a while to figure out those changes myself in order to try and reproduce your issue.

BisariaUtkarsh commented 4 years ago

I made several changes as per the documentation, such as:

- Removed project from the imports
- scrapy.spider → scrapy.spiders
- BaseSpider → Spider
- scrapy.xlib.pydispatch → pydispatcher
- Queue() → multiprocessing.Queue()
- HtmlXPathSelector → Selector

Here is the link to my notebook: https://github.com/BisariaUtkarsh/test_scrapy/blob/master/ipython-scrapy.ipynb

BisariaUtkarsh commented 4 years ago

Error:

BrokenPipeError                           Traceback (most recent call last)
<ipython-input> in <module>
----> 1 show_xpath('https://scrapinghub.com/crawlera', '//a[contains(text(), "i")]')

<ipython-input> in show_xpath(url, xpath)
     32
     33 def show_xpath(url, xpath):
---> 34     response = download(url)
     35     hxs = Selector(response)
     36     show_hxs_select(hxs, xpath)

<ipython-input> in download(url)
     79     Download 'url' using Scrapy. Return Response.
     80     """
---> 81     response = _download(url)
     82     return response.replace(body=set_base(response.body, url))

<ipython-input> in _download(url)
     64     spider = ResponseSpider(url)
     65     crawler = CrawlerWorker(result_queue, spider)
---> 66     crawler.start()
     67     item = result_queue.get()[0]
     68     result_queue.cancel_join_thread()

~\Anaconda3\envs\DIP\lib\multiprocessing\process.py in start(self)
    103                'daemonic processes are not allowed to have children'
    104         _cleanup()
--> 105         self._popen = self._Popen(self)
    106         self._sentinel = self._popen.sentinel
    107         # Avoid a refcycle if the target function holds an indirect

~\Anaconda3\envs\DIP\lib\multiprocessing\context.py in _Popen(process_obj)
    221     @staticmethod
    222     def _Popen(process_obj):
--> 223         return _default_context.get_context().Process._Popen(process_obj)
    224
    225 class DefaultContext(BaseContext):

~\Anaconda3\envs\DIP\lib\multiprocessing\context.py in _Popen(process_obj)
    320     def _Popen(process_obj):
    321         from .popen_spawn_win32 import Popen
--> 322         return Popen(process_obj)
    323
    324 class SpawnContext(BaseContext):

~\Anaconda3\envs\DIP\lib\multiprocessing\popen_spawn_win32.py in __init__(self, process_obj)
     63         try:
     64             reduction.dump(prep_data, to_child)
---> 65             reduction.dump(process_obj, to_child)
     66         finally:
     67             set_spawning_popen(None)

~\Anaconda3\envs\DIP\lib\multiprocessing\reduction.py in dump(obj, file, protocol)
     58 def dump(obj, file, protocol=None):
     59     '''Replacement for pickle.dump() using ForkingPickler.'''
---> 60     ForkingPickler(file, protocol).dump(obj)
     61
     62 #

BrokenPipeError: [Errno 32] Broken pipe

Gallaecio commented 4 years ago

Have you tried searching the internet for both the exception class and the class raising it? “BrokenPipeError ForkingPickler”.

You might also want to try running the code as a regular Python script on your system, to see whether the issue can be reproduced that way as well or is specific to Jupyter Notebook.
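
If it turns out to be Windows' "spawn" start method pickling the process object (the ForkingPickler frame in your traceback suggests that), one pattern that might sidestep it is to create everything Scrapy-related inside the child process, e.g. as a regular script (untested sketch; all names here are illustrative, not the gist's code):

```python
import multiprocessing

def crawl(url, queue):
    # Imports and class definitions happen in the child process, so
    # nothing unpicklable has to be sent over the pipe to start it.
    import scrapy
    from scrapy.crawler import CrawlerProcess

    class OneOffSpider(scrapy.Spider):
        name = "oneoff"

        def start_requests(self):
            yield scrapy.Request(url)

        def parse(self, response):
            queue.put(response.css("title::text").get())

    process = CrawlerProcess(settings={"LOG_LEVEL": "ERROR"})
    process.crawl(OneOffSpider)
    process.start()

if __name__ == "__main__":  # required with "spawn" (the Windows default)
    queue = multiprocessing.Queue()
    worker = multiprocessing.Process(target=crawl, args=("https://example.com", queue))
    worker.start()
    print(queue.get())  # blocks until the child puts a result
    worker.join()
```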

joybh98 commented 4 years ago

@BisariaUtkarsh I inquired first about picking up this issue; I wanted someone to give me the nod, since this is a GSoC issue. And yes, I am planning to join GSoC this year.

joybh98 commented 4 years ago

@BisariaUtkarsh at least reply.

BisariaUtkarsh commented 4 years ago

@joybhallaa Hey, right now I'm not sure if I will go forward with this issue, so you may carry on with it.

never2average commented 4 years ago

Hey @joybhallaa, are you working on this issue? I would like to be of help.

joybh98 commented 4 years ago

@never2average I'm going to work on this issue, and as @Gallaecio said, two people can't work on the same issue.

never2average commented 4 years ago

What changes have you made so far? I would like to offer some suggestions.

joybh98 commented 4 years ago

@never2average I'm currently setting up a development environment on my machine, as @BisariaUtkarsh told me that he will not be working on this issue anymore.

joybh98 commented 4 years ago

@never2average I will be open to suggestions once someone gives me approval. I prefer getting approval from members of the organization, as they are far more experienced, can tell me whether my ideas are efficient, and can help the project grow.

Gallaecio commented 4 years ago

> @never2average I'm going to work on this issue, and as @Gallaecio said, two people can't work on the same issue.

2 people cannot be selected for the same idea, but multiple students may submit proposals for the same idea. Anyone should feel free to work on a proposal for this or any other idea, regardless of other candidate students.

Gallaecio commented 4 years ago

@joybhallaa I believe you are going in the right direction, yes :slightly_smiling_face:

joybh98 commented 4 years ago

@Gallaecio :+1:

joybh98 commented 4 years ago

@Gallaecio what deliverables would you like to see in a proposal, and what features are a must-have? I am going to submit my first proposal today and would really like some input.

joybh98 commented 4 years ago

Hey @Gallaecio @wRAR, whenever you're free, please take a look at my draft proposal. I would appreciate it.

Gallaecio commented 4 years ago

@joybhallaa I’ve had a look at your proposal.

I see no mention of Twisted in your proposal. However, it was my impression that Scrapy being based on Twisted, and hence using a (non-restartable) Twisted reactor as its event loop, was one of the main issues you face when using Scrapy within Jupyter Notebook. See this old proposal I’ve just found on the internet. Doesn’t that issue still exist? Will your proposal include work towards solving or easing it somehow?
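
To make that concrete, this is roughly what fails today (a sketch; DemoSpider is a placeholder):

```python
# CrawlerProcess runs the Twisted reactor, and a reactor cannot be
# started twice in one Python process, so a second crawl in the same
# notebook kernel fails.
import scrapy
from scrapy.crawler import CrawlerProcess

class DemoSpider(scrapy.Spider):
    name = "demo"
    start_urls = ["https://example.com"]

    def parse(self, response):
        yield {"title": response.css("title::text").get()}

process = CrawlerProcess(settings={"LOG_LEVEL": "ERROR"})
process.crawl(DemoSpider)
process.start()   # first crawl: works, runs the reactor to completion

process = CrawlerProcess(settings={"LOG_LEVEL": "ERROR"})
process.crawl(DemoSpider)
process.start()   # raises twisted.internet.error.ReactorNotRestartable
```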

Also, @BisariaUtkarsh had quite some trouble working through that old snippet, http://nbviewer.ipython.org/gist/kmike/9001574. Did you have better luck?

joybh98 commented 4 years ago

@Gallaecio Thanks for taking a look at my proposal. To be honest, I was not aware of that limitation of Twisted; I am looking into it now.

> Also, @BisariaUtkarsh had quite some trouble working through that old snippet, http://nbviewer.ipython.org/gist/kmike/9001574. Did you have better luck?

I was able to run this old snippet on Google Colab with no problems.

joybh98 commented 4 years ago

@Gallaecio I have a question regarding Twisted:

  1. Is it necessary to stop the reactor after the crawlers have finished, or can we keep it running?
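
From what I gather from the Scrapy docs so far, CrawlerRunner lets you manage the reactor yourself, so you can chain several crawls on it and stop it just once at the end — something like this sketch (DemoSpider is a placeholder spider):

```python
import scrapy
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner

class DemoSpider(scrapy.Spider):
    name = "demo"
    start_urls = ["https://example.com"]

    def parse(self, response):
        yield {"title": response.css("title::text").get()}

runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl_twice():
    yield runner.crawl(DemoSpider)   # first crawl
    yield runner.crawl(DemoSpider)   # second crawl, same reactor
    reactor.stop()                   # stop only once everything is done

crawl_twice()
reactor.run()  # blocks here until reactor.stop() is called
```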
joybh98 commented 4 years ago

@Gallaecio I've updated my proposal with the proposed changes; feel free to take a look now :) Let me know if you have any questions. EDIT: I've uploaded my final proposal; you can check that out as well and suggest changes/improvements.