pkmoore / rrapper

Wrap rr in order to inject abnormal system call behavior

rrtest create runs forever #66

Open mnm678 opened 5 years ago

mnm678 commented 5 years ago

When making a test for Python's scrapy library, the rrtest create command runs forever. It looks like it's getting stuck in Python's subprocess library.
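The suspected failure mode (a parent process blocked indefinitely in a wait on a child that never exits) can be illustrated with a minimal Python 3 sketch. The child command below is purely illustrative and unrelated to rrtest internals; a timeout is used so the hang is observable rather than reproduced:

```python
import subprocess
import sys

# Illustrative sketch: Popen.wait() blocks until the child exits, so a
# child that never terminates stalls the parent, consistent with
# "getting stuck in Python's subprocess library". The sleeping child
# here stands in for the non-terminating scrapy process.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
try:
    child.wait(timeout=1)  # raises TimeoutExpired while the child still runs
    hung = False
except subprocess.TimeoutExpired:
    hung = True
    child.kill()   # clean up the demo child
    child.wait()
print("parent would have hung:", hung)
```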

Here is the command: rrtest create --name scrapy --command "scrapy runspider test.py". And here are the contents of test.py:

import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://blog.scrapinghub.com']

    def parse(self, response):
        for title in response.css('.post-header>h2'):
            yield {'title': title.css('a ::text').extract_first()}

        for next_page in response.css('div.prev-post > a'):
            yield response.follow(next_page, self.parse)

ex0dus-0x commented 5 years ago

Hey, sorry for the late response!

I attempted to replicate this through rrtest, and initially thought it was a problem with subprocess.Popen hanging because the scrapy child process never returned a status code to terminate its parent. However, running this with just rr yielded:

$ rr record -n scrapy runspider test.py
rr: Saving execution to trace directory `/home/crashsim/.local/share/rr/scrapy-1'.
2018-12-03 20:39:26 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: scrapybot)
2018-12-03 20:39:26 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.1, w3lib 1.19.0, Twisted 18.9.0, Python 2.7.12
(default, Nov 12 2018, 14:36:49) - [GCC 5.4.0 20160609], pyOpenSSL 18.0.0 (OpenSSL 1.1.0j  20 Nov 2018), cryptography 2.4.2, Platform Linux-4.4.0-131-generic-i686-with-Ubuntu-16.04-xenial
Traceback (most recent call last):
  File "/home/crashsim/.local/bin/scrapy", line 11, in <module>
    sys.exit(execute())
  File "/home/crashsim/.local/lib/python2.7/site-packages/scrapy/cmdline.py", line 150, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/home/crashsim/.local/lib/python2.7/site-packages/scrapy/cmdline.py", line 90, in _run_print_help
    func(*a, **kw)
  File "/home/crashsim/.local/lib/python2.7/site-packages/scrapy/cmdline.py", line 157, in _run_command
    cmd.run(args, opts)
  File "/home/crashsim/.local/lib/python2.7/site-packages/scrapy/commands/runspider.py", line 80, in run
    module = _import_file(filename)
  File "/home/crashsim/.local/lib/python2.7/site-packages/scrapy/commands/runspider.py", line 21, in _import_file
    module = import_module(fname)
  File "/usr/lib/python2.7/importlib/__init__.py", line 37, in import_module
    __import__(name)
  File "/home/crashsim/test.py", line 14
    return 0
SyntaxError: 'return' with argument inside generator

Judging from the traceback, it seems that scrapy itself was spawning another process, and rr was unfortunately unable to record the execution of that child process. I believe this is the corresponding limitation from rr's project website:

cannot record processes that share memory with processes outside the recording tree. This is an inherent feature of the design. rr automatically disables features such as X shared memory for recorded processes to avoid this problem.

(@pkmoore @alyptik thoughts?)

With that said, if it is possible to instantiate scrapy as a single process (i.e., invoking the crawler directly rather than going through the scrapy runspider command), that might be the next step to try.
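A single-process variant could look like the sketch below, which uses scrapy's CrawlerProcess API to run the spider inside the invoking interpreter instead of shelling out to scrapy runspider. This is only a sketch: the import is guarded so it is inspectable without scrapy installed, calling main() would start a live crawl, and whether rr records this cleanly has not been verified here.

```python
# Sketch (assumes scrapy is installed; import guarded so the file is
# loadable without it): run BlogSpider in-process via CrawlerProcess,
# so `rr record` sees one process tree rather than a spawned child.
try:
    import scrapy
    from scrapy.crawler import CrawlerProcess
except ImportError:
    scrapy = None  # scrapy not available in this environment

if scrapy is not None:
    class BlogSpider(scrapy.Spider):
        name = "blogspider"
        start_urls = ["https://blog.scrapinghub.com"]

        def parse(self, response):
            for title in response.css(".post-header>h2"):
                yield {"title": title.css("a ::text").extract_first()}

    def main():
        # CrawlerProcess runs Twisted's reactor in this process;
        # start() blocks until the crawl finishes.
        process = CrawlerProcess()
        process.crawl(BlogSpider)
        process.start()
```

One could then record with something like rr record -n python single_process.py (the script name is hypothetical), avoiding the scrapy runspider child entirely.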

pkmoore commented 5 years ago

I wasn't aware of this limitation, but it makes sense. I think this is something we need to document, but I don't think it's on us to fix it in rr.