jschnurr / scrapyscript

Run a Scrapy spider programmatically from a script or a Celery task - no project required.
MIT License

Issue with "return self.results.get()" in "Processor().run()" causing processing to hang forever #3

Closed JoeJasinski closed 2 years ago

JoeJasinski commented 7 years ago

Thank you for writing scrapyscript; it's been very helpful!

However, I have a script that looks something like the one below, written in Python 3.5. I noticed that when I run Processor().run() as shown, the spider runs to completion but hangs after the scrape is done (which causes my Celery 4 jobs to hang and never finish).

I verified that in the scrapyscript code, in the run() method, processing makes it through p.start(), p.join(), and p.terminate(), but hangs on the return statement, which looks up results in a Queue. If I comment out the return statement (I personally don't care about the returned results), the processing finishes.

from scrapy.utils.project import get_project_settings

from scrapyscript import Job, Processor
from myproject.spiders.myspider import MySpider

scraper_args = dict(arg1="1", arg2="2")

config = get_project_settings()
spider = MySpider(**scraper_args)
# The api for scrapyscript requires us to pass in the
# scraper_args a second time. The constructor for the spider
# is called twice: once above and once again in the Job.
job = Job(spider, payload=scraper_args)
Processor(settings=config).run(job)

Impacted area: https://github.com/jschnurr/scrapyscript/blob/master/scrapyscript.py#L118
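For context, the run() method referenced above looked roughly like this at the time (a sketch pieced together from the description and later comments in this thread, not the exact source):

def run(self, jobs):
    # Spawn a child process to run the crawl, then collect results
    # from a queue the child process fills.
    p = Process(target=self._crawl, args=[jobs])
    p.start()
    p.join()
    p.terminate()
    return self.results.get()  # reported to hang here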

(As an aside, I also overrode the __init__() method in my spider to accept arguments, but noticed that I have to pass the custom arguments both to MySpider and as the payload; maybe I should open a different ticket for that?)
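For reference, the spider-side override described here is the usual Scrapy pattern of accepting custom keyword arguments in __init__() (a minimal sketch with made-up argument names):

import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"

    def __init__(self, arg1=None, arg2=None, *args, **kwargs):
        # Keep a reference to the custom arguments passed at construction time.
        super().__init__(*args, **kwargs)
        self.arg1 = arg1
        self.arg2 = arg2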

zknicker commented 6 years ago

+1

jschnurr commented 6 years ago

@JoeJasinski, I was not able to reproduce the problem. However, I've made some other changes based on your feedback; can you please review v1.0.0 and post a complete, non-working example if there is still an issue?

vidakDK commented 6 years ago

@jschnurr I have the same issue - scrapy job hangs forever on the process join line, here:

.env/lib/python3.6/site-packages/billiard/popen_fork.py in wait(self, timeout)
     55                     return None
     56             # This shouldn't block if wait() returned successfully.
---> 57             return self.poll(os.WNOHANG if timeout == 0.0 else 0)
     58         return self.returncode
     59 

For some reason this happens only when I create an HTMLParser; I assume some memory is not being released or something.

Here's the offending code that makes the process hang:

def _parse_historical_data(self, response):
    data = response.xpath('//tr[@class="text-right"]').extract()
    hist_data = []
    if data:
        for item in data:
            p = HistoricalDataHTMLParser()
            p.feed(item)
            hist_data.append(p.tds)
            # p.close()
            # del p
    return hist_data

class HistoricalDataHTMLParser(HTMLParser):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.in_td = False
        self.tds = []

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self.in_td = True

    def handle_data(self, data):
        if self.in_td:
            self.tds.append(data.strip())

    def handle_endtag(self, tag):
        self.in_td = False

Calling the _parse_historical_data method causes the hang above; the commented-out close() and del lines don't help.

Note: I've had the same problem in another scraper, which involved a PDF parser built on the pdfminer.six library.

UPDATE: I believe the problem lies in returning the scraped data. If, for any reason, the returned data is bound to Scrapy objects, the memory is not released and the sub-process is never joined. I solved the problem by storing the scraped data in the database from within Scrapy instead of returning it through the Processor's run() method.
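One way to read that workaround, sketched with a hypothetical save_to_db() helper: persist the data inside the spider (or an item pipeline) and avoid sending anything heavy back through the results queue.

def parse(self, response):
    rows = response.xpath('//tr[@class="text-right"]').extract()
    # Persist directly instead of returning the rows, so nothing bound to
    # Scrapy objects (or any large payload) has to cross the process boundary.
    save_to_db(rows)  # placeholder for whatever storage layer is in use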

jschnurr commented 6 years ago

@vidakDK what were you trying to return?

This rather contrived example exhibits the same symptom; the spider runs to completion (logging INFO: Spider closed (finished)) but never exits:

import scrapy
from scrapy import Spider

class TestBadSpider(Spider):
    name = 'badspider'

    def start_requests(self):
        yield scrapy.Request('http://www.python.org')

    def parse(self, response):
        return {'payload': response}

The same thing happens using Nosetests and Scrapyscript, but not with scrapy runspider.

What I don't understand is why we would ever get into this situation; why are you returning data bound to Scrapy objects?
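For contrast, a parse callback that returns only plain extracted values rather than the Response itself would look something like this (a minimal sketch):

def parse(self, response):
    # Extract plain strings so the returned payload is not tied to
    # Scrapy's Response/Selector objects.
    return {'title': response.xpath('//title/text()').get()}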

jschnurr commented 6 years ago

Closed due to inactivity.

vidakDK commented 6 years ago

@jschnurr Well Scrapyscript combines the results of multiple calls into a single return object, which I then wanted to store in the database, in order to split up the scraping and database actions in the code.

However, Scrapyscript was unable to end jobs when the returned data was tied to outside objects, so I added database-storage logic to each call and removed all data from Scrapyscript's return path.

I would either try to change the logic around those Twisted calls to make deep copies of the data, so that the process can be killed normally, or just remove the return functionality.

Update: I don't have time to test this now, but the problem could be related to the use of Scrapy pipelines. If we take the example from the project README but add some basic code in pipelines.py that returns the data, instead of returning it from the spider itself, do we get the hanging process or does it work normally?

DanielMenke commented 5 years ago

@jschnurr, @vidakDK

I can confirm this issue, and have also found that it has something to do with the size of the data returned by processor.run(job). I send all scraped data through a pipeline, which maps it to a database and then returns the converted data to the main process. If I remove the last return statement in my pipeline, the problem disappears. In addition, the problem only occurs when I scrape larger amounts of data; it seems to become critical (on my machine) when the size of the returned data exceeds roughly 50,000 bytes. That is a perfectly acceptable reason to fail, but it would be nice to get an exception when it happens.
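A sketch of the change described (the pipeline class and mapping helper are hypothetical): dropping the final return keeps the converted data out of scrapyscript's results queue, at the cost of Scrapy's usual expectation that process_item returns an item.

class DbMappingPipeline:
    def process_item(self, item, spider):
        record = map_to_db_model(item)  # placeholder for the mapping step
        record.save()
        # Returning `record` here would send the converted data back to the
        # parent process via the results queue; omitting the return avoided
        # the hang in the report above (though Scrapy normally expects
        # process_item to return an item or raise DropItem).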

jschnurr commented 3 years ago

@vidakDK this does seem to happen when data returned from Spiders is bound to Scrapy objects, as you suggested. As a simple example, returning scrapy.Item() objects from a basic Spider hangs, while returning a dict with the same data does not.

For now, I recommend returning dicts from Spiders to work around the problem.
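The comparison described above, sketched as a minimal spider (item and spider names are made up):

import scrapy

class ExampleItem(scrapy.Item):
    title = scrapy.Field()

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["http://www.python.org"]

    def parse(self, response):
        title = response.xpath("//title/text()").get()
        # return ExampleItem(title=title)  # hangs, per the observation above
        return {"title": title}            # completes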

pgalilea commented 3 years ago

Same problem here when I try to run from a Flask endpoint. It hangs forever in Processor().run(), whether or not I return values (a dict) from the scraper. As a standalone script, it works.

covuworie commented 2 years ago

I have encountered the same problem as others here. I took a copy of the code and found that it hangs indefinitely either when enqueuing or dequeuing results, e.g. on self.results.put(self.items) or return self.results.get(). I have seen that this code is in billiard's queues.

As others have noted, it hangs both when a scrapy.Item and when a dict is returned from the parse method in Scrapy, so @vidakDK the workaround you mentioned doesn't work in all situations. I think I saw a deadlock exception in my IDE at some point, but I no longer see it and the code just hangs.

Also, as someone noted above, it has something to do with the size of the data returned: there is a threshold size below which the run completes, regardless of whether a scrapy.Item or a dict is returned from the parse method.

One "workaround" that did work consistently for me when using scrapy Items was to update the scrapyscript code to convert the item to a dict in the _item_scraped method instead: self.items.append(dict(item)). The code does not hang with this change. Of course the change is not ideal, as the expected return type changes from a scrapy Item to a dict.
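That local change, sketched against scrapyscript's item_scraped signal handler (the exact signature may differ between versions):

def _item_scraped(self, item, response, spider):
    # Convert the Scrapy Item to a plain dict before queuing it, so only
    # builtin, picklable types cross the process boundary.
    self.items.append(dict(item))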

No idea why the code does not hang with this change but does hang if a dict is returned directly from the parse method in Scrapy. Maybe others have an idea of what is going on? I feel this is a critical defect and I'm surprised it hasn't been fixed in all these years.

covuworie commented 2 years ago

I just had a look at this again and think I've got a fix @vidakDK. The problem seems to be a deadlock. The python documentation states:

"that a process that has put items in a queue will wait before terminating until all the buffered items are fed by the “feeder” thread to the underlying pipe"

and further says:

"This means that whenever you use a queue you need to make sure that all items which have been put on the queue will eventually be removed before the process is joined"

and then gives the following example of a deadlock:

from multiprocessing import Process, Queue

def f(q):
    q.put('X' * 1000000)

if __name__ == '__main__':
    queue = Queue()
    p = Process(target=f, args=(queue,))
    p.start()
    p.join()                    # this deadlocks
    obj = queue.get()

See Joining processes that use queues.

This is the same situation as in the run method code. Reordering the code as below allows it to run to completion and return the results:

p = Process(target=self._crawl, args=[jobs])
p.start()
# p.join()
# p.terminate()

res = self.results.get()
p.join()
p.terminate()
return res

The code needs to be tested with multiple jobs as I was not looking at that scenario.

More information about the deadlock is given in the second warning box (the pink one) in the pipes-and-queues section of the Python documentation. It states that an alternative is to use a manager, since a queue created with a manager does not have this problem.
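For completeness, the manager-based alternative mentioned in those docs removes the deadlock in the standalone example, because a managed queue has no per-process feeder thread (a sketch, not scrapyscript code):

from multiprocessing import Process, Manager

def f(q):
    q.put('X' * 1000000)

if __name__ == '__main__':
    with Manager() as manager:
        queue = manager.Queue()
        p = Process(target=f, args=(queue,))
        p.start()
        p.join()   # safe to join before get() when the queue is managed
        obj = queue.get()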

As an aside, the data size issue seems like it is related to the size of the buffer as described in this stackoverflow thread.

christosmito commented 2 years ago

@covuworie Could you provide a more detailed explanation of how you managed to make the code work? I have built a scraper for Google Patents to extract some fields, and when the description field is big I get the hang-forever situation. My code is this:

import logging

import scrapy
from scrapyscript import Job, Processor

processor = Processor(settings=None)

class PythonSpider(scrapy.spiders.Spider):
    name = "myspider"

    def start_requests(self):
        yield scrapy.Request(self.url)

    def parse(self, response):
        data = str(''.join(response.xpath('//section[@itemprop="description"]/descendant::*/text()').getall()).replace('\n', '').replace('  ', '').strip())
        return {'title': data}

# For this one it terminates normally:
# https://patents.google.com/patent/EP2348838B1/en?oq=EP2348838B1
job = Job(PythonSpider, url="https://patents.google.com/patent/US20210272010A1/en?oq=US20210272010A1")
results = processor.run([job])

print(results)

covuworie commented 2 years ago

@christosmito I thought I had provided a detailed explanation already 😃. There are a few ways to fix this according to the Python documentation, so I didn't open a pull request, as @vidakDK may want to fix it in a different way.

The main issue is that all the items need to be removed from the queue before the process is joined; otherwise the code deadlocks. So the run method in __init__.py should actually read like this instead:

def run(self, jobs):
    """Start the Scrapy engine, and execute all jobs.  Return consolidated results
    in a single list.
    Parms:
      jobs ([Job]) - one or more Job objects to be processed.
    Returns:
      List of objects yielded by the spiders after all jobs have run.
    """
    if not isinstance(jobs, collections.abc.Iterable):
        jobs = [jobs]
    self.validate(jobs)

    p = Process(target=self._crawl, args=[jobs])
    p.start()
    res = self.results.get()  # This line has moved from the bottom of the method. This is the only change.
    p.join()
    p.terminate()

    return res

Take a copy of the code, try this change and see if it works in your scenario also. It worked for me with more complex scrapy Items (and even with simpler dicts) and a lot of data.

vidakDK commented 2 years ago

@covuworie I think you found the source of the problem and the easiest solution. Your idea also resolved my scenario, so I made a PR with your proposed change.

jschnurr commented 2 years ago

@covuworie I was able to write a failing test based on your findings: https://github.com/jschnurr/scrapyscript/blob/f6eb381c46239ed887fb3b4935cfaf016c7ebd76/tests/test_scrapyscript.py#L112
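A test along those lines might look something like this (a sketch under assumptions, not the actual test in the linked file):

import scrapy
from scrapyscript import Job, Processor

class BigPayloadSpider(scrapy.Spider):
    name = "bigpayload"
    start_urls = ["http://www.python.org"]

    def parse(self, response):
        # Return a payload well above the pipe buffer size, so the old
        # join-before-get ordering would deadlock.
        return {"blob": "X" * 1_000_000}

def test_large_results_are_returned():
    results = Processor(settings=None).run(Job(BigPayloadSpider))
    assert len(results) == 1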

I'll merge the PR and update the package shortly.