Closed JoeJasinski closed 2 years ago
+1
@JoeJasinski, I was not able to reproduce the problem. However, I've made some other changes based on your feedback; can you please review v1.0.0 and post a complete, non-working example if there is still an issue?
@jschnurr I have the same issue - scrapy job hangs forever on the process join line, here:
.env/lib/python3.6/site-packages/billiard/popen_fork.py in wait(self, timeout)
55 return None
56 # This shouldn't block if wait() returned successfully.
---> 57 return self.poll(os.WNOHANG if timeout == 0.0 else 0)
58 return self.returncode
59
This happens only when I include creation of an HTMLParser for some reason, I assume that some memory is not being released or something.
Here's the offending code that makes the process hang:
def _parse_historical_data(self, response):
    data = response.xpath('//tr[@class="text-right"]').extract()
    hist_data = []
    if data:
        for item in data:
            p = HistoricalDataHTMLParser()
            p.feed(item)
            hist_data.append(p.tds)
            # p.close()
            # del p
    return hist_data
class HistoricalDataHTMLParser(HTMLParser):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.in_td = False
        self.tds = []

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self.in_td = True

    def handle_data(self, data):
        if self.in_td:
            self.tds.append(data.strip())

    def handle_endtag(self, tag):
        self.in_td = False
Calling the _parse_historical_data method causes the above hang. The commented-out close() and del lines don't help.
Note: I've had the same problem in another scraper, where a PDF parser was involved that uses the pdfminer.six library.
UPDATE: I believe the problem lies in returning the scraped data. If, for any reason, the returned data is bound to Scrapy objects, the memory will not be released and the sub-process will never be joined. I solved the problem by simply storing the scraped data to the database inside Scrapy, instead of returning the data through the "run" method of the Job class.
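The approach described here (persist inside the Scrapy process, return nothing) can be sketched with just the standard library. The sqlite3 usage and the store_item helper are hypothetical illustrations, not from the original post; the point is that nothing bound to Scrapy objects crosses the process boundary:

```python
import sqlite3

# Hypothetical sketch: store each scraped item from within the Scrapy
# process (e.g. from an item pipeline) rather than returning it.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE items (title TEXT)')

def store_item(item):
    # By this point, item is assumed to be a plain dict.
    conn.execute('INSERT INTO items (title) VALUES (?)', (item['title'],))
    conn.commit()
    return None  # nothing is sent back through the result queue

store_item({'title': 'example'})
count = conn.execute('SELECT COUNT(*) FROM items').fetchone()[0]
print(count)  # 1
```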
@vidakDK what were you trying to return?
This rather contrived example exhibits the same symptom; the spider runs to completion (logging INFO: Spider closed (finished)), but never exits:
class TestBadSpider(Spider):
    name = 'badspider'

    def start_requests(self):
        yield scrapy.Request('http://www.python.org')

    def parse(self, response):
        return {'payload': response}
The same thing happens using Nosetests and Scrapyscript, but not with scrapy runspider.
What I don't understand is why we would ever get into this situation; why are you returning data bound to Scrapy objects?
Closed due to inactivity.
@jschnurr Well Scrapyscript combines the results of multiple calls into a single return object, which I then wanted to store in the database, in order to split up the scraping and database actions in the code.
However, Scrapyscript was unable to end jobs when its response was tied to outside objects, so I added database storing logic to each call, and removed all data from "return" option of Scrapyscript.
I would either try to edit the logic of those Twisted calls to make deep copies of the data, which would allow the process to be killed normally, or just remove the return functionality.
Update: I don't have the time to test this now, but the problem could be related to using Scrapy pipelines. If we test the example from the project readme file, but instead add some basic code in pipelines.py that returns the data instead of the spider itself, do we get the hanging process problem or does it work normally?
@jschnurr, @vidakDK
I can confirm this issue and have also observed that it has something to do with the size of the data that is returned by processor.run(job).
I send all scraped data through a pipeline, which maps it to a database and then returns this converted data to the main process. If I remove the last return statement in my pipeline, the problem disappears.
In addition, the problem only occurs when I scrape larger amounts of data. It seems to become critical (on my machine) when the size of the returned data exceeds ~50000 bytes. Failing in that situation may be acceptable, but it would be nice to get an exception when it happens.
@vidakDK this does seem to happen when data returned from Spiders is bound to Scrapy objects, as you suggested. As a simple example, returning scrapy.Item() objects from a basic Spider hangs, while returning a dict with the same data does not.
For now, I recommend returning dicts from Spiders to work around the problem.
Same problem here when I try to run from a Flask endpoint. It hangs forever in Processor().run(), whether or not I return values (a dict) from the scraper. As a standalone script, it works.
I have encountered the same problem as others here. I took a copy of the code and found that it hangs indefinitely either when enqueuing results or dequeuing results, e.g. on self.results.put(self.items) or return self.results.get(). I have seen that this code is in billiard queues.
As others have noted, it hangs whether a scrapy.Item or a dict is returned from the parse method in scrapy. So @vidakDK, the workaround you mentioned doesn't work in all situations. I think I saw a deadlock exception in my IDE at some point, but I no longer see it and the code just hangs.
Also, as noted above, it is something to do with the size of the data returned. There is a threshold data size below which the run completes, irrespective of whether a scrapy Item or a dict is returned from the parse method.
One "workaround" which did work consistently for me when using scrapy Items was to update the scrapyscript code to convert the item to a dict in the _item_scraped method instead: self.items.append(dict(item)). The code does not hang with this change. Of course, this change is not ideal, as the expected return type changes from a scrapy Item to a dict.
I have no idea why the code does not hang with this change, yet does hang if a dict is returned directly from the parse method in scrapy. Maybe others have an idea what is going on? I feel that this is a critical defect, and I'm surprised it hasn't been fixed for many years.
I just had a look at this again and think I've got a fix @vidakDK. The problem seems to be a deadlock. The Python documentation states:
"that a process that has put items in a queue will wait before terminating until all the buffered items are fed by the “feeder” thread to the underlying pipe"
and further says:
"This means that whenever you use a queue you need to make sure that all items which have been put on the queue will eventually be removed before the process is joined"
and then gives the following example of a deadlock:
from multiprocessing import Process, Queue

def f(q):
    q.put('X' * 1000000)

if __name__ == '__main__':
    queue = Queue()
    p = Process(target=f, args=(queue,))
    p.start()
    p.join()  # this deadlocks
    obj = queue.get()
See Joining processes that use queues.
This is the same situation as in the run method code. Switching the code around to read like below allows the code to run to completion, returning the results:
p = Process(target=self._crawl, args=[jobs])
p.start()
# p.join()
# p.terminate()
res = self.results.get()
p.join()
p.terminate()
return res
The code needs to be tested with multiple jobs as I was not looking at that scenario.
More information about the deadlock situation is given in the second pink box in the Pipes and Queues Python documentation. It states that an alternative is to use a manager, as a manager queue does not have the same problem.
As an aside, the data size issue seems to be related to the size of the pipe buffer, as described in this Stack Overflow thread.
@covuworie Could you provide a more detailed explanation of how to make the code work? I have built a scraper for Google Patents to extract some fields, and when the description field is big, it hangs forever. My code is this:

import logging

import scrapy
from scrapyscript import Job, Processor

processor = Processor(settings=None)

class PythonSpider(scrapy.spiders.Spider):
    name = "myspider"

    def start_requests(self):
        yield scrapy.Request(self.url)

    def parse(self, response):
        data = str(''.join(response.xpath('//section[@itemprop="description"]/descendant::*/text()').getall()).replace('\n', '').replace(' ', '').strip())
        return {'title': data}

job = Job(PythonSpider, url="https://patents.google.com/patent/US20210272010A1/en?oq=US20210272010A1")
results = processor.run([job])
@christosmito I thought I had provided a detailed explanation already 😃. There are a few ways to fix this according to the Python documentation, so I didn't provide a pull request, as @vidakDK may want to fix it in a different way.
The main issue is that all the items need to be removed from the queue before the process gets joined, otherwise the code deadlocks. So the run method in __init__.py should actually read like this instead:
def run(self, jobs):
    """Start the Scrapy engine, and execute all jobs. Return consolidated results
    in a single list.

    Parms:
      jobs ([Job]) - one or more Job objects to be processed.

    Returns:
      List of objects yielded by the spiders after all jobs have run.
    """
    if not isinstance(jobs, collections.abc.Iterable):
        jobs = [jobs]
    self.validate(jobs)

    p = Process(target=self._crawl, args=[jobs])
    p.start()
    res = self.results.get()  # This line has moved from the bottom of the method. This is the only change.
    p.join()
    p.terminate()

    return res
Take a copy of the code, try this change and see if it works in your scenario also. It worked for me with more complex scrapy Items (and even with simpler dicts) and a lot of data.
@covuworie I think you found the source of the problem and the easiest solution. Your idea also resolved my scenario, so I made a PR with your proposed change.
@covuworie I was able to write a failing test based on your findings: https://github.com/jschnurr/scrapyscript/blob/f6eb381c46239ed887fb3b4935cfaf016c7ebd76/tests/test_scrapyscript.py#L112
I'll merge the PR and update the package shortly.
Thank you for writing scrapyscript; it's been very helpful!
However, I have a script that looks something like the below, written in Python 3.5. I noticed that when I run Process().run() as below, the spider runs to completion, but hangs after the scrape is done (which causes my Celery 4 jobs to hang and never finish).
I verified that in the scrapyscript code, in the run() method, processing makes it through p.start(), p.join(), p.terminate(), but hangs on the return statement. I noticed that the return statement is looking up results in a Queue. If I comment out the return statement (I personally don't care about the returned results), the processing finishes.
Impacted area: https://github.com/jschnurr/scrapyscript/blob/master/scrapyscript.py#L118
(As an aside, I also overrode the __init__() method in my spider to accept arguments, but noticed that I have to pass the custom arguments both to MySpider and as the payload; maybe I should open a different ticket for that?)