scrapy / parsel

Parsel lets you extract data from XML/HTML documents using XPath or CSS selectors
BSD 3-Clause "New" or "Revised" License

selector create_root_node memory issues #210

Closed GeorgeA92 closed 1 year ago

GeorgeA92 commented 3 years ago

I have concerns about the memory efficiency of the create_root_node function used to create the root lxml.etree for a Selector:

https://github.com/scrapy/parsel/blob/7ed4b24a9b8ef874c644c6cec01654539bc66cc3/parsel/selector.py#L47-L55 And especially this line: https://github.com/scrapy/parsel/blob/7ed4b24a9b8ef874c644c6cec01654539bc66cc3/parsel/selector.py#L50
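For context, the linked function looks roughly like the simplified sketch below (paraphrased; see the links above for the exact source at that commit). The highlighted line is the one that builds the body variable:

```python
from lxml import etree

def create_root_node(text, parser_cls, base_url=None):
    # the line in question: strip, remove null bytes and re-encode the whole
    # document -- each of these steps can allocate a full copy of the input
    body = text.strip().replace('\x00', '').encode('utf8')
    return etree.fromstring(body, parser=parser_cls(recover=True, encoding='utf8'),
                            base_url=base_url)
```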

Steps to reproduce: I made a Scrapy spider (tested on the Zyte Scrapy Cloud stack, scrapy:2.4). It includes:

spider code

```python
import sys
from importlib import import_module

import scrapy


class MemoryCheckSpider(scrapy.Spider):
    name = 'parsel_memory'

    def get_virtual_size(self):
        size = self.resource.getrusage(self.resource.RUSAGE_SELF).ru_maxrss
        if sys.platform != 'darwin':
            # on macOS ru_maxrss is in bytes, on Linux it is in KB
            size *= 1024
        return size

    def check_memory_usage(self, i):
        self.logger.info(f"used memory on start: {str(self.get_virtual_size())}")
        input = i*1000*1000*10  # should be ~95 megabytes or ~100 000 000 bytes
        self.logger.info(f"size of input: {str(sys.getsizeof(input))}")
        self.logger.info(f"used memory after input: {str(self.get_virtual_size())}")
        output = input.strip().replace('\x00', '').encode('utf8')  # < - checking this code line
        self.logger.info(f"used memory after output: {str(self.get_virtual_size())}")

    def start_requests(self):
        try:
            self.resource = import_module('resource')
        except ImportError:
            pass
        self.check_memory_usage('ten__bytes')
```

self.check_memory_usage('ten__bytes')

9:  2021-02-17 15:43:27 INFO    [parsel_memory] used memory on start: 61562880
10: 2021-02-17 15:43:27 INFO    [parsel_memory] size of input: 100000049
11: 2021-02-17 15:43:27 INFO    [parsel_memory] used memory after input: 169148416
12: 2021-02-17 15:43:27 INFO    [parsel_memory] used memory after output: 259026944

In this case the input contains no whitespace for `.strip()` and no `\x00` for `.replace()`, so both calls return the original string. `.encode('utf8')` then converts it to bytes; in the original create_root_node function this is the allocation behind its `body` variable.

self.check_memory_usage(' ten_bytes')

9:  2021-02-17 15:45:33 INFO    [parsel_memory] used memory on start: 61681664
10: 2021-02-17 15:45:33 INFO    [parsel_memory] size of input: 100000049
11: 2021-02-17 15:45:33 INFO    [parsel_memory] used memory after input: 169275392
12: 2021-02-17 15:45:34 INFO    [parsel_memory] used memory after output: 359186432

The new input starts with a space symbol, so `strip()` creates a new `str`. For immutable data types like `str` or `bytes`, each operation produces a new object and requires a new memory allocation (if the input is not equal to the output). As a result, we can see that allocated memory increased by roughly one input size compared to the previous (`ten__bytes`) check.
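This copy-on-change behaviour of immutable strings is easy to see directly (relying on the CPython implementation detail that unchanged `strip()`/`replace()` calls return the original object):

```python
import sys

s = 'ten__bytes' * 1000            # no leading/trailing whitespace, no null bytes
print(s.strip() is s)              # True  - nothing stripped, original object returned
print(s.replace('\x00', '') is s)  # True  - nothing replaced, no new allocation

t = ' ' + s                        # a single leading space is enough...
u = t.strip()
print(u is t)                      # False - strip() had to build a brand-new str
print(sys.getsizeof(u))            # roughly the full input size, allocated again
```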

Additional note about replace('\x00', '')

When Scrapy creates the response.text str from the response.body bytes to create a Selector object, in practice that converted str can have around ~2x the memory size of response.body, because Python converts single-byte symbols like \x00 into multi-byte unicode symbols:

>>> len(b'\x00')
1
>>> len('\x00')
1
>>> len(str(b'\x00'))
7
>>> len(str(b'\x00\x00'))
11

And of course it also affects memory usage:

>>> from sys import getsizeof
>>> getsizeof(b'\x00'*1000)
1033
>>> getsizeof(str(b'\x00'*1000))
4052
However:
>>> getsizeof(str(b'\x00'*1000, encoding='utf8'))
1049

Considering that response.body is around ~5x the original response size after decompression in the httpcompression middleware, this means the response.text used for creating the selector (and passed as text to the create_root_node function) is a str whose memory size is around ~10 times larger than the original HTTP response.

And finally, the bytes output after replace('\x00', '') that is used to call lxml.etree.fromstring looks just like the original response.body bytes right after the httpcompression middleware.

Is it really necessary to convert the response.body bytes to str (when creating the Selector object) and then convert it back to bytes to create the etree inside create_root_node?

In this case the Selector object should probably accept bytes as input and call lxml.etree.fromstring directly with the response.body argument, without the additional bytes-to-str and str-to-bytes conversions and their memory-intensive consequences.
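For illustration only: lxml itself accepts bytes directly, so if the body were known to be utf-8 already, the round-trip could in principle be skipped (a sketch, not current parsel behaviour):

```python
from lxml import etree, html

body = b"<html><body><p>hello</p></body></html>"   # e.g. response.body, assumed to be utf-8
parser = html.HTMLParser(recover=True, encoding='utf8')
root = etree.fromstring(body, parser=parser)
print(root.xpath('//p/text()'))                    # ['hello']
```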

pawelmhm commented 3 years ago

Great work finding this out @GeorgeA92.

I found out this problematic line was added in this commit: 6fe82c621385cc194182fb525718195eafc4568e. I'm not sure what the reasons were for adding it.

I created an even smaller sample to reproduce it:

```python
# sample.py
from lxml import html

from parsel.selector import create_root_node
import resource

def mem():
    x = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print(x)
    return x

def test_create_root(text):
    parser = html.HTMLParser
    mem()
    result = create_root_node(text, parser)
    mem()
    assert result is not None

test_create_root("i" * 10 * 1000 * 1000 * 10)
```

Now testing with current implementation

```
python sample.py
130668
430176
```

And the test after removing the problematic line:

```
python sample.py
130696
332400
```

So just removing this single line gives us a 20% improvement in memory usage.

But it looks like removing null characters was added here: aa4cc76e7eeeda686c99375f7403846cf2c9399c to address this problem: https://github.com/scrapy/parsel/issues/123

Re:

Is it really necessary to convert response.body bytes to str (when creating Selector object) and again convert it back to bytes to create etree inside create_root_node?

I see that currently Selector takes a string, not bytes; parsel can be used outside Scrapy and its response.body handling. So forcing people to pass bytes would be a breaking change for them.

```python
Selector(b'a')
# TypeError: text argument should be of type <class 'str'>, got <class 'bytes'>
```

Perhaps we could add an API to pass bytes and use that in Scrapy? That way the function would accept both bytes and str, and if it gets a bytes object it wouldn't perform the memory-expensive call to encode(). This might fix our memory usage problems in Scrapy, and for people passing strings in other applications it won't break anything. Or we could force people to pass bytes to parsel, but that seems more drastic. A sketch of this idea follows below.
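A minimal sketch of what such an API could look like, keeping the existing create_root_node signature (this is only an illustration of the idea, not an agreed design):

```python
from lxml import etree

def create_root_node(text, parser_cls, base_url=None):
    # hypothetical: accept both str and bytes
    if isinstance(text, bytes):
        # bytes path: the caller guarantees the body is already utf-8,
        # so no str round-trip and no extra encode() copy is needed
        body = text.replace(b'\x00', b'')
    else:
        # str path: keep the current behaviour for backwards compatibility
        body = text.strip().replace('\x00', '').encode('utf8')
    parser = parser_cls(recover=True, encoding='utf8')
    return etree.fromstring(body, parser=parser, base_url=base_url)
```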

GeorgeA92 commented 3 years ago

@pawelmhm

So just removing this single line gives us 20% improvement in memory usage.

I think the real impact of this is much more than 20%.

At the moment I have the following estimations of allocated memory per Response (body) in Scrapy:

1. from the download handler (received from the twisted side): ~1r (bytes) (1 original response body size, or 1r)
2. from DownloaderStats, see https://github.com/scrapy/scrapy/issues/4964: ~1r (bytes), ~2r in total
3. after the httpcompression downloader middleware, see https://github.com/scrapy/scrapy/issues/4797 (accessible from spider parse methods as response.body): ~5r (bytes), ~7r in total
4. when creating response.text from response.body for creating the selector: ~10r (str), ~17r in total
5. create root node, after .strip(): ~10r (str), ~27r in total
6. create root node, after replace('\x00', '').encode('utf8'): ~5r (bytes), ~32r in total

The result of step 6 is what is used for creating the root node. In total, to process a single HTML response (in the worst case) it is required to allocate ~32 times more memory than the original response body.

If we instead use the result of step 3 to create the root node, we can reduce the amount of allocated memory from 32r (22r) to 7r (to 6r after the fix of https://github.com/scrapy/scrapy/issues/4964).

kmike commented 3 years ago

Hey @GeorgeA92! That's a nice analysis, but I think we should clarify some parts here.

1) We're talking about peak memory usage. get_virtual_size does not return the amount of currently allocated memory; it returns the maximum which the process used over its lifetime, i.e. this number can't decrease.

2) What is good is that Scrapy is single-threaded (well, not exactly, but for the purpose of our discussion it is). So there can be only one create_root_node running at a time, even if Scrapy is downloading many responses in parallel.

3) Python uses ref counts, not only GC. This means that copies of the response body shouldn't survive returning from their functions, and can be deallocated even earlier. E.g. if create_root_node makes 6 copies, and some middleware makes 3 copies, peak usage is expected to be 6 copies or less, not 9, because everything should be deallocated on function exit or earlier, when the reference counter goes to 0. I don't think you can add up allocations happening in different code blocks to compute the peak. There are 2 exceptions to this, where the "32x" number may become true: pypy (which doesn't use ref counts, and relies only on GC), and generators or other code which keeps objects alive.

4) If we pass 100MB input to create_root_node, 350MB used by string functions would be the least of our problems; I'd expect lxml parsed tree to use tens of GB of RAM in this case.

5) We need to do bytes -> str -> bytes conversion because these bytes are not the same bytes. Scrapy detects web page encoding, and then normalizes it to utf8 for lxml, to make it work properly.
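A tiny example of point 5, using a hypothetical non-utf-8 page, shows why the bytes handed to lxml are genuinely different bytes:

```python
# hypothetical cp1251-encoded page
body = '<html><body>привет</body></html>'.encode('cp1251')   # what the server sent (response.body)
text = body.decode('cp1251')                                  # Scrapy: decode using the detected encoding
lxml_input = text.strip().replace('\x00', '').encode('utf8')  # parsel: normalize to utf-8 for lxml
print(body == lxml_input)                                     # False: the utf-8 bytes differ from the original
```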

GeorgeA92 commented 3 years ago

Hello @kmike. Thank you for the feedback.

1.We're talking about peak memory usage. get_virtual_size is not returning amount of currently allocated memory, it returns a maximum which the process used over its lifetime; e.g. this number can't decrease.

I didn't try to calculate peak memory usage. I used get_virtual_size because it is exactly the same as in the MemoryUsage extension. As far as I understand, the result of this function is used to stop spiders on Scrapy Cloud; this is the main reason for using this method in the tests.

5.We need to do bytes -> str -> bytes conversion because these bytes are not the same bytes. Scrapy detects web page encoding, and then normalizes it to utf8 for lxml, to make it work properly.

From my experience, most websites (including... big-name websites) return responses with something like content-type: text/html;charset=UTF-8 in the headers. In this case the bytes -> str -> bytes conversion is not needed because the original input is already utf-8. I don't know the exact percentage of websites that return responses with other encodings.
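A hedged sketch of the kind of shortcut this suggests (the helper name and the header-only charset check are made up here; real encoding detection in Scrapy also considers BOMs and meta tags):

```python
def body_for_lxml(response):
    # hypothetical helper: pass bytes straight through when the declared charset is utf-8
    content_type = response.headers.get(b'Content-Type', b'').decode('latin-1').lower()
    if 'charset=utf-8' in content_type:
        return response.body                # already utf-8: no str round-trip needed
    return response.text.encode('utf8')     # otherwise fall back to the normalizing path
```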

kmike commented 3 years ago

As far as I understand - result of this function is used to stop spiders on scrapy cloud - this is main reason of usage of this method for tests.

I see, this makes sense. Currently the MemoryUsage extension checks peak memory every N seconds, and stops the spider if the peak was larger than a specified amount.

The idea behind the extension is to prevent OOMs, and to stop gracefully. For example, if the hard memory limit is 5GB, Scrapy's limit may be set to 4GB. Let's say Scrapy's consumption grows linearly with the amount of requests scheduled (e.g. because of the dupefilter), and there are regular 1.5GB spikes caused by temporary allocations. If the MemoryUsage extension measured current memory usage, it is unlikely it would see the spike, so it would let memory grow beyond 3.5GB, and then a spike might kill the process. However, if you use max memory, as Scrapy's extension does, the spider would effectively be stopped at 2.5GB, when a spike happens, preventing the OOM.

Obviously, this is not a silver bullet. The next spike can be larger than the spikes seen before, and the buffer between the hard limit and the Scrapy limit may not be enough to account for this. It also has an efficiency issue: we don't use some of the memory which could have been used, in order to keep a "buffer" for future spikes. If you're really, really tight on memory, it might be possible to tune the extension behavior to match the behavior of your spider better. In some cases it may be OK to disable it completely, if you know you can survive the data issues caused by OOMs and need the extra memory.
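For reference, the knobs discussed here are ordinary Scrapy settings (the values below are only illustrative, matching the 5GB / 4GB example above):

```python
# settings.py
MEMUSAGE_ENABLED = True        # turn the MemoryUsage extension on
MEMUSAGE_LIMIT_MB = 4096       # close the spider gracefully once peak memory passes 4 GB
MEMUSAGE_WARNING_MB = 3584     # log a warning (and optionally notify by mail) before the hard stop
```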

In this case conversion bytes -> str -> bytes is not needed because original input is already utf-8.

A good point. It might be possible to add an optimization for this case, though the implementation and testing may be a bit tricky (how do we ensure that we won't break it in the future?). As for the implementation, to me focusing on removing the str -> bytes conversion looks easier than trying to remove the bytes -> str conversion, as Scrapy may be computing response.text anyway, e.g. because some middleware uses it.