sselph / scraper

A scraper for EmulationStation written in Go using hashing
MIT License

Improve RAM management #61

Closed: substring closed this issue 8 years ago

substring commented 8 years ago

On a Pi2 with systems of 1000+ roms, the scraper crashes due to a lack of memory. I'm using 4 workers.

sselph commented 8 years ago

Hmm, I've done a bunch of work to make sure the scraper is fairly efficient, but the one place it may eat a lot of memory is that it doesn't write the XML until it is finished, which means it holds the entire thing in memory. That shouldn't use 1GB of memory, so maybe there is a leak somewhere.

To make sure I understand before running some profiling, you are scraping a system with 1000+ roms in it with -workers 4 and no other flags? Are these console or mame/fba?

sselph commented 8 years ago

Actually, if you can reproduce this, it would be helpful to run the script with the -start_pprof flag.

Then, once it gets to the point where it is starting to leak memory, go to this URL in a browser (you can replace localhost with the IP of the Raspberry Pi if you don't have a GUI and browser on your Pi): http://localhost:8080/debug/pprof/heap?debug=1

This will give you a page with the heap information which will help me pinpoint where the issue is. If you could get that to me along with the version of the scraper you are running I should be able to see exactly where the memory is being used.
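For context, this kind of heap endpoint comes from Go's standard net/http/pprof package; below is a minimal, illustrative sketch of how a flag like -start_pprof is typically wired up (the scaffolding is an assumption, not the scraper's actual code):

```go
package main

import (
	"flag"
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

var startPprof = flag.Bool("start_pprof", false, "serve pprof data on :8080")

func main() {
	flag.Parse()
	if *startPprof {
		go func() {
			// http://<host>:8080/debug/pprof/heap now serves heap profiles.
			log.Println(http.ListenAndServe(":8080", nil))
		}()
	}
	// ... scraping work would continue here ...
	select {} // placeholder to keep this sketch alive
}
```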

sselph commented 8 years ago

I added a change that allocates only a single 4MB buffer per worker for reading files for hashing. Before, it was allocating 4MB per file; that memory was being released and then reallocated each time. I don't know if this will help with memory usage, but it may help performance since a buffer no longer has to be reallocated for every file.

I'll roll this out in a release soon, but I'm still looking forward to a heap profile if you have one.
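For illustration, a minimal sketch of that buffer-reuse pattern (the names and the SHA1 choice here are assumptions, not the scraper's exact code): each worker allocates one 4MB buffer up front and reuses it for every file it hashes:

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
	"io"
	"os"
	"sync"
)

// hashFile reuses buf across calls, so a worker pays for one 4MB
// allocation total instead of one allocation per file.
func hashFile(path string, buf []byte) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()
	h := sha1.New()
	if _, err := io.CopyBuffer(h, f, buf); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

func main() {
	paths := make(chan string)
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ { // e.g. -workers=4
		wg.Add(1)
		go func() {
			defer wg.Done()
			buf := make([]byte, 4<<20) // one 4MB buffer per worker, reused
			for p := range paths {
				if sum, err := hashFile(p, buf); err == nil {
					fmt.Println(p, sum)
				}
			}
		}()
	}
	for _, p := range os.Args[1:] {
		paths <- p
	}
	close(paths)
	wg.Wait()
}
```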

substring commented 8 years ago

Hey! I haven't forgotten you, but I've been quite busy lately and haven't had time yet to get you the proper information. It's still on my radar, since several people have run into this case.

substring commented 8 years ago

Got "bad news": I can't connect to the URL you gave me. I'm not on Raspbian or a similar distro; I'm on Recalbox, which is built from scratch. Running 1.0.8.

My command line: /tmp/scraper -no_thumb=true -max_width=375 -rom_dir=/recalbox/share/roms/snes -output_file=/root/.emulationstation/gamelists/snes/gamelist.xml -workers=4 -image_dir=/root/.emulationstation/downloaded_images/snes -image_path=/root/.emulationstation/downloaded_images/snes -start_pprof

When it crashes, I have barely 4MB of RAM left.

While it's running on my Pi2, I can see it takes about 4MB of RAM per second, scraping 4 roms per second. Before starting the scraper I have 420MB free; the scraper eats everything and crashes.

I used the SNES No-Intro rom set. Not that hard to find ...

What else can I do for you ?

sselph commented 8 years ago

Odd, so you couldn't connect to this address from another machine? http://:8080/debug/pprof/heap

The other option would be to ssh in from a second terminal and run:

$ wget http://localhost:8080/debug/pprof/heap

That would create a file named heap in the folder where you ran it.

I don't have any of the rom sets, but I'll try generating a large set of files with random data, on the order of the number of roms in the SNES No-Intro set, and see if I can recreate this. I'll also try running this on ARM; maybe something is different there.

substring commented 8 years ago

Your first URL was different and didn't work. Now I could get the result: http://pastebin.com/Ss86ivEr

Forgot to mention: we have no swap. But anyway, the scraper ate more than 400MB of RAM, which is quite a lot.

sselph commented 8 years ago

Thanks! Seems to be something with the image processing. I'll do some more investigating and see if I'm using that library incorrectly.

sselph commented 8 years ago

Okay, after taking a look, there isn't any issue with the code. The memory is being allocated to process the JPEGs, and processing 4 full-size images at a time simply takes too much memory. I ran this on my desktop and the memory footprint seems to top out around ~500MB with 4 threads (avg ~125MB per thread).

To help, I added a new flag (I know I already have too many flags): -img_workers. It limits the number of workers allowed to have an image decompressed at the same time. By default it is the same as -workers, but you can set it to 1 or 2 and should be able to stay within the memory limits of the Pi2 while still hashing and downloading information with 4 workers.

This will be in release v1.0.9 shortly.
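A sketch of the idea behind such a cap, using a buffered channel as a semaphore (a common Go pattern; the structure here is an assumption, not the scraper's actual implementation). At most two goroutines hold decoded pixels at once, while any number can keep hashing or downloading:

```go
package main

import (
	"fmt"
	"image"
	_ "image/jpeg" // register JPEG support for image.Decode
	_ "image/png"  // register PNG support
	"os"
	"sync"
)

// imgSem limits how many goroutines may hold a decoded image at once,
// e.g. the equivalent of -img_workers=2 while -workers=4.
var imgSem = make(chan struct{}, 2)

func processImage(path string) error {
	imgSem <- struct{}{}        // acquire a decode slot
	defer func() { <-imgSem }() // release it once the pixels can be dropped

	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()
	img, format, err := image.Decode(f) // raw pixels dwarf the compressed file
	if err != nil {
		return err
	}
	// ... resize to -max_width and re-encode here ...
	b := img.Bounds()
	fmt.Printf("%s: %s %dx%d\n", path, format, b.Dx(), b.Dy())
	return nil
}

func main() {
	var wg sync.WaitGroup
	for _, p := range os.Args[1:] {
		wg.Add(1)
		go func(p string) {
			defer wg.Done()
			if err := processImage(p); err != nil {
				fmt.Fprintln(os.Stderr, err)
			}
		}(p)
	}
	wg.Wait()
}
```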

substring commented 8 years ago

Testing 1.0.9 at the moment. Performance drops by a factor of 10, if not more, when running with a single image worker; the scraper is unusable in that configuration. Two image workers are much better, though I still have to test the memory impact.

I'm really curious about the memory management, whether it's the scraper itself or the image-handling part. I don't have swap on the Recalbox system, but I notice the scraper eats up absolutely all the RAM available. What can take so much space? Images are barely 3MB, or let's say 10MB. A gamelist.xml is not that big, and since it's XML it's easy to append new elements, so there's no need to keep it all in memory. Whereas the JPEGs are written to disk, the XML stays in memory until the scrape is finished. Scraping hundreds of roms with big descriptions will have a big memory impact, so why not write to disk straight away?
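For reference, a minimal sketch of the streaming write substring is suggesting, using encoding/xml's Encoder to flush each entry to disk as it is scraped (the Game struct is an illustrative subset of a gamelist.xml entry, not the scraper's actual schema):

```go
package main

import (
	"encoding/xml"
	"log"
	"os"
)

// Game is an illustrative subset of a gamelist.xml entry.
type Game struct {
	XMLName xml.Name `xml:"game"`
	Name    string   `xml:"name"`
	Desc    string   `xml:"desc"`
	Image   string   `xml:"image,omitempty"`
}

func main() {
	f, err := os.Create("gamelist.xml")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	enc := xml.NewEncoder(f)
	enc.Indent("", "\t")
	root := xml.StartElement{Name: xml.Name{Local: "gameList"}}
	if err := enc.EncodeToken(root); err != nil {
		log.Fatal(err)
	}

	// In a streaming design, each game is encoded as soon as it is
	// scraped, so memory holds one entry at a time, not the whole list.
	for _, g := range []Game{{Name: "Example Game", Desc: "A long description..."}} {
		if err := enc.Encode(g); err != nil {
			log.Fatal(err)
		}
	}

	if err := enc.EncodeToken(root.End()); err != nil {
		log.Fatal(err)
	}
	if err := enc.Flush(); err != nil {
		log.Fatal(err)
	}
}
```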

sselph commented 8 years ago

A high-quality JPEG can be compressed ~15:1 (lower qualities even higher) relative to the raw image, so when it is decompressed a 10MB file takes up ~150MB. There might be more memory-efficient ways to resize a JPEG, but this method works on all image types supported by the Go language with a few lines of code. This is why I recommend using -thumb_only when running on the Raspberry Pi: it downloads a much smaller initial image, so the amount of RAM required to work on it is much less. The other option is to run this on a machine other than a Pi2 and copy the files over; this is what I did. You can set the paths where things are stored locally using the _dir flags, while the _path flags control the paths written to the XML file.
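A rough worked version of that arithmetic (illustrative numbers; Go's JPEG decoder actually produces YCbCr, which is smaller than RGBA, so treat this as an upper bound):

```go
package main

import "fmt"

func main() {
	// Assume a 3000x4000 box scan decoded to RGBA at 4 bytes per pixel.
	const w, h, bytesPerPixel = 3000, 4000, 4
	raw := w * h * bytesPerPixel
	fmt.Printf("decoded in RAM: ~%d MB\n", raw/(1<<20))                 // ~45 MB
	fmt.Printf("on disk at ~15:1: ~%.1f MB\n", float64(raw)/15/(1<<20)) // ~3 MB
	fmt.Printf("4 decodes at once: ~%d MB\n", 4*raw/(1<<20))            // ~183 MB
}
```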

substring commented 8 years ago

Considering the scraper doesn't crash anymore on big rom lists, I guess the solution brought with https://github.com/sselph/scraper/releases/tag/v1.0.9 is the best one.