my8100 / logparser

A tool for parsing Scrapy log files periodically and incrementally, extending the HTTP JSON API of Scrapyd.

fix MemoryError #6

Closed sulthonzh closed 5 years ago

sulthonzh commented 5 years ago

Use json.dump to serialize directly to a file stream.

my8100 commented 5 years ago

What's the difference?

sulthonzh commented 5 years ago

This method serializes the JSON directly to the file stream, so it doesn't hold the whole JSON string in a variable in memory. (screenshot attached)
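For illustration, a minimal sketch of the change (save_stats and its arguments are placeholder names, not logparser's actual code):

import json

def save_stats(data, path):
    # json.dumps would build the complete JSON string in memory first:
    #     with open(path, "w") as f:
    #         f.write(json.dumps(data))
    # json.dump serializes straight into the file object instead,
    # so no full-size JSON string is held alongside the data object.
    with open(path, "w") as f:
        json.dump(data, f)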

my8100 commented 5 years ago

But the data object still lives in memory after saving it to a file. Would json.dumps roughly double the memory usage compared with json.dump?
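A rough way to check this is to compare peak allocations with tracemalloc (a sketch with a made-up payload, not a benchmark of logparser itself):

import json
import tracemalloc

data = {"items": [{"id": i, "description": "x" * 100} for i in range(100000)]}

tracemalloc.start()
with open("stats_dumps.json", "w") as f:
    f.write(json.dumps(data))  # whole JSON string built in memory first
dumps_peak = tracemalloc.get_traced_memory()[1]
tracemalloc.stop()

tracemalloc.start()
with open("stats_dump.json", "w") as f:
    json.dump(data, f)  # written out piece by piece
dump_peak = tracemalloc.get_traced_memory()[1]
tracemalloc.stop()

print("json.dumps peak: %.1f MB" % (dumps_peak / 1e6))
print("json.dump peak: %.1f MB" % (dump_peak / 1e6))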

sulthonzh commented 5 years ago

Successfully parsed a 1 GB file with this method. (screenshots attached)

my8100 commented 5 years ago

How could the size of the json file be 1GB? What’s inside it?

sulthonzh commented 5 years ago

Scraper results with huge items and descriptions, lol.

my8100 commented 5 years ago

But logparser only keeps the latest item, so how could the json file get that big? Could you inspect the json file to find out the true cause?

sulthonzh commented 5 years ago

Only product items, like this: (screenshot attached)

my8100 commented 5 years ago

It seems that there are too many warning logs about dropped items, which makes the json file as big as the original log file. Maybe you could log such messages at the INFO level instead, or handle the missing price of scraped items in a better way.
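For example, in Scrapy the level of the dropped-item message can be lowered with a custom LogFormatter (a sketch; QuietDropFormatter and the settings path are placeholder names):

import logging

from scrapy import logformatter

class QuietDropFormatter(logformatter.LogFormatter):
    # Scrapy logs "Dropped: ..." at WARNING by default; returning INFO here
    # keeps those lines out of logparser's warning_logs category.
    def dropped(self, item, exception, response, spider):
        entry = super().dropped(item, exception, response, spider)
        entry["level"] = logging.INFO
        return entry

# settings.py
# LOG_FORMATTER = "myproject.logformatters.QuietDropFormatter"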

Furthermore, it doesn't make sense to generate such a large json file, which burdens both RAM and bandwidth. So for the time being, I won't merge this PR, as it's not a common case. A switch for collecting warning logs may be added in a future release.

my8100 commented 5 years ago

Added the LOG_CATEGORIES_LIMIT option in https://github.com/my8100/logparser/commit/349613ef7885bcfeaa94235f527e51f3be0db7dc:

# Keep only the last N logs of each item (e.g. critical_logs) in log_categories.
# The default is 10, set it to 0 to keep all.
LOG_CATEGORIES_LIMIT = 10
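In effect the limit amounts to something like the following truncation (an assumed sketch, not the actual implementation):

def truncate_category(logs, limit=10):
    # Keep only the last `limit` entries; 0 means keep everything.
    return logs if limit == 0 else logs[-limit:]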