scrapinghub / frontera

A scalable frontier for web crawlers
BSD 3-Clause "New" or "Revised" License

HBaseState flush when frontier stop #310

Open clarksun opened 6 years ago

clarksun commented 6 years ago

With default settings, the flush method in HBaseState is invoked every 5 minutes to save the cached state into the meta table. This PR adds a frontier_stop method so that state held in the in-memory cache dict is not lost when an SW (strategy worker) instance is terminated.
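A minimal sketch of the idea, not the actual patch: `InMemoryStateCache`, `set_state`, and the plain dict standing in for the HBase meta table are all hypothetical names chosen for illustration. It only shows why a final flush in `frontier_stop` keeps entries written since the last periodic flush.

```python
class InMemoryStateCache:
    """Hypothetical stand-in for HBaseState; not Frontera's actual class."""

    def __init__(self, store):
        self._store = store  # plain dict standing in for the HBase meta table
        self._cache = {}     # fingerprint -> state, held only in memory

    def set_state(self, fingerprint, state):
        # Writes land in the memory cache first and are persisted on flush.
        self._cache[fingerprint] = state

    def flush(self):
        # Persist everything cached so far, then empty the memory cache.
        self._store.update(self._cache)
        self._cache.clear()

    def frontier_stop(self):
        # Final flush on shutdown, so nothing waits for the next periodic flush.
        self.flush()


table = {}
states = InMemoryStateCache(table)
states.set_state("fp1", 2)   # cached, not yet persisted
pending_before_stop = dict(table)
states.frontier_stop()       # shutdown path persists the pending entry
```

Without the call to `frontier_stop`, the entry for `"fp1"` would exist only in the cache dict and would be lost if the process exited before the next periodic flush.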

sibiryakov commented 6 years ago

Good catch @clarksun! I'm going to merge it.

sibiryakov commented 6 years ago

I have checked this code again, @clarksun, and found that the SW was already running the flush when stopping: https://github.com/scrapinghub/frontera/blob/master/frontera/worker/strategy.py#L291. If we apply your patch, it will flush twice. I propose removing the state cache flush on stop from the SW and making sure frontier_stop is executed instead.
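The double flush can be illustrated with a small sketch. `CountingStates`, `Worker`, and the `flush_on_stop` switch are hypothetical and do not mirror the real strategy worker code; they only count how many times the state cache is flushed during shutdown in each arrangement.

```python
class CountingStates:
    """Hypothetical state cache that only counts flush calls."""

    def __init__(self):
        self.flush_calls = 0

    def flush(self):
        self.flush_calls += 1

    def frontier_stop(self):
        # As in the patch under discussion: stopping triggers a flush.
        self.flush()


class Worker:
    """Hypothetical worker; flush_on_stop models the worker's own flush."""

    def __init__(self, states, flush_on_stop):
        self.states = states
        self.flush_on_stop = flush_on_stop

    def stop(self):
        if self.flush_on_stop:
            self.states.flush()      # worker-side flush on shutdown
        self.states.frontier_stop()  # flush performed by the state cache itself


def flushes_on_stop(flush_on_stop):
    # Run one shutdown and report how many flushes it caused.
    states = CountingStates()
    Worker(states, flush_on_stop).stop()
    return states.flush_calls
```

In this sketch, keeping the worker-side flush alongside `frontier_stop` yields two flushes per shutdown, while dropping it and relying on `frontier_stop` alone yields one.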