imperva / incapsula-logs-downloader

A Python script for downloading log files from Incapsula
MIT License

purpose of res.wait(15) in LogsDownloader.py #75

Closed AVitg closed 7 months ago

AVitg commented 7 months ago

Hi @joeymoore, I'm trying to speed up the downloading and file processing a bit. In LogsDownloader.py, in the function start_log_processing(self), I stumbled across these lines:

res = self.pool.apply_async(self.handle_file, (log_file_name,), callback=self.update_index)
res.wait(15)

I'm trying to understand the reason for firing this call asynchronously and then waiting for it. I'm still new to threads/pools in Python, so bear with me if the answer is obvious.

cheers A

joeymoore commented 7 months ago

Hey @AVitg, thinking back, I believe there were some testing scenarios where the thread either hung or took too long waiting for the subprocess that downloads the log. If you're looking to download quicker, I'd recommend increasing the number of workers (processes) allowed; by default the ThreadPool starts one worker per CPU. You can increase this by updating line 65 in LogsDownloader.py with something like pool = ThreadPool(16). This print statement will show how many you have running now: print(f"Processing threads {pool.__getattribute__('_processes')}")
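
For illustration, here is a minimal sketch of what the two lines in question do. It uses a stand-in handle_file that only sleeps (a hypothetical placeholder, not the script's real function), and the helper names are made up for this example. Calling res.wait(15) right after apply_async blocks the submitting loop until that one task finishes or 15 seconds pass, so files end up being handled one at a time regardless of the pool size. Submitting everything first and waiting afterwards lets the workers overlap.

```python
import time
from multiprocessing.pool import ThreadPool


def handle_file(name):
    # Stand-in for the real download/processing work (assumption).
    time.sleep(0.2)
    return name


def submit_and_wait_each(pool, names):
    # Mirrors the pattern from the question: wait right after each submit.
    done = []
    for name in names:
        res = pool.apply_async(handle_file, (name,), callback=done.append)
        res.wait(15)  # blocks until this task finishes or 15 s elapse
    return done


def submit_all_then_wait(pool, names):
    # Alternative: queue every task first, then wait on each result.
    done = []
    results = [
        pool.apply_async(handle_file, (name,), callback=done.append)
        for name in names
    ]
    for res in results:
        res.wait(15)
    return done


def timed(fn, *args):
    # Small helper returning (result, elapsed seconds).
    start = time.perf_counter()
    out = fn(*args)
    return out, time.perf_counter() - start


if __name__ == "__main__":
    names = [f"{i}.log" for i in range(4)]
    with ThreadPool(4) as pool:
        _, serial = timed(submit_and_wait_each, pool, names)
        _, overlapped = timed(submit_all_then_wait, pool, names)
    print(f"wait after each submit: {serial:.2f}s, submit all first: {overlapped:.2f}s")
```

With the first pattern, a larger ThreadPool alone would probably not help much, because only one task is in flight at a time. The timeout still acts as a guard against a hung download, which matches the reason given above.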

AVitg commented 7 months ago

@joeymoore
It does not max out my CPU... ;( Building on the separation you are already doing with "file_watcher.watch_files", I copied LogsDownloader.py to a new LogsProcessor.py (for lazy testing) and made its "main" only watch the process folder for new files. I also removed the file watcher from the init part of the original LogsDownloader.

In my dev logs downloader I also use asyncio and httpx for downloading the files, instead of thread pools. I'm not yet sure if it processes the files faster. However, when combining async (for downloading) and pools (for processing), I always had the feeling that only one or the other is running (but tbh that's probably more down to my coding skills).

So I wondered why keep the downloader and the watcher fighting for CPU/memory, and separated them. As said, my version is more dirty than quick, so I'm not confident sharing it as of now, but I thought you might have a thought on this?
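
A rough sketch of combining async downloads with a pool for processing, under stated assumptions: download and process are hypothetical stand-ins (the download is simulated with asyncio.sleep rather than a real httpx call, and the processing step just sleeps), and none of these names come from the actual script. The idea is that loop.run_in_executor hands the blocking step to a worker thread, so the event loop can keep downloading in the meantime.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


async def download(name):
    # Stand-in for an async HTTP download (e.g. via httpx); simulated here.
    await asyncio.sleep(0.1)
    return name, b"raw log bytes"


def process(name, payload):
    # Stand-in for the blocking processing step of a downloaded file.
    time.sleep(0.1)
    return name, len(payload)


async def download_and_process(name, executor):
    fetched_name, payload = await download(name)
    loop = asyncio.get_running_loop()
    # Hand the blocking step to the pool so the event loop stays free.
    return await loop.run_in_executor(executor, process, fetched_name, payload)


async def main(names, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as executor:
        tasks = [download_and_process(name, executor) for name in names]
        # gather returns results in the same order as the input names.
        return await asyncio.gather(*tasks)


if __name__ == "__main__":
    print(asyncio.run(main([f"{i}.log" for i in range(4)])))
```

If the processing step turns out to be CPU-bound pure Python code, threads share the GIL in standard CPython, so swapping in a ProcessPoolExecutor might be worth a try. That is a guess about the workload, not something measured here.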

Cheers A