simsong / bulk_extractor

This is the development tree. Production downloads are at:
https://github.com/simsong/bulk_extractor/releases

parallelize image_iterator #294

Open simsong opened 2 years ago

simsong commented 2 years ago

We're going to need to be able to parallelize the producer. The issue isn't I/O, it's the CPU-heavy stuff that accompanies reading disk images, which is mostly decompression of E01 files.

Unfortunately the libewf library is not thread-safe. So one approach is to have one iterator that does even pages or files and another that does odd ones. Two should be enough, but this approach obviously scales to more iterators.
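A minimal sketch of the even/odd idea, generalized to N readers. `PageReader` and `read_striped` are hypothetical names, not bulk_extractor or libewf APIs, and the "read" is a placeholder computation. The only point is that each thread owns a private handle and takes every N-th page, so no handle is ever shared.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical stand-in for a per-thread image handle (for example, one
// separately opened E01 handle per thread). Not a real bulk_extractor type.
struct PageReader {
    // Placeholder for "read and decompress one page"; returns a fake value.
    std::uint64_t read_page(std::size_t page) const {
        return static_cast<std::uint64_t>(page) * 31 + 7;
    }
};

// Thread `id` handles pages id, id + num_readers, id + 2*num_readers, ...
// With num_readers == 2 this is exactly the even/odd split.
std::vector<std::uint64_t> read_striped(std::size_t num_pages, std::size_t num_readers) {
    assert(num_readers > 0);
    std::vector<std::uint64_t> out(num_pages);
    std::vector<std::thread> threads;
    for (std::size_t id = 0; id < num_readers; ++id) {
        threads.emplace_back([&out, num_pages, num_readers, id] {
            PageReader reader;  // private handle, never shared between threads
            for (std::size_t page = id; page < num_pages; page += num_readers) {
                out[page] = reader.read_page(page);  // distinct slots, so no lock needed
            }
        });
    }
    for (auto& t : threads) t.join();
    return out;
}
```

Calling `read_striped(10, 2)` would run one thread over pages 0, 2, 4, ... and another over pages 1, 3, 5, .... The split is static, so one slow stripe can leave the other thread idle.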

For better scaling, we could have a second threadpool and have the worker load it with the addresses of the sbufs to process and have threads that read the data and then send it to the next work queue.
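The two-stage arrangement described above could look roughly like the following sketch. Everything here is an assumption made for illustration: `WorkQueue`, `Buffer` (standing in for an sbuf), and `run_pipeline` are invented names, and the read and scan steps are placeholders. Reader threads pull addresses from one queue and push filled buffers onto a second queue that scanner threads drain.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Minimal closable blocking queue (illustrative, unbounded).
template <typename T>
class WorkQueue {
public:
    void push(T item) {
        {
            std::lock_guard<std::mutex> lock(m_);
            q_.push(std::move(item));
        }
        cv_.notify_one();
    }
    // Signal that no more items will arrive.
    void close() {
        {
            std::lock_guard<std::mutex> lock(m_);
            closed_ = true;
        }
        cv_.notify_all();
    }
    // Blocks until an item is available; returns nullopt once closed and drained.
    std::optional<T> pop() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return closed_ || !q_.empty(); });
        if (q_.empty()) return std::nullopt;
        T item = std::move(q_.front());
        q_.pop();
        return item;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<T> q_;
    bool closed_ = false;
};

// Hypothetical buffer standing in for an sbuf.
struct Buffer {
    std::size_t offset;
    std::vector<unsigned char> data;
};

// Stage 1 readers turn offsets into buffers; stage 2 scanners consume them.
// Returns the total number of bytes the scanners saw.
std::size_t run_pipeline(std::size_t num_pages, std::size_t page_size,
                         std::size_t n_readers, std::size_t n_scanners) {
    assert(n_readers > 0 && n_scanners > 0);
    WorkQueue<std::size_t> offsets;
    WorkQueue<Buffer> buffers;
    std::atomic<std::size_t> scanned{0};

    std::vector<std::thread> readers, scanners;
    for (std::size_t i = 0; i < n_readers; ++i)
        readers.emplace_back([&offsets, &buffers, page_size] {
            while (auto off = offsets.pop())  // placeholder for read + decompress
                buffers.push(Buffer{*off, std::vector<unsigned char>(page_size)});
        });
    for (std::size_t i = 0; i < n_scanners; ++i)
        scanners.emplace_back([&buffers, &scanned] {
            while (auto buf = buffers.pop())  // placeholder for running scanners
                scanned += buf->data.size();
        });

    for (std::size_t p = 0; p < num_pages; ++p) offsets.push(p * page_size);
    offsets.close();
    for (auto& t : readers) t.join();
    buffers.close();  // safe only after every reader has finished pushing
    for (auto& t : scanners) t.join();
    return scanned.load();
}
```

The queue here is unbounded to keep the sketch short. A real producer would probably cap the buffer queue so readers cannot run arbitrarily far ahead of the scanners.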

jonstewart commented 2 years ago

Since the thread pool is generic—it executes functors—could you put both reading and scanning tasks onto the same threadpool? That way (with likely the need for a bit of smarts about generating the reading tasks) the threadpool could balance reading and scanning.
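To make the single-pool suggestion concrete, here is a minimal sketch of a generic pool whose tasks are arbitrary functors, with each "read" task submitting its own "scan" task back to the same pool. `FunctorPool` and `read_then_scan` are hypothetical names and this is not bulk_extractor's thread pool. The read and scan bodies are placeholders.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Minimal generic pool (illustrative only): tasks may submit further tasks.
class FunctorPool {
public:
    explicit FunctorPool(std::size_t n_threads) {
        for (std::size_t i = 0; i < n_threads; ++i)
            workers_.emplace_back([this] { run(); });
    }
    ~FunctorPool() {
        {
            std::lock_guard<std::mutex> lock(m_);
            stopping_ = true;
        }
        work_cv_.notify_all();
        for (auto& t : workers_) t.join();
    }
    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(m_);
            tasks_.push(std::move(task));
            ++pending_;
        }
        work_cv_.notify_one();
    }
    // Blocks until every submitted task, including nested ones, has finished.
    void wait_idle() {
        std::unique_lock<std::mutex> lock(m_);
        idle_cv_.wait(lock, [this] { return pending_ == 0; });
    }
private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(m_);
                work_cv_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // stopping and drained
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // a nested submit() raises pending_ before the decrement below
            {
                std::lock_guard<std::mutex> lock(m_);
                if (--pending_ == 0) idle_cv_.notify_all();
            }
        }
    }
    std::mutex m_;
    std::condition_variable work_cv_, idle_cv_;
    std::queue<std::function<void()>> tasks_;
    std::vector<std::thread> workers_;
    std::size_t pending_ = 0;
    bool stopping_ = false;
};

// Each read task (placeholder) schedules a scan task for the data it produced.
// Returns the total number of bytes scanned.
std::size_t read_then_scan(std::size_t num_pages, std::size_t n_threads) {
    assert(n_threads > 0);
    std::atomic<std::size_t> scanned{0};
    FunctorPool pool(n_threads);
    for (std::size_t page = 0; page < num_pages; ++page) {
        pool.submit([&pool, &scanned, page] {
            std::vector<unsigned char> data(512, static_cast<unsigned char>(page));  // "read"
            pool.submit([&scanned, data = std::move(data)] { scanned += data.size(); });  // "scan"
        });
    }
    pool.wait_idle();
    return scanned.load();
}
```

This FIFO version has none of the "smarts" mentioned above: all read tasks are queued up front, so nothing prevents reads from running well ahead of scans. Balancing the two would need something extra, such as a cap on the number of outstanding read tasks.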

On Nov 24, 2021, at 8:35 AM, Simson L. Garfinkel @.***> wrote:

We're going to need to be able to parallelize the producer. This is easy to do for the raw reader and the directory reader, but not for the E01 reader, which will require multiple open files. We can then have an iterator that does even pages or files and one that does odd ones. Two should be enough, but this approach scales. ... Hm. We could have a second threadpool and have the worker load it with the addresses of the sbufs to process and have threads that read the data and then send it to the next work queue.


simsong commented 2 years ago

The old threadpool was generic, but the new one is not. Right now it just takes sbufs, but it's going to take hard-coded work structures. We can modify it to use functors, but this was easier. The previous threadpool had problems, so I just forward-ported the old BE1.6 threadpool.

jonstewart commented 2 years ago

Ah, I hadn’t noticed that you’d outright replaced it.


simsong commented 2 years ago

I wanted to compare the two, but mine is now instrumented and significantly superior in this application.