simsong / be20_api

API for bulk_extractor version 1.3
Other
12 stars 9 forks source link

Define an object-oriented scanner interface #28

Open jonstewart opened 3 years ago

jonstewart commented 3 years ago

The concept of a scanner was built to be simple—just define a single function!

But, that function must adhere to a certain protocol. It must check the phase and then take appropriate action. These phases and actions correspond to lifetime concerns—initialization, config check, doing work, destruction. This is a good indication that there's an object-oriented design waiting to be defined for scanners, where a new scanner would be a class that inherits from a scanner base class. Inheritance from the base class would likely use the CRTP to provide for a common factory/construction method.

A wrinkle of lifetime management in a multithreaded processing application is that the first scanner may be initialized for the purpose of one-time setup and/or configuration/command-line checking, but then a scanner object of each type may be created for each thread in the thread pool. Typically this secondary initialization can be done via a clone() method that takes a "prototype" instance and then simply invokes the copy constructor. In this way a scanner instance can have state which is reused from sbuf to sbuf, generally to avoid reallocation but also perhaps useful when doing some kind of accumulate/reduce operation (like creating histograms)—each thread's instance can create its result of the partition of sbufs it receives and then those partial results can be aggregated at the end of processing, thread-safe and very efficient (presumably the operations would be associative and commutative).

While complicated framework APIs aren't fun, bulk_extractor isn't some terrible Java framework with a bewildering number of classes. A very simple base class API could be defined for scanners. The resulting code for scanners would be then be cleaner, simpler, and more idiomatic.

It may be possible to define such an interface that's then used by the existing function-based API, to avoid an all-or-nothing refactoring.

simsong commented 3 years ago

bulk_extractor does have an object-oriented interface. What it doesn't have is object-oriented linkage down to the scanners. However, it has a clearly defined object-based API, in which an object is provided as the sole argument to the scanner function. The scanner function is called with "C" linkage. And the scanner calls object interfaces on objects that are passed in to the function.

The reason that the linkage is "C" and not C++ is two-fold. First and most important, it allows scanners can be loaded dynamically on a variety of platforms. The goal was for bulk_extractor to be run as a drop-in program in a certified environment, but to allow end-users to write their own modules and have them load at run-time.

The second reason is that I tried to do run-time loading of C++ shared libraries and I couldn't get it to work in a way that was reliable.

A third reason is that the current "C" linkage is easier to make work with things like Lambda, easier to call between languages (I was able to call C functions from Python, but not C++ functions, due to name mangling; YMMV), and the like.

However, you can certainly write a class that supports the C linkage but the uses virtual functions to call each of its scanner parts. This is left as an exercise for the reader. I've written a few of these; one of them is used for the flex-based scanners, but they don't replace the actual linkage section.

While working on this C++17 rewrite, I realized that the a scanner linkage prototype could be written with the C++ template system. Then you would instantiate the linkage for each of your named scanners and subclass it. Again, I tried some prototypes, but then gave up on it.

I have a very limited amount of time to work on this project, so I'm putting my efforts into making sure that BE2 supports C++17 and thereby supports modern operating systems. It's been a lot of work. I've done a complete refactoring of the BE13_API system and it is much cleaner now. At some point in the future this might be worth doing.

jonstewart commented 3 years ago

Agreed that C++ linkage across shared libraries is harmful/impossible. I always create C APIs when linking, and just try to make bigger libraries to minimize build products and keep the Interface:Implementation ratio small.

I'm surprised that passing in the scanner_params object works—I don't think it would necessarily work between libraries built by different compilers, or all language runtimes.

Most classes here in be13_api are much improved, but the basic API for writing a new scanner is similar. I think most developers who can write C/C++ would find implementing ~4 C functions (create/clone/do/destroy) easier than managing the phases coming from different entry points. And then if the scanners were all part of the executable build, that could be handled through inheritance from a base class—even easier.

That's my main feedback with the API. I'm still having a hard time wrapping my head around all the sbuf lifetime issues; I'll need to get my hands dirty, and then see if any PRs result.

As an aside with lambdas, a nuance that's easy to forget about—because things still compile even if completely wrong—is needing to use std::ref and std::cref with reference parameters. There's a decent example halfway down this page: https://www.learncpp.com/cpp-tutorial/lambda-captures/. Not sure whether you've encountered that, but forgetting to use std::ref accounts for the majority of bugs I've seen with lambdas.

simsong commented 3 years ago

Passing the scanner_params works! It’s just a pointer to an in-memory structure, and the structure’s binary layout is part of the contract.

Our only lambdas are in the multi-threading, and I think that it’s clean. sbufs can be created and destroyed in any thread, and are frequently created in one thread and destroyed in another. That’s why I’m using new and delete, rather than unique_ptr. I couldn’t get the unique stuff to work properly.

Glad you approve of the cleaned up API.

I think that it’s more important for me to get validation of the existing scanners working than for me to write the clean sub-classable API. People haven’t had problems cloning existing scanners to create new ones. And I’ve even had summer interns write scanners. The big innovation is having the sbuf class mediate more of the memory access; it makes the scanners safer.

On Jul 2, 2021, at 11:51 AM, Jon Stewart @.***> wrote:

Agreed that C++ linkage across shared libraries is harmful/impossible. I always create C APIs when linking, and just try to make bigger libraries to minimize build products and keep the Interface:Implementation ratio small.

I'm surprised that passing in the scanner_params object works—I don't think it would necessarily work between libraries built by different compilers, or all language runtimes.

Most classes here in be13_api are much improved, but the basic API for writing a new scanner is similar. I think most developers who can write C/C++ would find implementing ~4 C functions (create/clone/do/destroy) easier than managing the phases coming from different entry points. And then if the scanners were all part of the executable build, that could be handled through inheritance from a base class—even easier.

That's my main feedback with the API. I'm still having a hard time wrapping my head around all the sbuf lifetime issues; I'll need to get my hands dirty, and then see if any PRs result.

As an aside with lambdas, a nuance that's easy too forget about—because things still compile even if completely wrong—is needing to use std::ref and std::cref with reference parameters. There's a decent example halfway down this page: https://www.learncpp.com/cpp-tutorial/lambda-captures/ https://www.learncpp.com/cpp-tutorial/lambda-captures/. Not sure whether you've encountered that, but forgetting to use std::ref accounts for the majority of bugs I've seen with lambdas.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/simsong/be13_api/issues/28#issuecomment-873095896, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAMFHLAE56Q2JQWRXFB3JFLTVXOBHANCNFSM47V4IELA.

simsong commented 2 years ago

Do you think that this is still necessary? It's certainly a hassle to create all of the things needed for a new scanner, but I'm not sure that it's worth doing this.

jonstewart commented 2 years ago

That’s a great question. I think the answer is yes, but when you “need” to can be at a time of your choosing.

I was thinking about this yesterday. I need to sit down and scribble down the code and then post here, but I think you could come up with a simple base class API, and then a small bit of template function/macro wizardry to generate the scan function for a class that implements the scanner base class. The upshot would be that you wouldn’t need to change the core code or old scanners, at least not right away.

The benefit of defining a class-based scanner API is fostering more contributions to bulk_extractor and making it easier to use as a research platform. A base class makes the contract of a scanner very obvious and accessible.

It also would be a great way of solving thread safety problems:

The scan function would be a template function and then it’d just be typed off the scanner class.

Jon

On Feb 10, 2022, at 9:05 PM, Simson L. Garfinkel @.***> wrote:

 Do you think that this is still necessary? It's certainly a hassle to create all of the things needed for a new scanner, but I'm not sure that it's worth doing this.

— Reply to this email directly, view it on GitHub, or unsubscribe. Triage notifications on the go with GitHub Mobile for iOS or Android. You are receiving this because you authored the thread.

simsong commented 2 years ago

I can certainly create a base class from which the scanners are subclassed. My first concern of this, though, is that I don't know how it would impact turning the scanner into a loadable shared library.

I hadn't thought of doing it with templates. I guess it could be done entirely with templates, but it's going to get gross.

jonstewart commented 2 years ago

What I’m thinking of is just a helper class abstraction that piggybacks on the scan function. It would be for writing scanners in C++, and wouldn’t affect loading plugins from shared libraries.

I will post a sketch here once I have fingers on keyboard.

On Feb 12, 2022, at 1:17 PM, Simson L. Garfinkel @.***> wrote:

 I can certainly create a base class from which the scanners are subclassed. My first concern of this, though, is that I don't know how it would impact turning the scanner into a loadable shared library.

I hadn't thought of doing it with templates. I guess it could be done entirely with templates, but it's going to get gross.

— Reply to this email directly, view it on GitHub, or unsubscribe. Triage notifications on the go with GitHub Mobile for iOS or Android. You are receiving this because you authored the thread.

simsong commented 2 years ago

Right. That would be easy to do. But you still need a way to get the scanner registered into either the scanner array or registered at runtime.

simsong commented 10 months ago

Is this still an issue?