Database Schema - Githubissues

llogiq commented 8 years ago

Since I'd like to use Diesel with a SQLite database to store the data, I need a database schema that can do the following:

insert and lookup type composition relation (i.e. Type_A ⊂ Type_B) – generics are a complication here
store and query the method call graph – traits and specialization are the big problems for this

cramertj commented 8 years ago

I'd be interested in helping out with this if I can get a bit more info on how you'd like things organized. I'm familiar with Postgres and SQLite but haven't yet had an excuse go in-depth with Diesel.

llogiq commented 8 years ago

Great! This issue is more about the design side of things and probably requires more understanding of the Rust HIR than of database work (with or without Diesel).

You can look at the (very sparse) internal compiler docs at Manish's github page: https://manishearth.github.io/rust-internals-docs/std/index.html – look for the rustc::hir::Expr_ enum, especially the Call and MethodCall variants. Also the Generics struct. This is the model we'll have to...well, model.

cramertj commented 8 years ago

How stable is the HIR at the moment? I know there are lots of changes at the moment going on with respect to the MIR-- will any of these have a noticeable impact on NSA's internal representation? In other words, how stable can the schema be at this point in time?

Manishearth commented 8 years ago

HIR changes slightly around once a week.

llogiq commented 8 years ago

I don't think that will be a problem for our schema – Generics aren't likely to change much (until we get HKTs that is – but then I'll gladly extend the schema to make it work :smile:)

cramertj commented 8 years ago

Sounds good. I'll try and get a first attempt posted here sometime tomorrow night (I'm in PST).

cramertj commented 8 years ago

Looking at rustc::hir::Expr_ I see that there are separate expressions for ExprCall and ExprMethodCall. Aside from trait objects calls (which are an admittedly confusing special case) is there any reason to represent method calls (which can be resolved to UFCs) separately from normal function calls?

llogiq commented 8 years ago

Yes. Plain Calls are always statically dispatched, whereas MethodCalls may be dynamically dispatched (if self is a trait object).

cramertj commented 8 years ago

Sorry, maybe my question was unclear. I was trying to ask if MethodCall should be kept separate in the case that self is not a trait object (which I believe we can detect fairly easily). It seems like the semantics for handling trait objects will be different from all other function calls (including non-trait object method calls), which could all be resolved statically.

For example, you list in the README several goals including determining if a function may panic, is pure, is recursive, or allocates memory. Following the call graph to determine any of these properties is relatively simple in the case of statically dispatched functions, but much more complicated in the case of trait objects, where it would involve pointer analysis to determine all the different types a given trait object could possibly point to.

Because of this, the handling of statically dispatched methods is going to be much more similar to that of plain function calls than that of methods on trait objects.

llogiq commented 8 years ago

Yes. Luckily, a) dynamic dispatch isn't all that common in Rust and b) for most intents & purposes it's sufficient to look up all possible impls and see if their methods panic, allocate, call back or whatever.

This also means we'll need to store trait impls for later lookup.

cramertj commented 8 years ago

Obviously very early, but here's what I've got so far.

The visualizer didn't quite get my foreign keys right, so I'll just explain them here:

Paths foreign key to (nullable) parent paths to create a hierarchical structure (mem, for instance, would have the parent std, which would have no parent).

Function calls can be static or not, and this decides whether or not they have a foreign key to a function implementation or a trait function.

Function call arguments are foreign keyed to type, the function call instance, and the patkind (I wasn't sure what information from rustc_front::hir::PatKind was relevant, so I haven't yet created this).

Function implementations, a path foreign key, a name, and (if they are an implementation of a trait function) a foreign key to the relevant trait and type. Eventually, this is where I forsee we will attach attributes such as whether or not a function can panic, which can then be propagated upwards after all the data has been entered.

Traits currently just have an id, name and path.

Trait bounds are a many-to-many linking traits to their children (i.e. id of Eq -> id of PartialEq).

Trait functions are the functions that need to be implemented in order to satisfy a trait.

The trait implementation lists is another many-to-many linking traits to the types that implement them.

Finally, types have a path, name, and other associated metadata.

Let me know your thoughts-- am I headed in the right direction? Should I be sticking more closely to the exact representation used in HIR? Does anything not make sense?

cramertj commented 8 years ago

I left this out in my description, but obviously there will also be a many-to-many piece describing what functions are called by what other functions.

llogiq commented 8 years ago

@cramertj Just a heads-up – I renamed the project to metacollect due to some people feeling uneasy about the old name. No need to stoke emotions.

cramertj commented 8 years ago

@llogiq Yeah, I saw the Reddit discussion. Some very strong opinions out there... That aside, it seems like a good move as it's a far more descriptive name.

cramertj commented 8 years ago

@llogiq Did you get a chance to check out the schema I posted above?

llogiq commented 8 years ago

Sorry, not yet. I'll need to catch some sleep first – maybe tomorrow.

llogiq commented 8 years ago

This has languished for some time. Sorry for the delays. There have been a few developments in the meantime:

We don't need to track type contents, because rustc already gives us that information even across crates
I managed to play with specialization a little and came to the conclusion that we will need to handle generics and substitutions (because the impls can vary based on types)

Looking into the schema, I like the recursive path idea. I'd like to see the fn - fn (function/method call) relation as the central element, this does not seem to be reflected in the schema as of yet. Types will obviously have to grow generics and substitutions (as outlined above), also I think that trait_bound is underspecified: What do those ids refer to?

cramertj commented 8 years ago

Can you explain what you mean by "type contents?"

WRE to the specialization concern, I'm wondering exactly how you'd like to handle generics then. It seems to me that the easiest route would be to include monomorphization as a step in the analysis. That would allow us to treat all our static calls as non-generic. Otherwise, we'll have to come up with some way of tracking the flow of types down through the stack.

We could also take the less accurate but more general approach of examining all versions of a function that could occur (based on the trait impls we know about). By starting at the lowest level of fn calls (those that don't call other generic functions) we could propagate our metrics upwards to all generic functions that could potentially call that function (either directly or through a trait impl).

The fn-fn relation, if I've understood correctly what you mean, is currently represented as the fn_call relation, with the fn_call_arg representing a single argument for a single call.

trait_bound is a many-to-many linker saying that the trait specified by child_id is bounded by the trait specified by parent_id. This obviously only works for plain trait bounds (not HRTBs), but it seemed like a decent starting point.

llogiq commented 8 years ago

With "type contents" I mean the composition of structs and enums.

I'm wary of monomorphization, because a) there are existential types and trying to monomorphize can give us an exponential of monomorphized types (for example, see https://github.com/paholg/typenum – and there's code out there mishandling the type system in worse ways!) whereas tracking the "flow" of generics (via generics and substitutions, as it is done in the compiler) avoids this pitfall. There are some ugly corner cases though. Still I think it's better to think them through when we encounter them. Better a 90% solution now than a 100% solution never.

Ok, I think I get your idea of trait bounds now. This requires us to mix existential and composed types in one table; we should make sure that we can distinguish those.

cramertj commented 8 years ago

So, through flow analysis, the goal is to only analyze a function's metadata for call conditions that may exist in the current code? How should this interact with library crates calling something like into() or deref() which rely upon a user or std implementation (which, for all we know, may panic)? Especially for authors of unsafe libraries, this seems to remove the advantage of knowing whether some code could possibly panic. If tacked-on user code could change the properties of these functions, it seems like we're making the utility less useful than it could be.

llogiq commented 8 years ago

With into() and detef() calls, we record the exact receiving types, and thus can pinpoint the trait impls. As long as all dependencies (and yes, that probably includes std) are analyzed by metacollect, we have a complete picture what code may panic where. Also what code is pure, or at least idempotent.

cramertj commented 8 years ago

Only if we're analyzing binary code, right? This won't work for library code?

cramertj commented 8 years ago

(because not all dependencies of the actual execution context could be analyzed, only the dependencies of the library)

llogiq commented 8 years ago

OK, we don't analyze FFI code. We only analyze Rust code for the crate and all dependencies. Since we expect metacollect to run on the whole dependency tree, this should give us a large part of the whole picture.

cramertj commented 8 years ago

Perhaps I'm not explaining my question correctly. Yes, we can't work with FFI code. However, when writing a library, there are often exposed APIs which are generic on some input. When calling a trait method on one of these inputs, we can't tell (without having access to the client's crate) what the concrete implementation of the trait method is, and therefore can't tell whether or not it will panic.

For example:

struct Person {
    name: String,
}

impl Person {
    fn new<S>(name: S) -> Person where S: Into<String> {
        Person {
            // This into() fn might panic or be impure, but we don't know, because we didn't write it.
            name: name.into() 
        }
    }
}

There's really no way to get around this, so far as I can tell, aside from giving some sort of warning about possible client-crate-defined behavior.

llogiq commented 8 years ago

That's why we collect (as in "metacollect") the generics and substitutions. Say, metacollect records that Person::new(_) is generic over a type with a trait bound of Into<String>. Later it records a client that calls Person::new(_) with a type whose Into<String>::into(self) method may panic. We can then conclude that the Person::new(_) call from this client can panic, too.

Remember, metacollect is not about reporting, but about collection.

cramertj commented 8 years ago

Ah, I see. Sorry for the confusion-- I understand now.

llogiq / metacollect

Database Schema #2