Open llogiq opened 8 years ago
I'd be interested in helping out with this if I can get a bit more info on how you'd like things organized. I'm familiar with Postgres and SQLite but haven't yet had an excuse to go in-depth with Diesel.
Great! This issue is more about the design side of things and probably requires more understanding of the Rust HIR than of database work (with or without Diesel).
You can look at the (very sparse) internal compiler docs at Manish's github page: https://manishearth.github.io/rust-internals-docs/std/index.html – look for the `rustc::hir::Expr_` enum, especially the `Call` and `MethodCall` variants. Also the `Generics` struct. This is the model we'll have to...well, model.
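For orientation, here is a heavily simplified, hypothetical mirror of those shapes -- the real `rustc::hir` definitions carry far more information (spans, ids, full type paths), so treat this only as a sketch of what the schema needs to capture:

```rust
// Hypothetical, simplified mirror of the HIR shapes discussed above;
// NOT the real rustc definitions.
#[derive(Debug)]
enum Expr {
    // `foo(a, b)` -- always statically dispatched
    Call { callee: String, args: Vec<Expr> },
    // `x.foo(a)` -- may be dynamically dispatched if the receiver is a trait object
    MethodCall { receiver: Box<Expr>, method: String, args: Vec<Expr> },
    // stand-in for all other expression kinds
    Other,
}

// Simplified analogue of `rustc::hir::Generics`: type parameters with bounds.
#[derive(Debug)]
struct Generics {
    ty_params: Vec<TyParam>,
}

#[derive(Debug)]
struct TyParam {
    name: String,
    bounds: Vec<String>, // trait names, e.g. ["Into<String>"]
}

fn main() {
    // `name.into()` as it might appear inside a generic function
    let call = Expr::MethodCall {
        receiver: Box::new(Expr::Other),
        method: "into".to_string(),
        args: vec![Expr::Call { callee: "id".to_string(), args: vec![] }],
    };
    let generics = Generics {
        ty_params: vec![TyParam {
            name: "S".to_string(),
            bounds: vec!["Into<String>".to_string()],
        }],
    };
    println!("{:?}", call);
    println!("{:?}", generics);
}
```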
How stable is the HIR at the moment? I know there are lots of changes going on with respect to the MIR-- will any of these have a noticeable impact on NSA's internal representation? In other words, how stable can the schema be at this point in time?
HIR changes slightly around once a week.
I don't think that will be a problem for our schema – Generics aren't likely to change much (until we get HKTs that is – but then I'll gladly extend the schema to make it work :smile:)
Sounds good. I'll try and get a first attempt posted here sometime tomorrow night (I'm in PST).
Looking at `rustc::hir::Expr_` I see that there are separate expressions for `ExprCall` and `ExprMethodCall`. Aside from trait object calls (which are an admittedly confusing special case), is there any reason to represent method calls (which can be resolved to UFCS) separately from normal function calls?
Yes. Plain `Call`s are always statically dispatched, whereas `MethodCall`s may be dynamically dispatched (if `self` is a trait object).
Sorry, maybe my question was unclear. I was trying to ask whether `MethodCall` should be kept separate in the case that `self` is not a trait object (which I believe we can detect fairly easily). It seems like the semantics for handling trait objects will be different from those of all other function calls (including non-trait-object method calls), which could all be resolved statically.
For example, you list in the README several goals including determining if a function may panic, is pure, is recursive, or allocates memory. Following the call graph to determine any of these properties is relatively simple in the case of statically dispatched functions, but much more complicated in the case of trait objects, where it would involve pointer analysis to determine all the different types a given trait object could possibly point to.
Because of this, the handling of statically dispatched methods is going to be much more similar to that of plain function calls than that of methods on trait objects.
Yes. Luckily, a) dynamic dispatch isn't all that common in Rust and b) for most intents & purposes it's sufficient to look up all possible impls and see if their methods panic, allocate, call back or whatever.
This also means we'll need to store trait impls for later lookup.
Obviously very early, but here's what I've got so far.
The visualizer didn't quite get my foreign keys right, so I'll just explain them here:
Paths have a foreign key to a (nullable) parent path to create a hierarchical structure (`mem`, for instance, would have the parent `std`, which would have no parent).
Function calls can be static or not, and this determines whether they have a foreign key to a function implementation or to a trait function.
Function call arguments are foreign keyed to the type, the function call instance, and the patkind (I wasn't sure what information from `rustc_front::hir::PatKind` was relevant, so I haven't yet created this).
Function implementations have a path foreign key, a name, and (if they are an implementation of a trait function) a foreign key to the relevant trait and type. Eventually, this is where I foresee we will attach attributes such as whether or not a function can panic, which can then be propagated upwards after all the data has been entered.
Traits currently just have an id, name and path.
Trait bounds are a many-to-many linking traits to their children (e.g. the id of `Eq` -> the id of `PartialEq`).
Trait functions are the functions that need to be implemented in order to satisfy a trait.
The trait implementation list is another many-to-many linking traits to the types that implement them.
Finally, types have a path, name, and other associated metadata.
Let me know your thoughts-- am I headed in the right direction? Should I be sticking more closely to the exact representation used in HIR? Does anything not make sense?
I left this out in my description, but obviously there will also be a many-to-many piece describing what functions are called by what other functions.
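To make the proposal concrete, here is a minimal in-memory sketch of two of the relations described above (the recursive path hierarchy and the `trait_bound` many-to-many). The names are illustrative stand-ins for the eventual SQLite/Diesel tables, not the real schema:

```rust
use std::collections::HashMap;

// In-memory stand-in for the proposed `path` table: each row may point at a parent.
struct PathRow {
    parent: Option<u32>,
    name: String,
}

// Stand-in for `trait_bound`: the trait `child` is bounded by the trait `parent`.
struct TraitBound {
    child: u32,
    parent: u32,
}

// Reconstruct the full path (e.g. "std::mem") by walking the parent links.
fn full_path(rows: &HashMap<u32, PathRow>, id: u32) -> String {
    let row = &rows[&id];
    match row.parent {
        Some(p) => format!("{}::{}", full_path(rows, p), row.name),
        None => row.name.clone(),
    }
}

// All direct supertraits of `child` (e.g. Eq -> PartialEq).
fn supertraits(bounds: &[TraitBound], child: u32) -> Vec<u32> {
    bounds.iter().filter(|b| b.child == child).map(|b| b.parent).collect()
}

fn main() {
    let mut rows = HashMap::new();
    rows.insert(1, PathRow { parent: None, name: "std".to_string() });
    rows.insert(2, PathRow { parent: Some(1), name: "mem".to_string() });
    assert_eq!(full_path(&rows, 2), "std::mem");

    // ids are illustrative: 10 = Eq, 11 = PartialEq
    let bounds = vec![TraitBound { child: 10, parent: 11 }];
    assert_eq!(supertraits(&bounds, 10), vec![11]);
    println!("ok");
}
```

In SQL terms these are just a self-referential nullable foreign key and a two-column link table; the Rust code only illustrates the lookups the schema has to support.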
@cramertj Just a heads-up – I renamed the project to metacollect due to some people feeling uneasy about the old name. No need to stoke emotions.
@llogiq Yeah, I saw the Reddit discussion. Some very strong opinions out there... That aside, it seems like a good move as it's a far more descriptive name.
@llogiq Did you get a chance to check out the schema I posted above?
Sorry, not yet. I'll need to catch some sleep first – maybe tomorrow.
This has languished for some time. Sorry for the delays. There have been a few developments in the meantime:
Looking into the schema, I like the recursive path idea. I'd like to see the `fn`-`fn` (function/method call) relation as the central element; this does not seem to be reflected in the schema as of yet.
Types will obviously have to grow generics and substitutions (as outlined above). Also, I think that `trait_bound` is underspecified: what do those `id`s refer to?
Can you explain what you mean by "type contents?"
WRT the specialization concern, I'm wondering exactly how you'd like to handle generics then. It seems to me that the easiest route would be to include monomorphization as a step in the analysis. That would allow us to treat all our static calls as non-generic. Otherwise, we'll have to come up with some way of tracking the flow of types down through the stack.
We could also take the less accurate but more general approach of examining all versions of a function that could occur (based on the trait impls we know about). By starting at the lowest level of fn calls (those that don't call other generic functions) we could propagate our metrics upwards to all generic functions that could potentially call that function (either directly or through a trait impl).
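The bottom-up propagation described here can be sketched over a toy call graph (all function names are hypothetical; the real analysis would walk the collected tables, and a generic call would expand to every known impl of the bound's method):

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical sketch of "propagate upwards": a function may panic if it
// panics directly, or if anything it (transitively) calls may panic.
fn may_panic(
    calls: &HashMap<&str, Vec<&str>>,   // fn -> fns it calls
    panics_directly: &HashSet<&str>,    // fns containing e.g. a panic!()
    f: &str,
    seen: &mut HashSet<String>,         // guard against recursive call cycles
) -> bool {
    if panics_directly.contains(f) {
        return true;
    }
    if !seen.insert(f.to_string()) {
        return false; // already being visited: don't loop on recursion
    }
    if let Some(callees) = calls.get(f) {
        for c in callees {
            if may_panic(calls, panics_directly, c, seen) {
                return true;
            }
        }
    }
    false
}

fn main() {
    let mut calls = HashMap::new();
    // hypothetical: Person::new only calls an infallible Into impl
    calls.insert("Person::new", vec!["string_into"]);
    // hypothetical: get_first indexes a slice, which can panic
    calls.insert("get_first", vec!["slice::index"]);

    let mut panics = HashSet::new();
    panics.insert("slice::index");

    assert!(!may_panic(&calls, &panics, "Person::new", &mut HashSet::new()));
    assert!(may_panic(&calls, &panics, "get_first", &mut HashSet::new()));
    println!("ok");
}
```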
The `fn`-`fn` relation, if I've understood correctly what you mean, is currently represented as the `fn_call` relation, with `fn_call_arg` representing a single argument of a single call.
`trait_bound` is a many-to-many linker saying that the trait specified by `child_id` is bounded by the trait specified by `parent_id`. This obviously only works for plain trait bounds (not HRTBs), but it seemed like a decent starting point.
With "type contents" I mean the composition of structs and enums.
I'm wary of monomorphization, because there are existential types, and trying to monomorphize can give us an exponential number of monomorphized types (for example, see https://github.com/paholg/typenum – and there's code out there mishandling the type system in worse ways!), whereas tracking the "flow" of generics (via generics and substitutions, as it is done in the compiler) avoids this pitfall. There are some ugly corner cases, though. Still, I think it's better to think them through when we encounter them. Better a 90% solution now than a 100% solution never.
Ok, I think I get your idea of trait bounds now. This requires us to mix existential and composed types in one table; we should make sure that we can distinguish those.
So, through flow analysis, the goal is to only analyze a function's metadata for call conditions that may exist in the current code? How should this interact with library crates calling something like `into()` or `deref()`, which rely upon a user or std implementation (which, for all we know, may panic)? Especially for authors of unsafe libraries, this seems to remove the advantage of knowing whether some code could possibly panic. If tacked-on user code could change the properties of these functions, it seems like we're making the utility less useful than it could be.
With `into()` and `deref()` calls, we record the exact receiving types, and thus can pinpoint the trait impls. As long as all dependencies (and yes, that probably includes `std`) are analyzed by metacollect, we have a complete picture of what code may panic where. Also what code is pure, or at least idempotent.
Only if we're analyzing binary code, right? This won't work for library code?
(because not all dependencies of the actual execution context could be analyzed, only the dependencies of the library)
OK, we don't analyze FFI code. We only analyze Rust code for the crate and all dependencies. Since we expect metacollect to run on the whole dependency tree, this should give us a large part of the whole picture.
Perhaps I'm not explaining my question correctly. Yes, we can't work with FFI code. However, when writing a library, there are often exposed APIs which are generic on some input. When calling a trait method on one of these inputs, we can't tell (without having access to the client's crate) what the concrete implementation of the trait method is, and therefore can't tell whether or not it will panic.
For example:
```rust
struct Person {
    name: String,
}

impl Person {
    fn new<S>(name: S) -> Person where S: Into<String> {
        Person {
            // This into() fn might panic or be impure, but we don't know, because we didn't write it.
            name: name.into()
        }
    }
}
```
There's really no way to get around this, so far as I can tell, aside from giving some sort of warning about possible client-crate-defined behavior.
That's why we collect (as in "metacollect") the generics and substitutions. Say, metacollect records that `Person::new(_)` is generic over a type with a trait bound of `Into<String>`. Later it records a client that calls `Person::new(_)` with a type whose `Into<String>::into(self)` method may panic. We can then conclude that the `Person::new(_)` call from this client can panic, too.
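A toy sketch of that substitution lookup (the table shape and all names are hypothetical, not metacollect's actual schema): we record, per concrete type, whether its impl of the bound's method may panic, and a client call site simply substitutes a concrete type for the generic parameter.

```rust
use std::collections::HashMap;

// Hypothetical lookup table: (concrete type, trait method) -> may panic?
// Impls we haven't analyzed are conservatively assumed to panic.
fn call_may_panic(
    impl_may_panic: &HashMap<(&str, &str), bool>,
    substituted_type: &str,   // the concrete type at the call site
    bound_method: &str,       // e.g. "Into<String>::into"
) -> bool {
    *impl_may_panic
        .get(&(substituted_type, bound_method))
        .unwrap_or(&true)
}

fn main() {
    let mut impls = HashMap::new();
    // identity conversion, cannot panic
    impls.insert(("String", "Into<String>::into"), false);
    // hypothetical client type with a panicky Into impl
    impls.insert(("ScaryType", "Into<String>::into"), true);

    // Person::new::<String> is fine; Person::new::<ScaryType> may panic.
    assert!(!call_may_panic(&impls, "String", "Into<String>::into"));
    assert!(call_may_panic(&impls, "ScaryType", "Into<String>::into"));
    println!("ok");
}
```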
Remember, metacollect is not about reporting, but about collection.
Ah, I see. Sorry for the confusion-- I understand now.
Since I'd like to use Diesel with a SQLite database to store the data, I need a database schema that can do the following: