Open Boshen opened 9 months ago
The size of these two types (either 16 or 32 bytes) should be dependent on real world statistics of the length of identifiers.
I'm actually not sure how much of an optimization inlining will be. The main advantage of compact_str
's CompactStr
vs std's String
in most use cases is that CompactStr
often avoids allocating. But that doesn't apply in OXC, because Atom<'a>
can just reference a slice of the source text - so no allocation either way.
Maybe it will still be a solid gain due to cache locality and faster string comparison, but I think we'd need to measure to know either way.
That could inform what ideal size is - make it bigger to inline more often vs smaller to reduce size of AST.
After some consideration, I believe we will be satisfied with the current implementation for a significant period. Although it's tempting to pursue further optimization, we should direct our energy toward endeavors with broader impact.
pub enum Atom<'a> {
Arena(&'a str),
Compact(CompactString),
}
OK cool. I do have a WIP already, so may finish it up when I get a chance, but yeah, the main problems (memory leak and unsoundness) are now solved.
Reopening since you have a WIP.
I think I got what you mean now.
Atom<'a>
is a pointer to the allocator. We should keep Atom<'a>
on the stack for for small string optimization, and it's an O(1) operation when converting to CompactString
.
Smart.
Yes, Atom<'a>
is either:
Atom<'a>
itself.So, yes, it's an O(1) operation (and a very cheap one) converting an inline Atom<'a>
to an inline CompactStr
(when string is 16 bytes or less) - because all the data is inline.
However, if the string is longer than 16 bytes, conversion is O(n) because the string content will need to be copied to general heap, so that CompactStr
can own that heap allocation. The longer the string, the higher the cost.
But my guess is that the majority of strings will be 16 bytes or less (because they are predominantly identifiers, which tend to be quite short in JS), so it'll be O(1) mostly.
If OXC transformer/semantic/etc can just use Atom<'a>
, without converting to CompactStr
, then zero conversion cost and no heap allocations at all. The downside is that then you have to deal with the 'a
lifetime.
I'd like to mention that:
Atom<'_>
to CompactStr
is O(1)CompactStr
to Atom<'_>
is O(n), require allocation on allocator. Writing back data to ast is very common operations in transfomer and bundler.@hyf0 Just to expand on what you've said:
Atom<'_>
to CompactStr
is O(1) and very cheap.CompactStr
to Atom<'_>
is O(1) and very cheap.Atom<'_>
to CompactStr
is O(n).CompactStr
to Atom<'_>
is O(n).Clone
for Atom<'_>
is O(1).Clone
for CompactStr
is O(n)."Inline" means any string 16 bytes or less. The "n" in "O(n)" is the length of the string.
Only difference between Atom<'a>
and CompactStr
is that CompactStr
owns a heap allocation storing the string data, whereas Atom<'a>
references string data in the arena allocator - the allocator owns the data, and Atom<'a>
just references it ('a
is lifetime of the allocator).
It is unavoidable that there's a cost to copying strings from general heap to arena, or in the other direction - and that cost increases the longer the string is.
The inline optimization is intended to ameloriate this cost for the common case - strings under 16 bytes - which I imagine covers most JS identifiers (but please tell me if your experience shows otherwise).
OXC's AST cannot contain types which own heap allocations, due its use of an arena allocator (discussed on https://github.com/rolldown-rs/rolldown/issues/427). For consumers who really want to maximize performance, the best route is to work with Atom<'_>
instead of CompactStr
. This can be somewhat painful because of dealing the lifetime, but it's the only way to "go with the grain" of OXC's design, while preserving all possible performance. CompactStr
is provided just as a convenience where consumer finds it impractical to have a lifetime.
NB: The above refers to Atom<'_>
and CompactStr
as they will be, not as they are now. So far, we've got the API to its intended shape, but the inline optimization is not implemented yet.
@hyf0 Just to add: The use of an arena allocator in OXC, and all the lifetimes it necessitates, is an unusual pattern in Rust. Personally I see it as a feature not a bug - it's extremely performant, but beyond that, OXC essentially managing its own memory also enables some really significant things which aren't otherwise possible.
However, at present it can be really painful for the consumer. I think we could better document patterns for interfacing with the allocator and AST which take advantage of it, and make the lifetimes (almost) painless. Such patterns do exist, but they're not immediately obvious.
So in short, from what I understand, the problem that you're facing is very real. But there may be a different solution from the one you're looking for with CompactStr
.
I was wondering if we could make the string type generic like Program<'_, S>
so it could gains advantages in every scenario.
Use Program<'a, Atom<'a>>
in scenario for oxc_module_lexer
and linter and other scenarios that don't need to interreact with string frequently.
Use Program<'a, OtherStr>
for transformer or bundler that require certain optimizations for the string type.
This is not my decision to make, but my view is this isn't a great solution for 2 reasons:
Firstly, it's not just Program<'a>
which would need to become Program<'a, S>
. Every AST type would also need it e.g. ExpressionStatement<'a, S>
, Expression<'a, S>
, IfStatement<'a, S>
etc.
This would be very unergonomic, and would make the API pretty horrible for other external consumers of OXC. So much so, that the solution I've proposed for handling different Span
types in https://github.com/oxc-project/oxc/issues/959#issuecomment-1982302527 went to great lengths to avoid exactly this, because I thought it was clearly unacceptable for readability/ergonomics.
Secondly, if your motivator is performance, my guess is that it wouldn't help with that. Yes, the string type wouldn't require conversions, so that's a gain, but on the other side, the whole AST would have to become Drop
, otherwise you have memory leaks - which means a complete tree traversal when dropping an AST. This is just a guess, but my assumption is that negative would outweigh the positive, and performance would be worse overall, not better.
It seems I have ended up being the guy who just says "no" all the time! That's not my nature, and it gives me no pleasure. I totally get why you're looking at every angle to find a solution, but it's just that I don't think this circle can be squared in the way you want. It's a fundamental limitation/feature/trade-off in the nature of the arena allocator which OXC uses.
More positively, can I suggest:
Atom<'a>
and CompactStr
is implemented, and then see how bad this performance problem is in practice after that?I'm afraid I wouldn't be able to do either of those things in the near future, but if we can put this on the back burner for a while, I'd be very happy to do that.
In meantime, if you have time, could you possibly outline in a bit more detail exactly what problem you're hitting?
Cool. I wasn't saying we have to do this way but just tried giving any thought for inspiration.
This would be very unergonomic
Yeah it may be. But since we are already dealing with lifetime. I guess if we do it this way, it's just another dx problem that need to be handled. For example, most of users would handle the default ast type like type Program = ProgramGeneric<'a, Atom<'a>>
, no differences compared to the current type. Only advanced users would use it.
Again, These are thoughts just for inspiration, not we have to do it.
Secondly, if your motivator is performance, my guess is that it wouldn't help with that. Yes, the string type wouldn't require conversions, so that's a gain, but on the other side, the whole AST would have to become Drop, otherwise you have memory leaks
Yeah. You could see Drop
that actually do something takes much time here. but the thing for oxc is that oxc ast could defautly has empty drop method and what's the cost of calling these method.
If the costing is on-demand(only when you used owed-string type) and the empty Drop
has little cost. I guess the whole Drop
thing won't be a problem anymore.
Can we wait until the inline optimization to Atom<'a> and CompactStr is implemented, and then see how bad this performance problem is in practice after that?
Totally, I'm not hurry for this but try to have collision of ideas, it's hard for me to find people to discuss such things.
Could we have a proper "in person" chat about all this some time?
This would be wonderful. Since this is not an urgent problem, we could wait until the proper timing. Like your actions of pursuing perfection on oxc. I learn a lot of practical knowledge on rust from that.
OK, let's leave the broader issue of owned/borrowed strings for time being and come back to it when we both have time to really get into the discussion.
Just to come back on a couple of points briefly:
Yeah. You could see
Drop
that actually do something takes much time here. but the thing for oxc is that oxc ast could defautly has empty drop method and what's the cost of calling these method.If the costing is on-demand(only when you used owed-string type) and the empty
Drop
has little cost. I guess the wholeDrop
thing won't be a problem anymore.
Yes, you're right. But what I'm saying is that in your use case, using the owned string type variant, you would have a Drop
which does a complete tree traversal, so it would cost you.
Totally, I'm not hurry for this but try to have collision of ideas, it's hard for me to find people to discuss such things.
I have the same problem! Always very happy to discuss.
This would be wonderful. Since this is not an urgent problem, we could wait until the proper timing. Like your actions of pursuing perfection on oxc. I learn a lot of practical knowledge on rust from that.
OK great. Agreed. By the way, it's very kind of you to say that. In return, congratulations on getting Rolldown in shape to be be open sourced in such a short time. Brilliant work.
Not working on this at present, but just wanted to make a note of a thought I had, before it gets lost in the ether...
I am wondering if we should have two different strings types:
It strikes me that the 2 are used in quite different ways. Identifiers are constantly being compared to each other to calculate scopes etc, and tend to be small. Other strings (e.g. string literals) can be much larger, and are generally just moved around rather than compared.
So maybe we could have 2 implementations optimized for the 2 different use cases. In particular fast hashing is likely a priority for identifiers, inlining is probably worthwhile, and maybe even interning. But all these things would be much less useful for string literals etc, so maybe they should just be a wrapper around &str
, and avoid the overhead of a "clever" implementation which never gets used.
If we did go with interning, maybe identifier strings could even be stored in a separate long-lived arena. So individual ASTs can come and go, but the identifiers stay. Maybe this could help with @hyf0's use case, even in the presence of Rolldown's incremental compilation. Obviously, multi-threading also comes into play, so this isn't simple. But maybe...
Just a thought...
Good thought! Sounds like an advanced version of https://github.com/rust-analyzer/smol_str.
A thought to come back to later:
IdentifierReference
contains both Atom<'a>
and ReferenceId
. Can we get rid of Atom
here, and store the atom in the reference which ReferenceId
points to?
This is interesting: https://github.com/xacrimon/dashmap/pull/287
We are commonly hashing Atom
s when e.g. building scope tree. Possibly could avoid repeated hashing and memory accesses by storing hash inline in Atom
.
Further thought:
I think we need a separate type Ident<'a>
for identifiers which are commonly used as keys in hash maps (bindings and references, basically).
Ident<'a>
should have the invariant that it is not an empty string. Then we can provide cheap Ident::first_char
, Ident::last_char
, Ident::first_byte
, and Ident::last_byte
methods which have no bounds checks.
Ident
Ident<'a>
is essentially a &'a str
, except with max len of u32::MAX
, and with hash stored inline. If we use a 32-bit hash:
// 16 bytes
struct Ident<'a> {
ptr: NonNull<u8>,
len: u32,
hash: u32,
_marker: PhantomData<&'a str>,
}
PartialEq::eq
becomes very cheap for non-equal Ident
s: If lengths or hashes do not match, no need to read or compare the actual strings.
Hash::hash
becomes just a read of the hash
field (extremely cheap).
The parser would generate Ident
s and store them in AST where we currently have Atom
s.
Many identifiers already have fairly expensive string comparisons in lexer, in order to check if they are keywords (await
etc). Calculating hash will probably not cost too much more than these checks already cost.
Possibly, we could make the hash function a perfect hash for these keywords to further increase the speed of keyword checks.
ident!
macro for static Ident
s which pre-calculates hashes at build time for common identifiers. Macro either generated in build script or codegen.
impl Ident<'static> {
const fn new_const_hashed(s: &'static str, hash: u32) -> Self {
assert!(s.len() <= u32::MAX as usize);
Self {
ptr: unsafe { NonNull::new_unchecked(std::ptr::from_ref(s).cast::<u8>().cast_mut()) },
len: s.len() as u32,
hash,
_marker: PhantomData,
}
}
}
macro_rules! ident {
(React) => Ident::new_const_hashed("React", 0xF8506214);
(createElement) => Ident::new_const_hashed("createElement", 0x4192A408);
// etc
}
let react_ident = ident!(React);
We could also intern idents (in local stores, 1 store per AST, not Sync
, so no locks).
IdentId
would be a 32-bit ID (like SymbolId
etc). This is what would be stored in AST where we currently have Atom<'a>
.
The actual Ident
s would be stored in a side array IndexVec<IdentId, Ident>
with a HashMap<Ident, IdentId>
to look up IdentId
for any string.
In parser, identifiers would be converted to IdentId
s by hashing them and then looking up IdentId
in the hash map. The same identifier name would always be resolved to the same IdentId
.
Now comparing 2 identifiers is a very cheap comparison of two u32
s, not an expensive string comparison. And hash maps can be keyed by IdentId
, rather than strings. Hashing a u32
with FxHash
is only 3 CPU ops, making hash map lookups really cheap, and also rehashing when a hashmap grows also much less expensive.
Also look into this for string comparisons: https://crates.io/crates/fastcmp
Unclear how much faster than std
it is for short strings (which are majority of JS identifiers).
This is a continuation of
Where the requirements are:
Make our own version of
Atom<'a>
andCompactString
.These 2 types should be immutable and inline-able. The size of these two types (either 16 or 32 bytes) should be dependent on real world statistics of the length of identifiers.