rust-lang / libs-team

The home of the library team
Apache License 2.0
116 stars 18 forks source link

Add a `CStr::count_bytes()` method #256

Closed tgross35 closed 12 months ago

tgross35 commented 1 year ago

Proposal

Problem statement

Currently, the best way to evaluate the length of a CStr is cs.to_bytes().len(). This is not ergonomic, and may not be the best option when CStr becomes a thin pointer.

Motivating examples or use cases

For both a fat and thin CStr, this will provide a more intuitive method to get the length. When CStr becomes thin, CStr::count_bytes will be able to call strlen directly without going through the less direct to_bytes > to_bytes_with_nul > slice::from_raw_parts > len.

Solution sketch

Add a method CStr::count_bytes that returns its length. Currently, we can use a self.inner.len() - 1 to get the length of the string, which is a constant time operation. Once CStr becomes thin, this will become an O(1) call to strlen - we just need to make it clear in documentation that this will have a performance regression at some point (as we do for the to_bytes() methods).

The API is simple:

impl CStr {
    pub fn count_bytes(&self) -> usize {
        self.inner.len()
    }
}

Once thin, this will become:

impl CStr {
    pub fn count_bytes(&self) -> usize {
        strlen(self.inner_ptr)
    }
}

Alternatives

The status quo: accessing length via .to_bytes().len(). This ACP intends to improve upon this option.

constness

Currenly CStr::from_ptr uses a version of strlen that calls libc's implementation normally, or a naive Rust implementation when const. This works via const_eval_select which to my knowledge hasn't been okayed for internal use. So, we will not be able to make CStr::count_bytes const stable until either that happens, or we decide our naive strlen is always OK (making CStr::new const is also blocked by this, or by the switch to a thin pointer).

My current implementation gates the constness under const_cstr_from_ptr which has the same const-stable blocker.

cc @oli-obk because of your comment onconst_eval_select here https://github.com/rust-lang/rust/pull/107624#discussion_r1101911091 (I believe this is the RFC: https://github.com/rust-lang/rfcs/pull/3352)

Naming

The name count_bytes is up for some bikeshedding.

Links and related work

What happens now?

This issue is part of the libs-api team API change proposal process. Once this issue is filed the libs-api team will review open proposals as capability becomes available. Current response times do not have a clear estimate, but may be up to several months.

tgross35 commented 1 year ago

Not yet approved of course, but this is small enough that I got started on work.

Tracking issue: https://github.com/rust-lang/rust/issues/114441

Implementation PR: https://github.com/rust-lang/rust/pull/114443

kadiwa4 commented 1 year ago

That same performance regression will also inevitably happen to CStr::to_bytes (if the implementation switch is actually done), and it already has a note in the docs. Why not do the exact same thing to CStr::len (which is equivalent to c_str.to_bytes().len() as you pointed out) and use a performant impl for now?

cuviper commented 1 year ago

I agree, we should not hamstring this on purpose. A similar doc note would be appropriate, so we don't make promises we might not want to keep, but we would add this to the performance considerations of making that switch.

tgross35 commented 1 year ago

That is all fair. I have updated the proposal to not call strlen at this time, will do my implementation PR soon (changed)

m-ou-se commented 1 year ago

We discussed this in last week's libs-api meeting. Since the plan is still for &CStr to become a thin pointer, such that determining the length will be a O(N) operation at some point in the future, we don't think .len() is be an appropriate name for this method, as it hides the fact that some actual 'work' is done by this function.

The .to_bytes() method is called to_bytes and not as_bytes because it's not a 'free' conversion. Similarly, a length method should make clear it is not a O(1) operation, perhaps by adding a verb into the method name. (For example, count_bytes might be a good method name.)

programmerjake commented 1 year ago

For example, count_bytes might be a good method name.

in that case there should be a doc alias for len and strlen

tgross35 commented 1 year ago

Thanks @m-ou-se. I updated this proposal and the implementation to use count_bytes instead, with the doc alias that @programmerjake suggested.

I am wondering if there may be a better name yet since count_bytes() will not return the same value as to_bytes().len() - but I don't know what would be better and don't feel super strongly about it. Maybe calc_len()? Nevermind, I forgot we have two different to_bytes functions

programmerjake commented 1 year ago

I am wondering if there may be a better name yet since count_bytes() will not return the same value as to_bytes().len()

isn't count_bytes supposed to return the same value as strlen?! v.to_bytes().len() == strlen(v.as_ptr())

perhaps you were thinking of to_bytes_with_nul()?

tgross35 commented 1 year ago

perhaps you were thinking of to_bytes_with_nul()?

Yes I am, please ignore me 🙂

dtolnay commented 12 months ago

I would be happy to accept a PR for unstable count_bytes. Thank you!