kaikalii / censor

Rust profanity filtering library
10 stars 3 forks source link

add count method to censor #5

Closed EvanBurbidge closed 1 year ago

EvanBurbidge commented 1 year ago

Counts occurrences of swear words in a string

EvanBurbidge commented 1 year ago

https://github.com/kaikalii/censor/issues/4

kaikalii commented 1 year ago

Any function named count should definitely return a usize. Also, Censor::replace allocates, and counting how many words are censored should be possible without allocating.

EvanBurbidge commented 1 year ago

Any function named count should definitely return a usize. Also, Censor::replace allocates, and counting how many words are censored should be possible without allocating.

Can you expand on what you mean by allocates ( only two weeks into using rust from a JS background )

kaikalii commented 1 year ago

Any function named count should definitely return a usize. Also, Censor::replace allocates, and counting how many words are censored should be possible without allocating.

Can you expand on what you mean by allocates ( only two weeks into using rust from a JS background )

Sure. Censor::replace (and indeed str::replace in the standard library) allocate memory. All the bytes of the string are stored in a buffer. When replacing characters, it's not possible to reuse the same buffer because a) it is passed immutably and b) the new string might be bigger than the old one. So, a new buffer is allocated to store the bytes of the new string. JavaScript does this too, since strings in JS are immutable. In Rust we usually care about performance more than in JS, and so it is good to avoid allocation when possible. Lots of operations allocate memory, but the main ones to keep in mind in Rust are {Box|Rc|Arc|etc}::new and anything that creates a new variable-length buffer. Rust types that contain variable length buffer include Vec, String, HashMap, etc.

What I meant by my comment was that because Censor::count would only be calculating some data, not changing it or creating a new string, it should theoretically be possible to count the censored words without allocating anything new.

I haven't looked at the code in a while though. IIRC, I think I do some frivolous allocations in some places.

But also, the method of counting censored words by counting words with *s in them would incorrectly count inputs that alreadyhave`s.

EvanBurbidge commented 1 year ago

Any function named count should definitely return a usize. Also, Censor::replace allocates, and counting how many words are censored should be possible without allocating.

Can you expand on what you mean by allocates ( only two weeks into using rust from a JS background )

Sure. Censor::replace (and indeed str::replace in the standard library) allocate memory. All the bytes of the string are stored in a buffer. When replacing characters, it's not possible to reuse the same buffer because a) it is passed immutably and b) the new string might be bigger than the old one. So, a new buffer is allocated to store the bytes of the new string. JavaScript does this too, since strings in JS are immutable. In Rust we usually care about performance more than in JS, and so it is good to avoid allocation when possible. Lots of operations allocate memory, but the main ones to keep in mind in Rust are {Box|Rc|Arc|etc}::new and anything that creates a new variable-length buffer. Rust types that contain variable length buffer include Vec, String, HashMap, etc.

What I meant by my comment was that because Censor::count would only be calculating some data, not changing it or creating a new string, it should theoretically be possible to count the censored words without allocating anything new.

I haven't looked at the code in a while though. IIRC, I think I do some frivolous allocations in some places.

But also, the method of counting censored words by counting words with *s in them would incorrectly count inputs that alreadyhave`s.

Got it so instead of creating a new variable it might be better to do something like

pub fn count(&self, text: str) -> uzise {
    self.replace(text, "*").split(" ").filter(|&w| w.contains("*")).count()
}
kaikalii commented 1 year ago

Got it so instead of creating a new variable it might be better to do something like

pub fn count(&self, text: str) -> uzise {
    self.replace(text, "*").split(" ").filter(|&w| w.contains("*")).count()
}

That's correct for the return type, but like I said, counting the number of censored words by counting the number of words with * in them would be incorrect. If my input string were *.txt, it would incorrectly tell me that there is 1 censored word.

On reviewing the code, I see I was way off about the cost savings of avoiding allocation in this case. Even the most basic analysis the library does involves allocating sets of indices.

The best solution I see is to use the Censor::bad_chars function, which gives you a set of the indices of characters that are involved in a censored word. Then, count the number of contiguous groups of bad character indices.

I've implemented it in 666709779e5961a91903c823b6929249073644ba, and I've pushed a 0.3.0 release to crates.io.

With the feature implemented, I'll close this.