snguyenthanh / better_profanity

Blazingly fast cleaning of swear words (and their leetspeak) in strings
MIT License
211 stars 70 forks

Add unit tests for large input strings and a large corpus #20

Open jcbrockschmidt opened 3 years ago

jcbrockschmidt commented 3 years ago

In response to issue #19, we should add unit tests that run on large strings, as well as on a large corpus of strings. This should help us catch speed inefficiencies down the road.
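A unit test along these lines could look like the sketch below. The `censor` function here is a self-contained stand-in for the library's `profanity.censor()` (with a made-up two-word list), so the shape of the test is the point, not the censoring logic:

```python
import time

def censor(text, char="*"):
    # Stand-in for better_profanity's profanity.censor(); the real test
    # would call the library. Censors a tiny fixed word list so this
    # sketch stays self-contained.
    for word in ("darn", "heck"):
        text = text.replace(word, char * len(word))
    return text

def test_large_string():
    # Roughly 1 MB of text with a swear word sprinkled into every sentence.
    chunk = "a perfectly fine sentence with one darn word in it. "
    large_text = chunk * 20_000
    start = time.perf_counter()
    censored = censor(large_text)
    elapsed = time.perf_counter() - start
    assert "darn" not in censored
    # Loose upper bound, only meant to catch gross slow-downs.
    assert elapsed < 5.0
```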

snguyenthanh commented 3 years ago

I think this is a good idea. One problem is where to store the test dataset, as I would prefer not to ship a very big text file in the package that is used only for testing.

I could come up with a way to download the dataset only when running tests and remove it afterwards. @jcbrockschmidt Can you help benchmark and improve the current algorithm for long texts?
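The download-then-remove idea could be a small fixture like the sketch below. The dataset URL is hypothetical (the real link is still undecided in this thread), and `fetch` is injectable so tests can swap in a local stub:

```python
import contextlib
import os
import tempfile
import urllib.request

# Hypothetical URL; the real dataset link is still to be decided.
DATASET_URL = "https://example.com/profanity-test-corpus.txt"

@contextlib.contextmanager
def temporary_dataset(url=DATASET_URL, fetch=urllib.request.urlretrieve):
    # Download the corpus into a temp file for the duration of the tests,
    # then delete it so it never ships with the package.
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        fetch(url, path)
        yield path
    finally:
        os.remove(path)
```

Tests would then wrap their runs in `with temporary_dataset() as path: ...`, and the file is gone once the suite finishes.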

jcbrockschmidt commented 3 years ago

I am definitely willing to help. For the large unit tests, I was thinking it may be good enough to include a dozen or so paragraphs (and their censored counterparts) and test them repeatedly, 10 or 100 times each. As long as the set of repeated paragraphs includes an even mix of paragraphs with 1) no censored words, 2) some censored words, and 3) a lot of censored words, it should be enough to catch large slow-downs. This dataset shouldn't take up more than a few MB.
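The three-category mix described above could be timed like this. The paragraphs and the `censor` stand-in are assumptions for the sketch; a real benchmark would load the dozen curated paragraphs and call the library:

```python
import time

def censor(text):
    # Stand-in for profanity.censor(); kept trivial so the sketch runs
    # without the library installed.
    return text.replace("darn", "****")

# Assumed samples for the three categories: no, some, and many censored words.
CLEAN = "a calm paragraph with nothing objectionable at all. " * 5
SOME = "mostly fine text, but one darn word slips in. " * 5
MANY = "darn darn darn, this paragraph is full of darn words. " * 5

def benchmark(paragraphs, repeats=100):
    # Censor every paragraph `repeats` times and return total seconds,
    # so a regression shows up as a jump in this number.
    start = time.perf_counter()
    for _ in range(repeats):
        for paragraph in paragraphs:
            censor(paragraph)
    return time.perf_counter() - start

elapsed = benchmark([CLEAN, SOME, MANY], repeats=100)
```

A unit test would then assert `elapsed` stays under a generous bound, loose enough to tolerate CI noise but tight enough to flag an order-of-magnitude regression.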

It may, however, be a good idea to have a more extensive benchmarking script separate from these new unit tests. For that script, yes: I think downloading the dataset would be wiser. I have a rough benchmarking script already written. The biggest challenge will be finding a reliable download link for our dataset. The dataset I'm currently using is hosted on a lot of different websites of questionable reliability, so I'd need to track down its origin.

jcbrockschmidt commented 3 years ago

This link might be reliable enough for the Amazon reviews dataset I was looking at. We probably want to throw some extra datasets into the mix, though, such as very long documents (e.g. short stories or books) with some profanity included.