huggingface / datatrove

Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
Apache License 2.0

Naming Gopher's "max_non_alpha_words_ratio" #289

Open BramVanroy opened 1 week ago

BramVanroy commented 1 week ago

The Gopher filter contains this check:

# that 80 % of words in a document contain at least one alphabetic character
if (
    self.max_non_alpha_words_ratio
    and sum([any((c.isalpha() for c in w)) for w in words]) / n_words < self.max_non_alpha_words_ratio
):
    return False, "gopher_below_alpha_threshold"

Given that all documents with a LOWER ratio are removed, I would expect the variable to be named min_non_alpha_words_ratio, consistent with all the other variable names.
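
To make the behaviour concrete, here is a minimal standalone sketch of what the check computes. The function names `alpha_words_ratio` and `passes_alpha_check` and the default threshold of 0.8 are hypothetical, chosen for illustration only; they are not part of datatrove's API. The sketch shows that the computed value is the fraction of words containing at least one alphabetic character, and that it acts as a lower bound:

```python
def alpha_words_ratio(words):
    """Fraction of words that contain at least one alphabetic character."""
    if not words:
        return 0.0
    return sum(any(c.isalpha() for c in w) for w in words) / len(words)


def passes_alpha_check(words, threshold=0.8):
    """Hypothetical stand-in for the quoted check (not datatrove's API).

    The document is rejected when the ratio falls BELOW the threshold,
    so the threshold behaves as a minimum, not a maximum.
    """
    if threshold and alpha_words_ratio(words) < threshold:
        return False, "gopher_below_alpha_threshold"
    return True, None


# 3 of 5 words contain a letter: ratio 0.6 < 0.8, so the document is rejected.
mostly_symbols = ["hello", "world", "123", "!!!", "ok"]
# All 5 words contain a letter: ratio 1.0, so the document is kept.
all_alpha = ["the", "quick", "brown", "fox", "jumps"]

print(alpha_words_ratio(mostly_symbols), passes_alpha_check(mostly_symbols))
print(alpha_words_ratio(all_alpha), passes_alpha_check(all_alpha))
```

A document is dropped only when its ratio is too low, which is why a "max" prefix reads as misleading here.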