JeffreyYStewart closed this issue 2 years ago
@JeffreyYStewart: can you see why in the scrubber code?
Yes I can, but I am not sure if the mistake is in the documentation or the code. Do we want the numbers to be removed without leaving a space or do we want to leave the space?
fair Q ...
The documentation probably needs some tweaking, but I think there is a bug in the code:
return str(re.sub(pattern, r" ", text))
There probably shouldn't be a space there. If you want to replace digits with a space, you should use scrubber.replace.digits.
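To make the difference concrete, here is a minimal sketch that uses only Python's re module. The character-class pattern is an assumption about what only="1" compiles to; the actual pattern built inside the scrubber is not shown in this thread.

```python
import re

text = "Lexos is the number 12 text analysis tool."

# Assumption: only="1" results in a pattern that matches just the digit 1.
pattern = re.compile("[1]")

# Replacement used in the line quoted above: each match becomes a space.
with_space = re.sub(pattern, r" ", text)

# Proposed replacement: each match is deleted outright.
without_space = re.sub(pattern, "", text)

print(repr(with_space))     # two spaces between "number" and "2"
print(repr(without_space))  # one space between "number" and "2"
```

With the space replacement, the space that already preceded "12" plus the space substituted for "1" leave a doubled gap before "2".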
Alternatively, we could give scrubber.remove.digits a replace_with_space parameter (or something like that). This might be a useful convenience, and the user would be more likely to go to scrubber.replace.digits if they want to replace digits with something like the default value "_DIGIT_".
I think it would make the most sense to have scrubber.remove.digits just remove the digit and leave any form of replacing to scrubber.replace.digits.
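A rough sketch of that split of responsibilities might look like the following. The function names, signatures, and the _digit_pattern helper are hypothetical stand-ins, not the actual Lexos scrubber API; only the "_DIGIT_" default comes from the discussion above, and only is assumed to be a non-empty string of digit characters.

```python
import re


def _digit_pattern(only=None):
    """Hypothetical helper: match any digit, or only the characters in `only`."""
    if only is None:
        return re.compile(r"\d")
    # Assumption: `only` is a non-empty string of digit characters.
    return re.compile("[" + re.escape(only) + "]")


def remove_digits(text, only=None):
    """Delete matching digits and leave nothing in their place."""
    return _digit_pattern(only).sub("", text)


def replace_digits(text, replace_with="_DIGIT_", only=None):
    """Substitute each matching digit with `replace_with`."""
    # A callable replacement keeps backslashes in `replace_with` literal.
    return _digit_pattern(only).sub(lambda _match: replace_with, text)


sample = "Lexos is the number 12 text analysis tool."
removed = remove_digits(sample, only="1")
spaced = replace_digits(sample, replace_with=" ", only="1")
```

Under this design a user who wants a space left behind asks for it explicitly through the replace function, so the remove function needs no extra parameter.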
Issue has been resolved.
The scrubber tutorial indicates that scrubber.remove.digits should remove a digit without leaving a space in its place. This is not currently the case.
This is the example from the tutorial (without the other pipeline steps):
scrubbed_text = remove_digits("Lexos is the number 12 text analysis tool.", only="1")
The output indicated by the tutorial is:
"Lexos is the number 2 text analysis tool"
However, the true output is:
"Lexos is the number  2 text analysis tool"
The space that the "1" occupied was not removed.