scottkleinman / lexos

Development repo for the Lexos API
MIT License

scrubber.remove.digits leaves spaces #6

Closed JeffreyYStewart closed 2 years ago

JeffreyYStewart commented 2 years ago

The scrubber tutorial indicates that scrubber.remove.digits should remove a digit without leaving a space in its place. This is not currently the case.
This is the example from the tutorial (without the other pipeline steps):
scrubbed_text = remove_digits("Lexos is the number 12 text analysis tool.", only="1")

The output indicated by the tutorial is: "Lexos is the number 2 text analysis tool"
However, the true output is:                  "Lexos is the number  2 text analysis tool"

The space that the "1" occupied was not removed.

mleblanc321 commented 2 years ago

@JeffreyYStewart: can you see why in the scrubber code?

JeffreyYStewart commented 2 years ago

Yes, I can, but I am not sure whether the mistake is in the documentation or the code. Do we want the digits to be removed without leaving a space, or do we want to leave the space?

mleblanc321 commented 2 years ago

fair Q ...

scottkleinman commented 2 years ago

The documentation probably needs some tweaking, but I think there is a bug in the code:

return str(re.sub(pattern, r" ", text))

There probably shouldn't be a space there. If you want to replace digits with a space, you should use scrubber.replace.digits.
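
To make the difference concrete, here is a minimal sketch rather than the actual Lexos source. The function name `remove_digits_sketch`, its `replacement` argument, and the way the pattern is built from `only` are all assumptions for illustration; only the `re.sub` call mirrors the line quoted above.

```python
import re


def remove_digits_sketch(text, only=None, replacement=""):
    """Hypothetical stand-in for scrubber.remove.digits (not the Lexos code)."""
    if only is None:
        # Assumed default: match any single digit.
        pattern = r"\d"
    else:
        # Assumed handling of `only`: accept a string or a list of digits.
        chars = only if isinstance(only, str) else "".join(only)
        pattern = f"[{re.escape(chars)}]"
    return re.sub(pattern, replacement, text)


text = "Lexos is the number 12 text analysis tool."

# Replacing with " " reproduces the reported behaviour (a doubled space).
buggy = remove_digits_sketch(text, only="1", replacement=" ")

# Replacing with "" gives the result the tutorial describes.
fixed = remove_digits_sketch(text, only="1")

print(repr(buggy))
print(repr(fixed))
```

With an empty replacement string the "1" disappears and the single space before "2" is all that remains.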

Alternatively, we could give scrubber.remove.digits a replace_with_space parameter (or something like that). This might be a useful convenience, and the user would be more likely to go to scrubber.replace.digits if they want to replace digits with something like the default value "_DIGIT_".
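
A rough sketch of what that option might look like, under the assumption that the flag defaults to plain removal. The function name `remove_digits_optional_space` is hypothetical, `replace_with_space` is only the parameter name floated here, and the pattern construction is assumed rather than taken from Lexos.

```python
import re


def remove_digits_optional_space(text, only=None, replace_with_space=False):
    """Hypothetical variant; `replace_with_space` is a proposed, unconfirmed parameter."""
    if only is None:
        # Assumed default: match any single digit.
        pattern = r"\d"
    else:
        # Assumed handling of `only`: accept a string or a list of digits.
        chars = only if isinstance(only, str) else "".join(only)
        pattern = f"[{re.escape(chars)}]"
    # Default is true removal; the space is strictly opt-in.
    replacement = " " if replace_with_space else ""
    return re.sub(pattern, replacement, text)


text = "Lexos is the number 12 text analysis tool."
removed = remove_digits_optional_space(text, only="1")
spaced = remove_digits_optional_space(text, only="1", replace_with_space=True)

print(repr(removed))
print(repr(spaced))
```

Keeping the default at `False` would mean existing calls match the tutorial output, while anyone who wants the old spacing has to ask for it explicitly.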

JeffreyYStewart commented 2 years ago

I think it would make the most sense to have scrubber.remove.digits just remove the digit and leave any form of replacing to scrubber.replace.digits.

JeffreyYStewart commented 2 years ago

Issue has been resolved.