scrapinghub / shublang

Pluggable DSL that uses pipes to perform a series of linear transformations to extract data
BSD 3-Clause "New" or "Revised" License
15 stars, 8 forks

Make unidecode optional in sanitize methods #67

Open VMRuiz opened 3 years ago

VMRuiz commented 3 years ago

Allow sanitizing text from non-English websites without losing data.

This change is not backward compatible, as I believe this should be the default behavior in most cases.

VMRuiz commented 3 years ago

Maybe unidecode could be a pipe of its own?

I think sanitize | unidecode (or sanitize | ascii_safe) is more readable than sanitize(ascii_safe=False)

Yes, I think your approach is actually better.

VMRuiz commented 3 years ago

Maybe unidecode could be a pipe of its own?

I think sanitize | unidecode (or sanitize | ascii_safe) is more readable than sanitize(ascii_safe=True)

I have implemented the method ascii_safe. I tried naming it unidecode, but there seemed to be a name collision between the shublang method name and the unidecode function itself.

If you are able to fix it, we could use unidecode instead. I don't really have a strong opinion on which one is better.
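For illustration, here is one possible way around the name collision described above. It is a sketch, not shublang's actual code. It assumes the collision comes from a module-level function called unidecode shadowing the imported function of the same name, so the import is aliased. The names `_to_ascii` and `run_pipeline`, and the simplified `sanitize`, are hypothetical stand-ins. The fallback branch only strips accents and is not a full transliteration.

```python
import re
import unicodedata

try:
    # Aliasing the import keeps the third-party function reachable even though
    # this module defines its own function called `unidecode` further down.
    from unidecode import unidecode as _to_ascii
except ImportError:
    # Fallback so the sketch runs without the package installed (assumption:
    # accent stripping is enough here; it is not full transliteration).
    def _to_ascii(text):
        decomposed = unicodedata.normalize("NFKD", text)
        return decomposed.encode("ascii", "ignore").decode("ascii")


def sanitize(text):
    """Hypothetical stand-in: collapse whitespace, keep non-ASCII characters."""
    return re.sub(r"\s+", " ", text).strip()


def unidecode(text):
    """Pipe-style step named `unidecode`; calls the aliased import, not itself."""
    return _to_ascii(text)


def run_pipeline(value, *steps):
    """Apply each step in order, mimicking `sanitize | unidecode`."""
    for step in steps:
        value = step(value)
    return value
```

With this layout, `run_pipeline(text, sanitize)` keeps accented characters, while `run_pipeline(text, sanitize, unidecode)` opts in to ASCII conversion as a separate step, which is the composition suggested in the review.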