Remove HTML tags as part of normalize_string

jknndy commented 2 months ago

As a result of conversation in #1116

Adjusts normalize_string to remove HTML tags from the output string

jayaddison commented 2 months ago

This is probably a really good idea, and I'm surprised that introducing it doesn't affect more of our test data!

Two considerations:

We process a lot of text - does this code change introduce much performance overhead? (for individual scrapes, it's probably not noticeable. but our code might get used a lot currently and in future)
Are we somehow locking-out cases where scrapers do need to access the HTML tags to produce better output? (perhaps not; and in any case normalize_string is optional)

jknndy commented 2 months ago

This is probably a really good idea, and I'm surprised that introducing it doesn't affect more of our test data!

And even more surprising was that every test it did affect was in a positive way!

We process a lot of text - does this code change introduce much performance overhead? (for individual scrapes, it's probably not noticeable. but our code might get used a lot currently and in future)

On my local machine it added .1 of a second on average over a few runs with vs w/o, so i dont think this is something worth putting to much consideration on

Are we somehow locking-out cases where scrapers do need to access the HTML tags to produce better output? (perhaps not; and in any case normalize_string is optional)

I was also thinking this could be an issue going forward but i think the benefits outweigh the possibility of issue . it seems like usually only one .replace of normalize_string is used per scraper call so for instances where it is causing issues it'd be better to hardcode the other IMO. Alternatively we could introduce this as a separate function, something like replace_html_tags, which could be the beneficial if there are other aspects of the html that should be specifically excluded but otherwise is just an overcomplication.

All that to say i am in favor of this PRs implementation but open to exploring other options if others opinions differ

jayaddison commented 2 months ago

Ok; I'm often cautious and pessimistic, as that earlier performance-overhead question indicates (I could have asked: does this improve runtime performance, because it trims so many characters from the input that the rest of the code benefits? -- but I didn't!). Anyway: I agree with your responses.

hhursev / recipe-scrapers

Remove HTML tags as part of normalize_string #1117