Closed jknndy closed 2 months ago
This is probably a really good idea, and I'm surprised that introducing it doesn't affect more of our test data!
Two considerations:
normalize_string
is optional)This is probably a really good idea, and I'm surprised that introducing it doesn't affect more of our test data!
And even more surprising was that every test it did affect was in a positive way!
- We process a lot of text - does this code change introduce much performance overhead? (for individual scrapes, it's probably not noticeable. but our code might get used a lot currently and in future)
On my local machine it added .1 of a second on average over a few runs with vs w/o, so i dont think this is something worth putting to much consideration on
- Are we somehow locking-out cases where scrapers do need to access the HTML tags to produce better output? (perhaps not; and in any case
normalize_string
is optional)
I was also thinking this could be an issue going forward but i think the benefits outweigh the possibility of issue . it seems like usually only one .replace
of normalize_string is used per scraper call so for instances where it is causing issues it'd be better to hardcode the other IMO. Alternatively we could introduce this as a separate function, something like replace_html_tags
, which could be the beneficial if there are other aspects of the html that should be specifically excluded but otherwise is just an overcomplication.
All that to say i am in favor of this PRs implementation but open to exploring other options if others opinions differ
Ok; I'm often cautious and pessimistic, as that earlier performance-overhead question indicates (I could have asked: does this improve runtime performance, because it trims so many characters from the input that the rest of the code benefits? -- but I didn't!). Anyway: I agree with your responses.
As a result of conversation in #1116
Adjusts
normalize_string
to remove HTML tags from the output string