Currently we don't do any special tokenizing of URLs, so they get treated as ordinary text strings, and fragments of URLs end up scattered through lots of normalized text flows.
Really, early in normalization we should just drop anything that looks like a URL, treating it effectively as a stop word (would that be sufficient, or would we need something more special?)
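As a rough sketch of the stop-word-style approach above, a simple regex pass could strip URL-like spans before the rest of normalization runs. The pattern and function name here are assumptions for illustration, not existing code, and a real pattern would likely need tuning (bare domains, trailing punctuation, etc.):

```python
import re

# Hypothetical sketch: drop URL-like tokens early in normalization,
# treating them like stop words. Matches scheme- or www-prefixed
# runs of non-whitespace characters.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)

def drop_urls(text: str) -> str:
    """Remove URL-like substrings, then collapse leftover whitespace."""
    without_urls = URL_RE.sub("", text)
    return re.sub(r"\s{2,}", " ", without_urls).strip()
```

For example, `drop_urls("see https://example.com/path?q=1 for details")` would yield `"see for details"`, so no URL fragments survive into downstream text flows.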