URLs should be treated specially in tokenizing

jkomoros / card-web

The web app behind thecompendium.cards

Apache License 2.0

46 stars 8 forks

URLs should be treated specially in tokenizing #619

Closed jkomoros closed 2 years ago

jkomoros commented 2 years ago

Currently we don't do any special tokenizing of URLs, so they are treated as normal text strings. As a result, parts of URLs show up in lots of normalized text flows.

Really, early in normalizing we should just drop anything that looks like a URL, effectively treating it as a stop word. (Would that be sufficient, or would we have to do something more special?)
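A minimal sketch of the "treat URLs like stop words" idea, in TypeScript. The `URL_LIKE` pattern and the `dropUrlTokens` helper are hypothetical names made up for illustration; they are not taken from card-web, and the pattern is only a rough guess at what "looks like a URL" might mean.

```typescript
// Hypothetical pattern for "things that look like URLs": tokens that start
// with http://, https://, or www. This is an assumption, not card-web's rule.
const URL_LIKE = /^(?:https?:\/\/|www\.)\S+$/i;

// Hypothetical helper: split on whitespace, then drop URL-like tokens
// the same way a stop word would be dropped.
function dropUrlTokens(text: string): string[] {
  return text
    .split(/\s+/)
    .filter((token) => token.length > 0 && !URL_LIKE.test(token));
}

// dropUrlTokens("see https://example.com/foo-bar for details")
// returns ["see", "for", "details"]
```

Filtering before any further splitting would mean no fragment of the URL can reach later normalization steps.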

jkomoros commented 2 years ago

Actually, the current behavior for URLs is mostly correct.

The main problem is that, in normalizedWords, URLs get split up on the dashes they contain.
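One way the dash problem could be avoided, sketched in TypeScript: split ordinary words on dashes, but pass URL-like tokens through whole. `LOOKS_LIKE_URL` and `splitWordsPreservingUrls` are hypothetical names for illustration and do not reflect how normalizedWords is actually implemented.

```typescript
// Hypothetical URL check; an assumption for illustration, not card-web's logic.
const LOOKS_LIKE_URL = /^(?:https?:\/\/|www\.)\S+$/i;

// Hypothetical helper: split ordinary words on dashes, but keep URL-like
// tokens whole so their pieces never show up as separate words.
function splitWordsPreservingUrls(text: string): string[] {
  const words: string[] = [];
  for (const token of text.split(/\s+/)) {
    if (token.length === 0) continue;
    if (LOOKS_LIKE_URL.test(token)) {
      // URL-like token: leave it intact.
      words.push(token);
    } else {
      // Ordinary token: split on dashes and discard empty pieces.
      words.push(...token.split("-").filter((part) => part.length > 0));
    }
  }
  return words;
}

// splitWordsPreservingUrls("well-known https://example.com/foo-bar")
// returns ["well", "known", "https://example.com/foo-bar"]
```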

jkomoros commented 2 years ago

I think that as of 40eec3e this is mostly fixed