ukwa / webarchive-discovery

WARC and ARC indexing and discovery tools.
https://github.com/ukwa/webarchive-discovery/wiki

Add generic hashtag support #266

Open tokee opened 2 years ago

tokee commented 2 years ago

Hashtags are universal to the net, so parsing all text for #LooksLikeAHashtag and adding the matches to the keywords field seems like an obvious feature.

We need to ensure that the tags are extracted before the text is stripped of punctuation and special characters, and we need to agree on what a hashtag looks like. Are these all hashtags?

So valid characters and maximum length, I guess?
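
To make the open questions concrete, here is a minimal sketch of what extraction on the raw text (before any punctuation stripping) could look like. Everything in it is an assumption for discussion, not existing webarchive-discovery code: the function name `extract_hashtags`, the 100-character limit, treating letters, digits and underscore as the valid characters, skipping all-digit tags such as issue references, and ignoring a `#` that directly follows a word character, `&`, `#` or `/` (to avoid HTML entities and URL fragments). It is written in Python only for brevity; the same pattern should carry over to a Java regex.

```python
import re

# Assumed upper bound; the maximum length is still an open question.
MAX_TAG_LENGTH = 100

# '#' followed by word characters (Unicode letters, digits, underscore).
# The lookbehind rejects a '#' that directly follows a word character,
# '&', '#' or '/', e.g. "a#b", "&#39;", "##x" or "http://host/#top".
HASHTAG_PATTERN = re.compile(r"(?<![\w&#/])#(\w+)")


def extract_hashtags(text, max_length=MAX_TAG_LENGTH):
    """Return the distinct hashtags in text, without '#', in order of appearance."""
    tags = []
    seen = set()
    for match in HASHTAG_PATTERN.finditer(text):
        tag = match.group(1)
        if len(tag) > max_length:
            continue  # too long to be a plausible tag (assumed limit)
        if tag.isdigit():
            continue  # all digits, e.g. an issue reference like "#266"
        key = tag.lower()  # de-duplicate case-insensitively, keep first spelling
        if key not in seen:
            seen.add(key)
            tags.append(tag)
    return tags


if __name__ == "__main__":
    sample = "Parsing #LooksLikeAHashtag, issue #266 and #web_archive! a#b &#39;"
    print(extract_hashtags(sample))
```

Whether tags should be lower-cased before they go into the keywords field, and whether characters such as `-` should be allowed inside a tag, are choices this sketch deliberately leaves out.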