Open nickom opened 10 years ago
Yeah, this is a known issue. We use an ngram parser similar to Google's, which treats punctuation as distinct tokens. I believe these are low-volume days that have either sequences of dots in rollcalls or similar 'table of contents' style pages. Definitely on the list.
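For anyone following along, here is a minimal sketch of how that kind of tokenization can make a run of dots the top phrase. This is not the actual Capitol Words parser; `tokenize`, `ngrams`, and the sample line are made up for illustration, and the only assumption carried over is that each punctuation mark becomes its own token.

```python
import re
from collections import Counter

def tokenize(text):
    # Hypothetical tokenizer: runs of letters/digits/apostrophes are words,
    # and every other non-space character is its own token.
    return re.findall(r"[A-Za-z0-9']+|[^\sA-Za-z0-9']", text)

def ngrams(tokens, n):
    # Join each window of n consecutive tokens with single spaces.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# A made-up 'table of contents' style line with a dot leader.
line = "Mr. Smith . . . . . . . . 12"
counts = Counter(ngrams(tokenize(line), 5))
top_phrase, top_count = counts.most_common(1)[0]
print(repr(top_phrase), top_count)
```

On a low-volume day, a handful of lines like this would be enough for the all-dots phrase to outrank every real five-word phrase.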
Gotcha. The other thing that was so odd to me was that the highlighted examples had letters in them:
Guessing that's a separate issue related to '.' being the regexp metacharacter for 'match any character'; code here: https://github.com/sunlightlabs/Capitol-Words/blob/2bf155cd586847ea32ed294a8a3e6997e822199e/cwod_site/cwod/views.py#L318-L332
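To illustrate the guess, here is a small sketch of the difference between using the phrase as a raw pattern and escaping it first. I have not confirmed this is what the linked view does; `highlight` and the sample text are hypothetical, and `re.escape` is the standard-library call for treating a string literally.

```python
import re

def highlight(text, phrase, escape=True):
    # Hypothetical highlighter: wraps each match of `phrase` in <em> tags.
    # With escape=False the phrase is used as a raw regexp, so '.' matches
    # any character; with escape=True it only matches a literal dot.
    pattern = re.escape(phrase) if escape else phrase
    return re.sub(pattern, lambda m: "<em>" + m.group(0) + "</em>", text)

sample = "abc . . def"
unescaped = highlight(sample, ". .", escape=False)
escaped = highlight(sample, ". .", escape=True)
print(unescaped)
print(escaped)
```

The unescaped version highlights spans containing letters, which would match the odd examples described above; the escaped version highlights only the literal dots.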
Also, shorter runs of dots are the top phrases at the other lengths, and their links lead to server errors or 404s. Here are the links for the top phrases on that day:
Two words (not found): http://capitolwords.org/term/
Three words (server error): http://capitolwords.org/term/._/
Four words: http://capitolwords.org/term/._._/
Five words: http://capitolwords.org/term/._._._/
Found because it was listed as the top five-word phrase for this date: http://capitolwords.org/date/2014/04/28/