propublica / Capitol-Words

Scraping, parsing and indexing the daily Congressional Record to support phrase search over time, and by legislator and date
BSD 3-Clause "New" or "Revised" License

Odd results for ". . ." #84

Open nickom opened 10 years ago

nickom commented 10 years ago

http://capitolwords.org/term/._._._/

Found because it was listed as the top 5 word phrase for this date: http://capitolwords.org/date/2014/04/28/

[Screenshot, 2014-05-06, 4:35:33 pm]

drinks commented 10 years ago

Yeah, this is a known issue. We use an ngram parser similar to Google's, which treats punctuation as distinct tokens. I believe these are low-volume days that have either sequences of dots in rollcalls or similar 'table of contents' style pages. Definitely on the list.
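To illustrate how a run of dots can end up as a top phrase, here is a minimal sketch of an n-gram counter that treats each punctuation mark as its own token. This is not the Capitol Words parser: `TOKEN_RE`, `tokenize`, `ngrams`, and the sample line are all hypothetical, chosen only to mimic a 'table of contents' style line with dot leaders.

```python
import re
from collections import Counter

# Hypothetical tokenizer: a run of word characters is one token,
# and every single punctuation character is its own token.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split text into lowercase word tokens and punctuation tokens."""
    return TOKEN_RE.findall(text.lower())


def ngrams(tokens, n):
    """Return all n-token phrases, joined with single spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Made-up line in the style of a rollcall or table-of-contents entry.
line = "Mr. Smith . . . . . . . 12"
tokens = tokenize(line)

# The dot leaders produce many identical all-dot trigrams,
# so ". . ." outranks every phrase that contains real words.
top_trigram = Counter(ngrams(tokens, 3)).most_common(1)[0]
print(top_trigram)
```

On a low-volume day, a handful of lines like this would be enough for all-punctuation phrases to dominate the counts; filtering out n-grams that contain no word characters is one possible mitigation.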

nickom commented 10 years ago

Gotcha. The other thing that was so odd to me was that the highlighted examples had letters in them: [screenshot, 2014-05-06, 4:51:44 pm]

drinks commented 10 years ago

Guessing that's a separate issue, related to '.' being the regexp for 'match any character'. Code here: https://github.com/sunlightlabs/Capitol-Words/blob/2bf155cd586847ea32ed294a8a3e6997e822199e/cwod_site/cwod/views.py#L318-L332
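A minimal sketch of that guess, assuming the highlighter builds a regular expression directly from the search term. The `highlight` function and the sample sentence are hypothetical and are not taken from the linked `views.py`; the point is only the difference `re.escape` makes.

```python
import re


def highlight(term, text, escape=True):
    """Wrap each match of term in <em> tags.

    With escape=False the term is used as a raw regex, so '.' matches
    any character. With escape=True it is matched literally.
    """
    pattern = re.escape(term) if escape else term
    # A function replacement avoids backslash handling in the replacement string.
    return re.sub(pattern, lambda m: "<em>" + m.group(0) + "</em>", text)


sentence = "Yeas 52 . . . Nays 48"

# Raw pattern: each '.' is a wildcard, so a digit gets highlighted.
unescaped = highlight(". . .", sentence, escape=False)

# Escaped pattern: only the literal dots are highlighted.
escaped = highlight(". . .", sentence, escape=True)

print(unescaped)
print(escaped)
```

If the real code does interpolate the term into a pattern this way, escaping the term before compiling would keep highlights limited to the literal phrase.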

nickom commented 10 years ago

Also, shorter runs of dots are the top words for that day, and some of their links lead to server errors or 404s. Here are the links for the top words on that day:

Two words (not found): http://capitolwords.org/term/

Three words (server error): http://capitolwords.org/term/._/

Four words: http://capitolwords.org/term/._._/

Five words: http://capitolwords.org/term/._._._/