mediacloud / backend

Media Cloud is an open source, open data platform that allows researchers to answer quantitative questions about the content of online media.
http://www.mediacloud.org
GNU Affero General Public License v3.0

sentence list not returning correct results #505

Open rahulbot opened 5 years ago

rahulbot commented 5 years ago

Noticed by a user running some Explorer queries. The following test snippet reproduces what we are seeing in the UI when clicking a date on the attention chart: sentenceList returns sentences whose text does not contain the query keywords.

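# `mc` is assumed to be an already-authenticated Media Cloud API client
keyword = "Carnegie Endowment for International Peace"
# first argument: the query; second argument: the date-range filter
matching_sentences = mc.sentenceList(u'("Carnegie Endowment for International Peace") AND (( tags_id_media:(58722749)))',
                'publish_day:[2018-10-24T00:00:00Z TO 2018-10-26T00:00:00Z]', rows=10)
for s in matching_sentences[:10]:
    # exact, case-sensitive substring check against the sentence text
    is_match = keyword in s['sentence']
    print("{}: {}".format(s['story_sentences_id'], is_match))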

The output shows that only 1 of the 10 returned sentences actually contains the search string:

24625899778: False
24625899779: True
24625899788: False
24625899798: False
24625899804: False
24625899810: False
24625899815: False
24625899816: False
24625899817: False
24625899828: False
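To rule out false negatives from capitalization or odd whitespace, the same check can be made a little more forgiving. This is only a sketch: `phrase_in_sentence` and `summarize_matches` are hypothetical helper names, not part of the Media Cloud client, and they assume each result is a dict with `sentence` and `story_sentences_id` keys, as in the snippet above.

```python
import re


def phrase_in_sentence(sentence, phrase):
    """Case-insensitive, whitespace-tolerant check that `phrase` occurs in `sentence`."""
    def normalize(text):
        # collapse runs of whitespace and lowercase before comparing
        return re.sub(r"\s+", " ", text).strip().lower()
    return normalize(phrase) in normalize(sentence)


def summarize_matches(sentences, phrase):
    """Split result ids into (matching, non_matching) lists (hypothetical helper)."""
    matching, non_matching = [], []
    for s in sentences:
        bucket = matching if phrase_in_sentence(s['sentence'], phrase) else non_matching
        bucket.append(s['story_sentences_id'])
    return matching, non_matching
```

Calling `summarize_matches(matching_sentences, keyword)` on the results above would show whether any of the `False` rows are only case or whitespace mismatches rather than genuinely unrelated sentences.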

(This is potentially the same underlying bug as #500)

rahulbot commented 5 years ago

Hal: This is happening because everything is turned into ORs
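For anyone following along, here is a toy illustration of why OR semantics would produce the results above. It is not how the backend or Solr actually evaluates the query (that is an assumption on my part), and the example sentences are made up. It only shows that if a quoted phrase is treated as an OR of its words, sentences sharing a single word with the phrase still count as matches.

```python
import re


def tokens(text):
    """Lowercased word tokens of `text`."""
    return re.findall(r"\w+", text.lower())


def matches_phrase(sentence, phrase):
    """Intended behavior: the whole phrase must appear in the sentence."""
    return phrase.lower() in sentence.lower()


def matches_any_word(sentence, phrase):
    """Toy model of the bug: the phrase is treated as word1 OR word2 OR ..."""
    sentence_words = set(tokens(sentence))
    return any(word in sentence_words for word in tokens(phrase))


phrase = "Carnegie Endowment for International Peace"
# Hypothetical sentences, invented for illustration only
examples = [
    "The Carnegie Endowment for International Peace published a report.",
    "International observers called for calm.",
    "Markets closed higher today.",
]
for sentence in examples:
    print(matches_phrase(sentence, phrase), matches_any_word(sentence, phrase), sentence)
```

The second sentence is the interesting case: it fails the phrase check but passes the OR check, which is the same pattern as the nine `False` rows in the output above.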