osome-iu / hoaxy-backend

Backend component for Hoaxy, a tool to visualize the spread of claims and fact checking
http://hoaxy.iuni.iu.edu/
GNU General Public License v3.0

GET Articles (Hoaxy API): strange behavior on "date_published" query filter #28

Closed giapippa closed 4 years ago

giapippa commented 5 years ago

Hello there, I found some strange behavior with the Lucene index on "date_published" (using the Hoaxy API on RapidAPI). My goal is to retrieve all articles collected by Hoaxy at the highest granularity (hours or minutes). I noticed that: 1) the term range filter has problems within the same day (so there is no hope of filtering on different hours of the same day); 2) the simple query has a problem with the "T" (I had to use a "?").

Maybe I misunderstood the Lucene query syntax, but I was able to find a workaround to crawl the desired articles using something like the queries in the screenshots below.

Francesco

TWO NEGATIVE EXAMPLES:

[Screenshot 2019-03-20 at 18:25:19]
[Screenshot 2019-03-20 at 18:23:14]

TWO POSITIVE EXAMPLES:

[Screenshot 2019-03-20 at 18:23:25]
[Screenshot 2019-03-20 at 18:22:12]
[Screenshot 2019-03-20 at 18:31:40]
filmenczer commented 5 years ago

@shaochengcheng any comments on this?

shaochengcheng commented 5 years ago

Hello @giapippa, thanks for your feedback.

Firstly, a quick answer: you are right. I just became aware of this strange behavior and have marked it as a bug. I am sorry that I cannot fix it right now.

Secondly, I would like to clarify that it is not correct to use terms like '2014-01-01?00*'.

  1. The field 'date_published' is stored and indexed as a string in the format 'yyyy-mm-ddThh:mm:ss', which is one of the standard DateTime formats.
  2. When filtering a field by a range, Lucene compares terms in lexicographical order (like a dictionary). The term '2014-01-01?00*' is treated as a plain string, in which each wildcard is just an ordinary character with no special function. Please check the Lucene query syntax documentation.
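The lexicographical point can be illustrated in plain Python (illustrative values only, not Hoaxy code): inside a term range, '?' and '*' are just characters, and their ASCII codes decide the ordering.

```python
# In a string (lexicographic) comparison, '?' and '*' are ordinary
# characters, not wildcards; their ASCII codes decide the ordering.
lower = "2014-01-01?00*"          # intended as a wildcard pattern
stored = "2014-01-01T00:30:00"    # how date_published is stored

# '?' (0x3F) sorts before 'T' (0x54), so the "pattern" string is
# simply less than the stored value; no wildcard matching happens.
print(lower < stored)          # True
print(ord("?"), ord("T"))      # 63 84
```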

Finally, it seems that the range filter on 'date_published' works well at day resolution. An alternative way to reach your goal is therefore to relax your query to day resolution, e.g., [2015-01-01 TO 2015-01-02], and then refine the data with your own script.
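The suggested refinement step can be sketched as a small client-side post-filter (a sketch only; the article dicts mimic the API's response shape, and the fetch step is omitted):

```python
from datetime import datetime

def filter_by_hour(articles, start, end):
    """Keep only articles whose date_published falls in [start, end).

    `articles` is a list of dicts like the articles endpoint returns;
    date_published strings look like '2016-09-08T11:58:32.000Z'.
    """
    fmt = "%Y-%m-%dT%H:%M:%S"
    kept = []
    for a in articles:
        # Drop the fractional seconds / 'Z' suffix before parsing.
        ts = datetime.strptime(a["date_published"][:19], fmt)
        if start <= ts < end:
            kept.append(a)
    return kept

# Example: query [2016-09-08 TO 2016-09-09] at day resolution, then
# narrow down to the 11:00-12:00 window locally.
articles = [{"date_published": "2016-09-08T11:58:32.000Z"},
            {"date_published": "2016-09-08T14:02:10.000Z"}]
hour = filter_by_hour(articles,
                      datetime(2016, 9, 8, 11), datetime(2016, 9, 8, 12))
print(len(hour))  # 1
```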

I am working on this bug now and will let you know when it is fixed.

Thanks Chengcheng

giapippa commented 5 years ago

Hello @shaochengcheng, thanks for the remark on the wildcard syntax.

As a matter of fact, I was already using the day-by-day filter -- as you suggest -- but then I could only obtain up to 100 results per query (which did not always correspond to the total number of hits). By the way, is this something you plan to change in the future (e.g., allowing a persistent connection that returns the results in chunks, maybe)?
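One way to live with a per-query result cap is to split a long interval into one-day queries. A minimal sketch of building the per-day clauses (the actual HTTP request is omitted, since endpoint details vary):

```python
from datetime import date, timedelta

def day_ranges(start, end):
    """Yield (day, next_day) ISO-date pairs covering [start, end)
    one day at a time, suitable for building
    date_published:[d1 TO d2] query clauses."""
    d = start
    while d < end:
        nxt = d + timedelta(days=1)
        yield d.isoformat(), nxt.isoformat()
        d = nxt

# Each pair becomes one query, keeping every response small.
clauses = [f"date_published:[{a} TO {b}]"
           for a, b in day_ranges(date(2016, 9, 7), date(2016, 9, 10))]
print(clauses[0])   # date_published:[2016-09-07 TO 2016-09-08]
print(len(clauses)) # 3
```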

Francesco

shaochengcheng commented 5 years ago

Hi @giapippa,

We have an internal ranking algorithm that ranks the queried results by both their Lucene scores and the number of sharing tweets. Currently, at most 100 results are returned, which you may also notice on our frontend (hoaxy.iuni.iu.edu).

@filmenczer, we will discuss in the next meeting whether to add an option that lets the article API return as many results as possible for advanced users.

filmenczer commented 5 years ago

We discussed these issues. Regarding the issue with time searches, @shaochengcheng is looking into it.

Regarding the 100-result limit, this limit is necessary so that our back-end does not get overwhelmed and is able to respond to queries from the front-end.

filmenczer commented 4 years ago

From @shaochengcheng: "The bug ... is about how Lucene search and index datetime things. The current implementation uses seconds since the Epoch (Wrong), which means that we use integers to store the publication datetime. When searching, a kind of range filter is used. It is not clear why Lucene does not work as expected. If the current logic is right, then there are coding bugs in the implementation. Otherwise, we should switch to other ways to handle datetime, e.g., use datetime string directly."

From @shaochengcheng: after reviewing the code, we store the datetime object as a string with the format '%Y-%m-dT%H%M%S'.

shaochengcheng commented 4 years ago

@chathuriw, here is our discussion.

Message from Chathuri

Hi Chengcheng,

I'm looking at this issue, and it seems to be working for me through the RapidAPI for datetimes in the format "date_published:[2016-10-28T01 TO 2016-12-04T00]".

I'm not sure whether I'm missing something here.

[Screenshot: Lucene query results]

Thanks, Chathuri

Message from Chengcheng

Dear Chathuri,

Sorry for my late response; I needed to refresh my memory by reviewing some code. Let me explain the issue in more detail.

  1. The current implementation

When searching on structured documents, two ways can be considered:

  1. search on specified fields: we know which parts of the structured document are going to be searched, and it would be very precise;
  2. fuzzy search: search on all indexed fields.

With the power of Lucene, we combine these two methods in the Hoaxy implementation: fuzzy search together with very precise search on specified fields. Example:

vaccine AND title:trump

This query string matches documents where any field contains the word 'vaccine' and the title contains the word 'trump'.

Things become a little more complicated when handling datetime objects. One way to store datetime objects is as seconds since the epoch (i.e., UNIX time). However, either the user would have to calculate these values as inputs (inconvenient for users), or I would need to parse the input to detect and convert the datetime-typed parts (too complicated). Thus I decided to use a STRING to store and index the datetime objects.
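The two storage options can be contrasted in a few lines (illustrative only; the variable names are not from the Hoaxy code):

```python
import calendar
from datetime import datetime

dt = datetime(2016, 9, 8, 11, 58, 32)

# Option 1: epoch seconds -- compact and naturally ordered as
# integers, but opaque to users writing query strings by hand.
epoch = calendar.timegm(dt.timetuple())

# Option 2: ISO-style string -- human-readable in queries, and still
# ordered correctly under plain lexicographic comparison.
iso = dt.strftime("%Y-%m-%dT%H:%M:%S")

print(epoch)  # 1473335912
print(iso)    # 2016-09-08T11:58:32
```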

  2. The problem

The stored date_published field is a string that looks like 'yyyy-mm-ddThh:mm:ss'. As Lucene documents, a range search should return all values between the lower and upper bounds specified by the range query. However, Hoaxy behaves strangely. For example, for the first article in our Hoaxy Endpoint Article Example, the structured date_published part is "date_published":"2016-09-08T11:58:32.000Z"

Assuming I want to narrow the search down to this very specific item, the query string (with nothing else in it) could be,

date_published:[2016-09-08T11:00:00 TO 2016-09-08T12:00:00]

However, there are no matched results! Even if the query includes the whole day,

date_published:[2016-09-08T00:00:00 TO 2016-09-08T23:59:59]

it shows nothing. Only when I expand the query to one day before,

date_published:[2016-09-07T00:00:00 TO 2016-09-08T23:59:59]

do the results come out.

Moreover, the following queries:

date_published:[2016-09-08T00:00:00 TO 2016-09-08T23:59:59]
date_published:[2016-09-08T23:00:00 TO 2016-09-08T00:00:59]

show no difference in results, indicating that Lucene only recognizes the date part of the datetime string and ignores everything else.

So the question is: why does our datetime string not work as expected?
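As a sanity check, plain string comparison does place the stored value inside both of the failing ranges, so the surprise is not in the ordering of the strings themselves (a quick Python sketch):

```python
stored = "2016-09-08T11:58:32"  # date_published of the example article

ranges = [
    ("2016-09-08T11:00:00", "2016-09-08T12:00:00"),
    ("2016-09-08T00:00:00", "2016-09-08T23:59:59"),
]
# Lexicographic comparison agrees with chronological order here, so a
# string range query over these bounds should match the stored value.
checks = [lo <= stored <= hi for lo, hi in ranges]
print(checks)  # [True, True]
```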

Thanks -- Chengcheng

chathuriw commented 4 years ago

Hi Chengcheng,

I think this is working now. What I did was update the search function so that, if the date_published field is included in the query, it is removed, added back as a TermRangeQuery, and combined with the parsed query.

This is the code.

    def search(self,
               query,
               n1=100,
               n2=100000,
               sort_by='relevant',
               use_lucene_syntax=False,
               min_score_of_recent_sorting=0.4,
               min_date_published=None):
        """Return the matched articles from lucene.

        Parameters
        ----------
        query : string
            The query string.
        n1 : int
            How many result finally returned.
        n2 : int
            How many search results returned when sort by recent.
        sort_by : string
            {'relevant', 'recent'}, the sorting order when doing lucene searching.
        min_score_of_recent_sorting : float
            The min score when sorting by 'recent'.
        min_date_published : datetime
            The min date_published when filtering lucene searching results.

        Returns
        -------
        tuple
            (total_hits, df), where total_hits represents the total number
            of hits and df is a pandas.DataFrame object. df.columns = ['id',
            'canonical_url', 'title', 'date_published', 'domain', 'site_type',
            'score']
        """
        if min_date_published is not None:
            dt2 = datetime.utcnow()
            if isinstance(min_date_published, datetime):
                dt1 = min_date_published
            elif isinstance(min_date_published, str):
                dt1 = utc_from_str(min_date_published)
            q_dates = self.query_between_dates(dt1, dt2)
        try:
            if use_lucene_syntax is False:
                query = clean_query(query)
            q = self.mul_parser.parse(self.mul_parser, query)
            logger.warning(q)
            if 'date_published:' in query:
                end = query.find('AND date_published')
                q_without_date_published = query[:end]
                logger.warning(q_without_date_published)
                q = self.mul_parser.parse(self.mul_parser, q_without_date_published)
                date_published_splits = query.split('date_published:[')
                date_range = date_published_splits[-1]
                date_range = date_range[:-1]
                logger.warning(date_range)
                if 'TO' in date_range:
                    date_range_splits = date_range.split('TO')
                    dt1 = utc_from_str(date_range_splits[0].strip())
                    dt2 = utc_from_str(date_range_splits[1].strip())
                    query_dates = self.query_between_dates(dt1, dt2)
                    q = combine_queries(q, query_dates)
            if min_date_published is not None:
                q = combine_queries(q, q_dates)
            logger.warning('Parsed query: %s', q)
        except Exception as e:
            logger.error(e)
            if use_lucene_syntax is True:
                raise APIParseError("""Error when parsing the query string! \
You are querying with lucene syntax, be careful of your query string!""")
            else:
                raise APIParseError('Error when parsing the query string!')

        cnames = [
            'id', 'canonical_url', 'title', 'date_published', 'domain',
            'site_type', 'score'
        ]
        if sort_by == 'relevant':
            top_docs = self.isearcher.search(q, n1)
            score_docs = top_docs.scoreDocs
            total_hits = top_docs.totalHits
            logger.warning(total_hits)
            if total_hits == 0:
                df = pd.DataFrame()
            else:
                records = [self.fetch_one_doc(sd) for sd in score_docs]

                # Index in each record of canonical URL and title
                canonical_url, title = 1, 2
                # Store 2-tuples of (site, article title) as keys in dict then
                # turn back to list
                unique_docs = dict()
                for record in records:
                    key = (record[canonical_url], record[title])
                    if key not in unique_docs:
                        unique_docs[key] = record
                # Include only unique records
                records = list(unique_docs.values())
                df = pd.DataFrame(records, columns=cnames)
                df['date_published'] = pd.to_datetime(df['date_published'])
            return total_hits, df

I did not push the changes to master, but they are deployed in the running version for testing now. Please let me know if you see any issues with the code, and then I will push the changes to master.
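The date_published preprocessing in the snippet can be isolated as a small pure-Python helper (a sketch mirroring the logic above, not the committed code; like the snippet, it assumes the date clause is preceded by "AND"):

```python
def split_date_clause(query):
    """Split a query like 'trump AND date_published:[A TO B]' into
    (rest_of_query, (A, B)); returns (query, None) if no date clause."""
    marker = "date_published:["
    if marker not in query:
        return query, None
    # Everything before the 'AND date_published' clause is the text query.
    rest = query[:query.find("AND date_published")].strip()
    # Extract the bracketed range and split it on 'TO'.
    date_range = query.split(marker)[-1].rstrip("]")
    lo, hi = (s.strip() for s in date_range.split("TO"))
    return rest, (lo, hi)

q, rng = split_date_clause(
    "vaccine AND date_published:[2016-09-08T11:00:00 TO 2016-09-08T12:00:00]")
print(q)    # vaccine
print(rng)  # ('2016-09-08T11:00:00', '2016-09-08T12:00:00')
```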

shaochengcheng commented 4 years ago

@chathuriw, great work. I understand your approach and believe it will work as expected. However, it just bypasses the question of WHY HOAXY BEHAVES LIKE THAT. Is it caused by the Lucene Multiparser, or something else? Moreover, if we add more datetime fields, it is not a graceful solution. Anyway, it works now, and given our limited time, we can close this issue.

chathuriw commented 4 years ago

@shaochengcheng I pushed the changes to master after handling the case when the string contains regex (https://github.com/IUNetSci/hoaxy-backend/commit/0f1386e787de2f4ed0d12c6e28c5520dd3d1eed9)