Closed: @giapippa closed this issue 4 years ago.
@shaochengcheng any comments on this?
Hello @giapippa, thanks for your feedback.
Firstly, a quick answer: you are right. I just became aware of this strange behavior and have marked it as a bug. I am sorry that I cannot fix it right now.
Secondly, I would like to clarify that it is not correct to use terms like '2014-01-01?00*'.
Finally, it seems that the range filter on 'date_published' works well at day
resolution. Thus an alternative way to reach your goal is to relax your query to day
resolution, e.g., [2015-01-01 TO 2015-01-02], and then refine the data with your own script.
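For example, the refinement step could be a few lines of pandas (a minimal sketch; it assumes the day-resolution results have already been loaded into a DataFrame with a `date_published` column, and the article titles here are made up for illustration):

```python
import pandas as pd

# Hypothetical results of a day-resolution query such as
# date_published:[2015-01-01 TO 2015-01-02]
df = pd.DataFrame({
    'title': ['a', 'b', 'c'],
    'date_published': ['2015-01-01T03:15:00',
                       '2015-01-01T11:45:00',
                       '2015-01-01T23:59:00'],
})
df['date_published'] = pd.to_datetime(df['date_published'])

# Refine to a one-hour window in your own script
start = pd.Timestamp('2015-01-01T11:00:00')
end = pd.Timestamp('2015-01-01T12:00:00')
refined = df[(df['date_published'] >= start) & (df['date_published'] < end)]
print(refined['title'].tolist())  # ['b']
```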
I am working on this bug now and will let you know when fixed.
Thanks Chengcheng
Hello @shaochengcheng, thanks for the remark on the regex.
As a matter of fact I was already using the day-by-day filter, as you suggest, but then I could only obtain up to 100 results per query (which did not always correspond to the number of hits). By the way, is that something you plan to change in the future (e.g., allowing a persistent connection across different chunks, maybe)?
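For reference, the day-by-day chunking I mean can be generated like this (a sketch; the query-string format simply follows the range syntax discussed above):

```python
from datetime import date, timedelta

def daily_queries(start, end):
    """Yield one day-resolution 'date_published' range filter per day in [start, end)."""
    d = start
    while d < end:
        nxt = d + timedelta(days=1)
        yield 'date_published:[{} TO {}]'.format(d.isoformat(), nxt.isoformat())
        d = nxt

queries = list(daily_queries(date(2015, 1, 1), date(2015, 1, 3)))
# ['date_published:[2015-01-01 TO 2015-01-02]',
#  'date_published:[2015-01-02 TO 2015-01-03]']
```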
Francesco
Hi @giapippa,
We have an internal ranking algorithm that ranks the queried results by both their Lucene scores and the tweets sharing them. Currently, at most 100 results are returned, which you may also notice on our frontend (hoaxy.iuni.iu.edu).
@filmenczer, we will discuss in the next meeting whether to open an option to enable the article
API to return as many results as possible for advanced users.
We discussed these issues. Regarding the issue with time searches, @shaochengcheng is looking into it.
Regarding the 100-result limit, this limit is necessary so that our back-end does not get overwhelmed and is able to respond to queries from the front-end.
From @shaochengcheng: "The bug ... is about how Lucene search and index datetime things. The current implementation uses seconds since the Epoch (Wrong), which means that we use integers to store the publication datetime. When searching, a kind of range filter is used. It is not clear why Lucene does not work as expected. If the current logic is right, then there are coding bugs in the implementation. Otherwise, we should switch to other ways to handle datetime, e.g., use datetime string directly."
From @shaochengcheng: after reviewing the code, we store datetime objects as strings with the format '%Y-%m-%dT%H%M%S'.
Hi Chengcheng,
I'm looking at this issue and it seems to be working for me from the Rapid API for datetimes in the format "date_published:[2016-10-28T01 TO 2016-12-04T00]".
I'm not sure whether I'm missing something here.
Thanks, Chathuri
Dear Chathuri,
Sorry for my late response; I needed to refresh my memory by reviewing some of the code. Let me explain the issue in more detail.
When searching over structured documents, two approaches can be considered: fuzzy full-text search across all fields, and precise search on specified fields.
With the power of Lucene, we combine these two methods in the implementation of Hoaxy: fuzzy search together with precise search on specified fields. Example:
vaccine AND title:trump
This query string would match any document that contains the word 'vaccine' in any field and has the word 'trump' in its title.
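To illustrate the semantics of such a combined query (a toy stand-in in plain Python, not the actual Lucene machinery; the documents here are made up):

```python
# Toy illustration of the "vaccine AND title:trump" semantics:
# 'vaccine' may appear in any field, while 'trump' must be in the title.
docs = [
    {'title': 'Trump on vaccines', 'content': 'vaccine debate'},
    {'title': 'Flu season', 'content': 'vaccine supplies are low'},
]

def matches(doc):
    any_field_has_vaccine = any('vaccine' in v.lower() for v in doc.values())
    title_has_trump = 'trump' in doc['title'].lower()
    return any_field_has_vaccine and title_has_trump

hits = [d['title'] for d in docs if matches(d)]
print(hits)  # ['Trump on vaccines']
```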
Things become a little bit complicated when handling datetime objects. One way to store datetime objects is as seconds since the epoch (i.e., UNIX time). However, either the user would have to calculate these values as inputs (inconvenient for users), or I would need to parse the input to detect and convert the datetime-typed field parts (too complicated). Thus I decided to use STRINGS to store and index the datetime objects.
Assuming I want to narrow the search down to a very specific item, the query string (with no other terms) could be:
date_published:[2016-09-08T11:00:00 TO 2016-09-08T12:00:00]
However, there are no matched results! Even if the query includes the whole day,
date_published:[2016-09-08T00:00:00 TO 2016-09-08T23:59:59]
it shows nothing. Only when I expanded the query to start one day earlier,
date_published:[2016-09-07T00:00:00 TO 2016-09-08T23:59:59]
did the results come out.
Moreover, the following queries:
date_published:[2016-09-08T00:00:00 TO 2016-09-08T23:59:59]
date_published:[2016-09-08T23:00:00 TO 2016-09-08T00:00:59]
show no difference in results, indicating that Lucene recognizes only the date part of the datetime string and ignores the rest.
So the question is: why does our datetime string not work as expected?
Thanks -- Chengcheng
Hi Chengcheng,
I think this is working now. What I did was update the search function: if the date_published field is included in the query, I remove it, add it back as a TermRangeQuery, and combine that with the parsed query.
This is the code.
```python
def search(self,
           query,
           n1=100,
           n2=100000,
           sort_by='relevant',
           use_lucene_syntax=False,
           min_score_of_recent_sorting=0.4,
           min_date_published=None):
    """Return the matched articles from lucene.

    Parameters
    ----------
    query : string
        The query string.
    n1 : int
        How many results are finally returned.
    n2 : int
        How many search results are returned when sorting by recent.
    sort_by : string
        {'relevant', 'recent'}, the sorting order when doing lucene searching.
    min_score_of_recent_sorting : float
        The minimum score when sorting by 'recent'.
    min_date_published : datetime
        The minimum date_published when filtering lucene searching results.

    Returns
    -------
    tuple
        (total_hits, df), where total_hits represents the total number
        of hits and df is a pandas.DataFrame object. df.columns = ['id',
        'canonical_url', 'title', 'date_published', 'domain', 'site_type',
        'score']
    """
    if min_date_published is not None:
        dt2 = datetime.utcnow()
        if isinstance(min_date_published, datetime):
            dt1 = min_date_published
        elif isinstance(min_date_published, str):
            dt1 = utc_from_str(min_date_published)
        q_dates = self.query_between_dates(dt1, dt2)
    try:
        if use_lucene_syntax is False:
            query = clean_query(query)
        q = self.mul_parser.parse(self.mul_parser, query)
        logger.warning(q)
        if 'date_published:' in query:
            # Strip the date_published clause from the query string; this
            # assumes the query has the form '<terms> AND date_published:[A TO B]'.
            end = query.find('AND date_published')
            q_without_date_published = query[:end]
            logger.warning(q_without_date_published)
            q = self.mul_parser.parse(self.mul_parser, q_without_date_published)
            # Re-add the date range as an explicit TermRangeQuery.
            date_published_splits = query.split('date_published:[')
            date_range = date_published_splits[-1]
            date_range = date_range[:-1]  # drop the closing ']'
            logger.warning(date_range)
            if 'TO' in date_range:
                date_range_splits = date_range.split('TO')
                dt1 = utc_from_str(date_range_splits[0])
                dt2 = utc_from_str(date_range_splits[1])
                query_dates = self.query_between_dates(dt1, dt2)
                q = combine_queries(q, query_dates)
        if min_date_published is not None:
            q = combine_queries(q, q_dates)
        logger.warning('Parsed query: %s', q)
    except Exception as e:
        logger.error(e)
        if use_lucene_syntax is True:
            raise APIParseError("""Error when parsing the query string! \
You are querying with lucene syntax; be careful with your query string!""")
        else:
            raise APIParseError('Error when parsing the query string!')
    cnames = [
        'id', 'canonical_url', 'title', 'date_published', 'domain',
        'site_type', 'score'
    ]
    if sort_by == 'relevant':
        top_docs = self.isearcher.search(q, n1)
        score_docs = top_docs.scoreDocs
        total_hits = top_docs.totalHits
        logger.warning(total_hits)
        if total_hits == 0:
            df = pd.DataFrame()
        else:
            records = [self.fetch_one_doc(sd) for sd in score_docs]
            # Index in each record of canonical URL and title
            canonical_url, title = 1, 2
            # Store 2-tuples of (site, article title) as keys in a dict,
            # then turn back into a list
            unique_docs = dict()
            for record in records:
                key = (record[canonical_url], record[title])
                if key not in unique_docs:
                    unique_docs[key] = record
            # Include only unique records
            records = list(unique_docs.values())
            df = pd.DataFrame(records, columns=cnames)
            df['date_published'] = pd.to_datetime(df['date_published'])
    return total_hits, df
```
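The string-manipulation part of the fix can be exercised in isolation (a sketch mirroring the splitting logic; `datetime.strptime` with a fixed format stands in for the `utc_from_str` helper, whose exact behavior is not shown here):

```python
from datetime import datetime

def extract_date_range(query):
    """Pull dt1/dt2 out of a trailing 'date_published:[A TO B]' clause."""
    splits = query.split('date_published:[')
    date_range = splits[-1][:-1]  # drop the trailing ']'
    a, b = (part.strip() for part in date_range.split('TO'))
    fmt = '%Y-%m-%dT%H:%M:%S'  # stand-in for utc_from_str
    return datetime.strptime(a, fmt), datetime.strptime(b, fmt)

q = 'vaccine AND date_published:[2016-10-28T01:00:00 TO 2016-12-04T00:00:00]'
dt1, dt2 = extract_date_range(q)
print(dt1, dt2)  # 2016-10-28 01:00:00 2016-12-04 00:00:00
```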
I did not push the changes to the master, but it is updated in the running version for testing now. Please let me know if you see any issues with the code and then I will push the changes to the master.
@chathuriw, great work. I understand your approach and believe it will work as expected. However, we are just bypassing the question of WHY HOAXY BEHAVES LIKE THAT. Is it caused by the Lucene multi-field query parser, or something else? Moreover, if we add more datetime fields, this is not a graceful solution. Anyway, it works now, and given the limited time, we can close this issue.
@shaochengcheng I pushed the changes to master after handling the case when the string contains regex (https://github.com/IUNetSci/hoaxy-backend/commit/0f1386e787de2f4ed0d12c6e28c5520dd3d1eed9)
Hello there, I found some strange behavior with the Lucene index on "date_published" (using the Hoaxy API on RapidAPI). My goal is to retrieve all articles collected by Hoaxy at the highest granularity (hours or minutes). I noticed that: 1) the term range filter has problems within a single day (so no hope of filtering on different hours of the same day); 2) the simple query has a problem with the "T" (I had to use a "?").
Maybe I misunderstood the Lucene query syntax, but I was able to find a workaround to crawl the desired articles using something like:
Francesco