hbz / lobid-gnd

UI and API to the Integrated Authority File (Gemeinsame Normdatei, GND)
http://lobid.org/gnd
Eclipse Public License 2.0

exclude dateModified from search #257

Closed rettinghaus closed 4 years ago

rettinghaus commented 4 years ago

Obviously the field dcterms:modified (resp. dateModified) is indexed. Is this intentional? I find it rather confusing in searching, especially because this field isn't shown in the HTML output.

Example: looking up "heinrich ida 2020" shows results that have nothing to do with "2020", except that they have been modified this year.

If there are no compelling reasons for this, I'd recommend removing the field from search.
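
The effect described above can be reproduced with a rough sketch. It assumes that the free-text search runs against a catch-all field built from every indexed value, that timestamps stored as text are split into tokens at punctuation, and that all query terms must match. The record, its values, and the helpers `tokenize`, `catch_all_tokens` and `matches` are made up for illustration and only approximate what the real index does.

```python
import re


def tokenize(value):
    """Rough stand-in for a text analyzer: lowercase, split at non-alphanumerics."""
    return [t for t in re.split(r"[^0-9a-z]+", value.lower()) if t]


# Hypothetical record: nothing a user sees in the HTML view mentions 2020.
record = {
    "preferredName": "Heinrich, Ida",
    "dateModified": "2020-07-23T10:15:00.000",
}


def catch_all_tokens(rec, exclude=()):
    """Pool the tokens of every field that is not explicitly excluded."""
    tokens = set()
    for field, value in rec.items():
        if field not in exclude:
            tokens.update(tokenize(value))
    return tokens


def matches(query, rec, exclude=()):
    """Assumed semantics: every query term must occur in the pooled tokens."""
    pool = catch_all_tokens(rec, exclude)
    return all(term in pool for term in tokenize(query))


hit_with_date = matches("heinrich ida 2020", record)
hit_without_date = matches("heinrich ida 2020", record, exclude=("dateModified",))
print(hit_with_date, hit_without_date)
```

In this sketch the record only matches because "2020" is a token of the modification timestamp; leaving that field out of the pool removes the hit.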

acka47 commented 4 years ago

Thanks for the pointer, @rettinghaus. We will look into this ASAP. It might take some more time, though, as @fsteeg is on vacation until August.

acka47 commented 4 years ago

We had a similar issue in NWBib, where a field had to be excluded from the "all fields" search, see https://github.com/hbz/nwbib/issues/110. Here is the PR with which @fsteeg solved this: https://github.com/hbz/lobid/pull/198

Maybe it makes sense to use a similar solution in this case.

dr0i commented 4 years ago

The solution was to switch from the allquery to a specialized word query, see https://github.com/hbz/nwbib/commit/330019483571ccc4450bca3db8a728fcb3d7cac6. I hope we can solve the issue in another way.

dr0i commented 4 years ago

There are several ways; @acka47, please choose:

  1. Use "dateModified": {"type": "date"} instead of "type": "text". This prevents the date string (e.g. 2016-05-26T17:19:48.000) from being split into tokens at index time, so a query for 2016 would no longer match. Note that this change would also make a query like _search?q=describedBy.dateModified:2016 return zero hits (it works at the moment); one would have to use the exact whole string. So is this a breaking change?
  2. Use "dateModified": {"include_in_all": false}. This keeps dateModified out of the _all field when it is built at index time.
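
The two options above can be written out as mapping fragments. This is only a sketch: the nesting under `describedBy` follows the field path used in the queries in this thread, while the `nest` helper and the exact layout of the real index settings file are assumptions. `include_in_all` is only available in Elasticsearch versions that still provide the `_all` field.

```python
import json

# Option 1: type the field as a date, so the timestamp is indexed as one value
# instead of being split into text tokens.
option_date = {"dateModified": {"type": "date"}}

# Option 2: keep the field as text, but leave it out of the catch-all _all field.
option_exclude = {"dateModified": {"type": "text", "include_in_all": False}}


def nest(field_mapping, parent="describedBy"):
    """Wrap a field mapping in the usual 'properties' nesting (hypothetical helper)."""
    return {"properties": {parent: {"properties": field_mapping}}}


mapping_option_1 = nest(option_date)
mapping_option_2 = nest(option_exclude)

print(json.dumps(mapping_option_1, indent=2))
print(json.dumps(mapping_option_2, indent=2))
```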

acka47 commented 4 years ago

+1 for 1.) as long as range queries still work. I think @hagbeck is the only one using this field for updating a search index. @hagbeck, do you have a problem with typing this as date in the index profile?

hagbeck commented 4 years ago

We are currently using YYYYMMDD formatted dates in the field based search for dateModified.

Example:

http://lobid.org/resources/search?q=describedBy.dateModified:>20200725+OR+describedBy.dateCreated:>20200725&owner=DE-290&format=bulk

For us it would be possible to change the code easily to

http://lobid.org/resources/search?q=describedBy.dateModified:>2020-07-25T00:00:00.001+OR+describedBy.dateCreated:>2020-07-25T23:59:59.999&owner=DE-290&format=bulk

if necessary.
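
A query URL like the first one above can be assembled programmatically so that `:`, `>` and spaces are escaped consistently. This is a minimal sketch using only the Python standard library; the helper name `changed_since_url` is made up, and the base URL, owner and format values are simply copied from the example.

```python
from urllib.parse import urlencode


def changed_since_url(day, owner="DE-290",
                      base="http://lobid.org/resources/search"):
    """Build a bulk query for records created or modified after `day` (YYYYMMDD)."""
    q = f"describedBy.dateModified:>{day} OR describedBy.dateCreated:>{day}"
    # urlencode percent-encodes ':' and '>' and turns spaces into '+'.
    return base + "?" + urlencode({"q": q, "owner": owner, "format": "bulk"})


url = changed_since_url("20200725")
print(url)
```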

acka47 commented 4 years ago

> We are currently using YYYYMMDD formatted dates in the field based search for dateModified.

These should also work after the change, otherwise we should implement it in another way. @dr0i , please check if these queries will still work.

dr0i commented 4 years ago

@acka47 no, these won't work anymore. So we go with the second approach, yes?

acka47 commented 4 years ago

I got confused. As this issue is about lobid-gnd it won't affect @hagbeck. Sorry for the superfluous ping.

In lobid-gnd these kinds of searches do not work, while in lobid-resources dateModified is already typed as date. This means that this change will be an improvement to the GND API. So please continue with it, @dr0i.

(As a side note, the date properties from the GND ontology are all typed as keyword, see https://github.com/hbz/lobid-gnd/blob/a9bba80a23e26a4c812964424b6c89457e4a3103/conf/index-settings.json#L63-L94. See this commit and related issue #149 for background: https://github.com/hbz/lobid-gnd/commit/15e93bd24c3491cea4e478f5de6fc478487804ca. I think this is because not all values conform to a date format.)

dr0i commented 4 years ago

Oops, they ARE working (I forgot to escape the query). But those queries work in a rather unpredictable way, e.g. http://lobid.org/gnd/search?q=describedBy.dateModified%3A%3E30009.

Also note that @hagbeck refers to lobid-resources (which uses a different date format than lobid-gnd), so we are safe either way.
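
One possible explanation for the unpredictable range results, stated here as an assumption rather than a confirmed diagnosis: a range over a field indexed as text compares terms as strings, not as points in time. The sketch below contrasts the two kinds of comparison in plain Python; the stored value and the token "48" are hypothetical.

```python
from datetime import datetime

stored = "2020-07-23T10:15:00.000"  # hypothetical timestamp value
bound = "30009"                     # bound taken from the example query


def as_date(value):
    """Parse a timestamp of the form used in this thread."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%f")


# String comparison goes character by character: "2" sorts before "3",
# so the full timestamp is not "greater than" 30009 ...
timestamp_beats_bound = stored > bound

# ... while an isolated token such as a seconds value is, because "4" > "3".
token_beats_bound = "48" > bound

# With real dates the comparison follows the calendar.
date_beats_bound = as_date(stored) > as_date("2020-07-01T00:00:00.000")

print(timestamp_beats_bound, token_beats_bound, date_beats_bound)
```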

acka47 commented 4 years ago

> those queries work in a rather unpredictable way, e.g. http://lobid.org/gnd/search?q=describedBy.dateModified%3A%3E30009.

Yes, I cannot even query by a specific day and get only results for resources modified on that day. I just tried it out when checking whether updates work. E.g. https://lobid.org/gnd/search?q=describedBy.dateModified%3A2020-07-23&size=100&format=html does not give back entries modified on 2020-07-23.

A phrase query doesn't give back any results at all: https://lobid.org/gnd/search?q=describedBy.dateModified%3A%222020-07-23%22

This is too bad and another reason to set this as date in the index profile.

dr0i commented 4 years ago

Deployed to staging. As this new index is based on the new base dump from 2020-06-22 and updates have been received up to today, #258 is fixed as a side effect, too.

acka47 commented 4 years ago

Works fine, e.g.

acka47 commented 4 years ago

Furthermore, https://github.com/hbz/lobid-gnd/issues/255 is resolved with this one. Three at one stroke. Wow.

dr0i commented 4 years ago

Note: it was not one of the two solutions in https://github.com/hbz/lobid-gnd/issues/257#issuecomment-664210752 that solved the issue, but both of them together.

As this issue is resolved and in production: closing.