Precalculation of suggestions

anlausch commented 6 years ago

For the precalculation of suggestions, there are several decisions to be made. Currently, I need opinions on the following questions.

When to generate the suggestions? My impression so far was that everybody agrees on calculating them directly after a resource was created via /saveResource. Would that be okay? Or do we need a trigger somewhere else?
How to generate the suggestions ? Here, we will use the mechanisms we have at hand. But, how should I generate the query string for each bibliographic resource? @lgalke How exactly do you do it in the front end? How many suggestions do we want to save, i.e. what is the specific threshold?
Where and how to store the suggestions for a specific bibliographic entry? My first idea was to directly store it in the particular bibliographicEntry in a special property, e.g. suggestions. This would work for sure, but then we might want to delete the suggestions later on, because then the entry is linked and the suggestions might just be overhead. A second idea is to have a special document collection ("tables" in mongodb) for suggestions. We then have to link to them via the identifiers of the bibliographic resources.
What happens to the precalculated suggestions after an entry was linked? This is maybe something we might want to think about too.

lgalke commented 6 years ago

Hey, we can also have a discussion on that topic. My first impressions:

After saveResource sounds reasonable to me.
For working with precalculated suggestions, it seems to be mandatory to use the same query generation strategy. We can think about unifying it, e.g. shift the code to the back-end and I call it as a service. This way, we would not have to change it in two places when there is a request like including doi ;). Regarding the threshold, probably its best to select a fixed amount (like 3 or 5) per-source.
Attaching the suggestions to the entries is probably not the best solution because it would lead to special treatment after the entry has been resolved and also redundancies. We should keep in mind that multiple entries can (and should) lead to the exact same query. What about using the generated query as a primary key for the lookup table? This would also reduce the load, when e.g. a query-string already has precalculated suggestions, we do not need to retrieve them again.
If we decide for a lookup table from query_string to ext_suggestions, they may become valid again at some point. We still might want to 'flush' this cache at some point, but it might be not the most important thing right now how we do that exactly. For instance you could allocate space for 10000 distinct query strings and when you roll over 10000, start overwriting from the beginning. That would at least prevent it from filling the storage completely.

kleinann commented 6 years ago

I can only comment on 2. It sounds reasonable to use only one query generation strategy, as Lukas says. What exactly is it now?

If DOI exists, search for DOI only
if no DOI exists, and the citation is structured, search for title
if no DOI exists, and the citation is unstructured, search for the complete text of the citation Is that correct? Concerning the question if we should adapt the search strategy, I think we need some more testing. At the moment, we get reliable results only when there is a DOI. With title of fulltext search, resources are often not found, although they are present in GVI or SWB. I haven't seen any results from K10plus yet. See also #289

zuphilip commented 6 years ago

What about using the generated query as a primary key for the lookup table?

:+1: That is the same idea I had.

The maximum of 10.000 precalculated searches sounds IMO a little bit low. E.g. when we have one journal with 4 issues each containing 20 articles each containing 60 references for which we have to perform 2 searches leads already to 9.600 and the maximum is almost reached. I suggest to make the maximum much larger. (You can also calculate how many references we have currently in all active entries in the system, which can be an estimation for the number of searches.)

If I think about the process, then we can also delete all the precalculated searches coming from one reference entry after we finally linked and saved it. Thus, it would make sense to delete the corresponding searches at this point of time, but this may be more complicated to implement.

Anyways, I would like to suggest to have then another option box in the advanced search options in the frontend to force a new search and thereby ignoring any precalculated results. I guess we need something like that for testing and also to tune further with some of the parameters.

lgalke commented 6 years ago

@kleinann yes, this is the exact strategy right now. @zuphilip 10000 was just an arbitrarily picked number, just if we need to limit the memory (which would make sense), we could go for such a strict boundary. With the delete after linking I would be careful, since we can have multiple entries with the same query string, so this would require some kind of 'reference counting'.

Manual re-trigger is a good idea. Probably an additional parameter refresh on the external suggestion service would be a good spot to add it.

anlausch commented 6 years ago

The first version of this feature is available in the backend since yesterday. The strategy for the calculation is as follows:

trigger precalculatation after metadata retrieval (if references are available) and after ocr processing (because then, references should be available)
process in the background: For each reference, create a query string following Annettes strategy, and query the external metadata sources.
save the top 10 suggestions for each query in a separate capped collection (1908450048 byte space --> @zuphilip does it make sense? See calculation below).

For retrieving the suggestions, there is a service which takes a specific bibliographic entry id and calculates the specific query string. Check whether there are matching suggestions in our collection and retrieve it.

capped collections have a specified byte size
db.collection.stats() on production system returned average size of 4241 bytes per object in collection brs
in our paper we write that we have around 600000 references, if we prepare 5% of them at once, we end up with 30000 references to prepare
for which we want max 15 suggestions (at the moment 10) to store, therefore, we can assume 30000 15 4241 bytes = 1908450000 byte
if we devide this by 256 (mongo expects a multiple of this value), we end up with 7454882.8125, i.e. 7454883 * 256= 1908450048 byte

locdb / locdb-frend

Precalculation of suggestions #291