@wrvangeest Can you try to get it to work on a machine where you do have Postgres?
@hermanbanken I will.
An addition to this theory would be to add the SCS Connector score. This score indicates how related two entities are, based on the relations found between them. It can be retrieved through a simple API call, for example:
http://lod2.inf.puc-rio.br/scs/similarities.json?entity1=db:Barack_Obama&entity2=db:Michelle_Obama
We could use this to find the score between a selected entity and the rest of the entities in the article. This score is inversely related to the selection multiplier `s`: if a selected term is barely related to any of the other entities, `s` should be very high, because articles containing that entity are much more important than articles containing any of the other entities. Conversely, if the entity is strongly related to the others, chances are that an article about a related entity is also helpful, so `s` should be lower.
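A rough sketch in Python of how that could look; the JSON shape of the response and the exact mapping from the average SCS score to `s` (including the 1–5 range) are guesses on my part:

```python
import requests

SCS_URL = "http://lod2.inf.puc-rio.br/scs/similarities.json"

def scs_score(entity1, entity2):
    """Fetch the SCS Connector score for two entities.

    Assumes the endpoint returns JSON with a numeric 'value' field;
    the actual response shape still needs to be verified.
    """
    resp = requests.get(SCS_URL, params={"entity1": entity1, "entity2": entity2})
    resp.raise_for_status()
    return float(resp.json().get("value", 0.0))

def selection_multiplier(selected, other_entities, s_min=1.0, s_max=5.0):
    """Derive the selection multiplier s from the average relatedness
    between the selected entity and the other entities in the article.

    The inverse-linear mapping and the s_min/s_max bounds are placeholders:
    low average relatedness -> high s, high relatedness -> low s.
    Assumes SCS scores fall in [0, 1].
    """
    if not other_entities:
        return s_max
    avg = sum(scs_score(selected, e) for e in other_entities) / len(other_entities)
    return s_max - (s_max - s_min) * avg

# e.g. selection_multiplier("db:Barack_Obama", ["db:Michelle_Obama"])
```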
Another effect we could add: whether an entity is in the same paragraph as the selected words or not. Currently we only identify whether an entity is selected; the rest of the entities we consider are simply those that occur anywhere in the full article.
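Something like this (the data structures and the value of `p` are made up; the point is just that we would need to know which paragraph(s) each entity occurs in):

```python
def paragraph_multiplier(entity, selected_paragraphs, entity_paragraphs, p=1.5):
    """Boost entities that occur in the same paragraph as the selection.

    entity_paragraphs maps each entity to the set of paragraph ids it occurs in;
    selected_paragraphs is the set of paragraph ids containing the selected words.
    Returns the (made-up) multiplier p on overlap, 1.0 otherwise.
    """
    if entity_paragraphs.get(entity, set()) & selected_paragraphs:
        return p
    return 1.0
```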
Good that you're exploring options. Please also focus on getting the first one to actually work!
@rubenverboon has a hacked version working. That's good enough for now. We have hardly any IR theory in our solution, so I'd rather be able to present a solution that isn't implemented yet than a properly working simple version.
I partly agree, but we need a simple version because I believe we rely a lot on the demo during our presentation.
Another process we should apply is Maximal Marginal Relevance (from his slides, #68). It lets you re-rank a result set so that you get not only the most relevant but also the most diverse results. Example: say a search term is 'oil'. Alchemy might not be able to disambiguate it properly, and it is possible that our result set contains both 'petroleum' and 'salad dressing' results. Let's say all 'petroleum' results have relevance .98 and all 'dressing' results .97. The automated process does not know which of the two it should be, and just shows the 4 highest-scoring results: all 'petroleum'. This might be completely off the mark. Instead, MMR should diversify and put both 'petroleum' and 'dressing' results in the top list, increasing the chance that the results are relevant.
Alchemy's disambiguation might solve this before there's even a problem, but a fail-safe, especially one that is from the theory, is a good idea imo.
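A minimal sketch of the greedy MMR re-ranking step; the λ trade-off weight and the pairwise similarity function are placeholders we would still have to pick:

```python
def mmr_rerank(candidates, relevance, similarity, k=4, lam=0.7):
    """Greedy Maximal Marginal Relevance re-ranking.

    candidates: list of result ids
    relevance:  dict mapping result id -> relevance to the query (e.g. Alchemy score)
    similarity: function (id, id) -> similarity between two results
    lam:        relevance/diversity trade-off (placeholder value)
    """
    selected, remaining = [], list(candidates)
    while remaining and len(selected) < k:
        def mmr(d):
            # Penalise results that look like something we already picked.
            redundancy = max((similarity(d, s) for s in selected), default=0.0)
            return lam * relevance[d] - (1 - lam) * redundancy
        best = max(remaining, key=mmr)
        selected.append(best)
        remaining.remove(best)
    return selected
```

In the 'oil' example this should pull at least one 'dressing' result into the top 4 instead of four near-identical 'petroleum' hits, assuming the 'petroleum' results are similar to each other.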
@wrvangeest, to minimise the open issues I would like to close this issue. Can you move this content to a markdown file? That way something remains after we close the issue. And you documented these things in the process, so win-win :smile:
Done.
Updated
My idea for the score of an article would be as follows, with:

- `S_a` = score of article `a`
- `R_ia` = relevance of term `i` in article `a`
- `R_ib` = relevance of term `i` in article `b`
- `p` = multiplier for a term being in a 'selected' paragraph
- `SCS_i` = SCS score of term `i` with respect to the selected term(s)

Some edge cases:

- `SCS_i` = 1 for two selected terms
- `SCS_i` = `SCS_i1 * SCS_i2 * ... * SCS_in` for multiple selected terms
- `SCS_i` = `(SCS_i * 0.6) + 0.4`, so that it is not super-influential if no relations are found within 2 steps
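An untested Python sketch of that scoring; how the pieces combine into `S_a` is an assumption on my part (here a sum over terms of `R_ib * R_ia * p * SCS_i`), since only the definitions and edge cases above are fixed:

```python
def combined_scs(term, selected_terms, scs_score):
    """SCS_i for term i with respect to the selected term(s).

    Edge cases from above: 1 when the term is itself selected, the product of
    the individual scores for multiple selected terms, and the 0.6/0.4 rescale
    so that a term with no relations found within 2 steps isn't zeroed out.
    """
    if term in selected_terms:
        return 1.0
    scs = 1.0
    for sel in selected_terms:
        scs *= scs_score(term, sel)
    return scs * 0.6 + 0.4

def article_score(terms, relevance_in_a, relevance_in_b, in_selected_paragraph,
                  selected_terms, scs_score, p=1.5):
    """S_a for article a; I'm assuming b is the article the user is reading.

    relevance_in_a / relevance_in_b: dicts mapping term -> R_ia / R_ib.
    in_selected_paragraph: dict mapping term -> bool (drives the p multiplier).
    The combination below (and the default value of p) is a guess to be tuned.
    """
    score = 0.0
    for term in terms:
        multiplier = p if in_selected_paragraph.get(term, False) else 1.0
        scs_i = combined_scs(term, selected_terms, scs_score)
        score += (relevance_in_b.get(term, 0.0)
                  * relevance_in_a.get(term, 0.0)
                  * multiplier
                  * scs_i)
    return score
```

All the weights (`p`, how the factors are combined) would need tuning once we can actually run this against data.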
In SQL, this can be done in two ways: a single query, or a `FOR` loop, looping over the terms. Either can be done; the difference is probably in performance. It seems the loop would be a bit quicker, but I have no proof.
Method 1:
Method 2 (outdated):
!! I haven't tested the code yet, since I don't have Postgres here. Will check later. I believe the first query already exists, made by @rubenverboon.