rubenverboon / skiir

MIT License

Articles relevance calculation #32

Closed wrvangeest closed 9 years ago

wrvangeest commented 9 years ago

Updated

My idea for the score of an article would be as follows:

Sa = SUM( Ria * Rib * p * SCSi ) 

With:

- Sa = Score of article
- Ria = Relevance of term i in article a
- Rib = Relevance of term i in article b
- p = Multiplier for term being in 'selected' paragraph
- SCSi = SCS score of term i with respect to selected term(s)
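As a rough illustration of the formula, here is a minimal Python sketch. It assumes article a is the article with the requested text and article b is a candidate article, that relevance values are available as plain dicts, and that p is 1 inside the 'selected' paragraph and 0.5 outside it (the values used in the SQL in this issue). The function name, the dict layout, the sample numbers, and the fallback weight of 1.0 for a term without an SCS score are all assumptions made for the example.

```python
def article_score(source, candidate, in_paragraph, scs, p_in=1.0, p_out=0.5):
    """Sum Ria * Rib * p * SCSi over the terms both articles share.

    source / candidate: dicts mapping term -> relevance (Ria / Rib).
    in_paragraph: set of terms found in the 'selected' paragraph.
    scs: dict mapping term -> SCS score with respect to the selected term(s).
    """
    total = 0.0
    for term, r_a in source.items():
        r_b = candidate.get(term)
        if r_b is None:
            continue  # term absent from the candidate: contributes nothing
        p = p_in if term in in_paragraph else p_out
        # Assumption: a term without an SCS score counts with weight 1.0.
        total += r_a * r_b * p * scs.get(term, 1.0)
    return total


# Made-up relevance values, purely for illustration.
source = {"oil": 0.9, "opec": 0.4}
candidate = {"oil": 0.8, "salad": 0.6}
score = article_score(
    source, candidate, in_paragraph={"oil"}, scs={"oil": 0.5, "opec": 0.2}
)
print(score)
```

Only "oil" is shared by both articles here, so the sum has a single term: 0.9 * 0.8 * 1.0 * 0.5.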

Some edge cases:

In SQL, this can be done in two ways:

Either works; the difference is probably in performance. The loop seems a bit quicker, but I have no proof.

Method 1:

@ArtId = ID of article with requested text
@InParagraph = Array with entities in 'selected' paragraph
#SCS = Table with SCS scores of all non-selected entities

CREATE TABLE #tempCalc(
  ArtId, Score
)

-- One row per candidate article sharing a term with @ArtId, excluding @ArtId itself
INSERT INTO #tempCalc(ArtId, Score)
  (
   SELECT DISTINCT C.ArtId, 0
   FROM Cross AS C
   WHERE C.ArtId <> @ArtId
   AND C.TermId IN (
     SELECT C1.TermId 
     FROM Cross AS C1
     WHERE C1.ArtId = @ArtId
     )
  ) 

FOR t IN SELECT TermId 
     FROM Cross
     WHERE ArtId = @ArtId
LOOP

  UPDATE #tempCalc
  SET Score = 
   Score + 
     COALESCE(
      (SELECT C.Relevance
       FROM Cross AS C
       WHERE C.TermId = t
       AND   C.ArtId  = #tempCalc.ArtId
      ), 0) *
     (SELECT C.Relevance
      FROM Cross AS C
      WHERE C.TermId = t
      AND   C.ArtId  = @ArtId
     ) *
     (CASE WHEN t = ANY(@InParagraph) THEN 1 ELSE 0.5 END)
     * 
     (SELECT SCScore FROM #SCS WHERE TermId = t)

   RETURN NEXT t;
END LOOP;

SELECT ArtId 
FROM #tempCalc
ORDER BY Score DESC
LIMIT 4

Outdated Method 2:

-- Score every other article by the terms it shares with @ArtId
SELECT C2.AID AS ArtId, 
  SUM(
   C1.Relevance * C2.Relevance * (CASE WHEN C1.TID = ANY(@SelectedTerms) THEN 1 ELSE 0.5 END)
  ) AS Score
FROM   Cross AS C1
LEFT JOIN Cross AS C2 ON C1.TID = C2.TID
WHERE C1.AID = @ArtId
AND C1.AID <> C2.AID
GROUP BY C2.AID

!! I haven't tested the code yet, since I don't have Postgres here. Will check later. I believe the first query already exists, made by @rubenverboon.
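One way to try the single-query idea without a Postgres instance is an in-memory SQLite database from Python's standard library. This is only a sketch: the table names `cross_ref` and `selected`, the column names, and the sample rows are invented for the example (CROSS is an SQL keyword, so the table is renamed here), and SQLite's dialect differs from Postgres in places, so the query may need adjusting before it runs there.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per (article, term) pair with its relevance.
conn.execute("CREATE TABLE cross_ref (art_id INTEGER, term_id INTEGER, relevance REAL)")
conn.executemany(
    "INSERT INTO cross_ref VALUES (?, ?, ?)",
    [
        (1, 10, 0.9), (1, 11, 0.4),  # article 1: the article with the requested text
        (2, 10, 0.8),                # article 2 shares term 10
        (3, 11, 0.5), (3, 10, 0.1),  # article 3 shares both terms
        (4, 12, 0.7),                # article 4 shares nothing
    ],
)
# Terms that get the full multiplier (1 instead of 0.5).
conn.execute("CREATE TABLE selected (term_id INTEGER)")
conn.execute("INSERT INTO selected VALUES (10)")

QUERY = """
SELECT c2.art_id,
       SUM(c1.relevance * c2.relevance *
           CASE WHEN c1.term_id IN (SELECT term_id FROM selected)
                THEN 1.0 ELSE 0.5 END) AS score
FROM cross_ref AS c1
JOIN cross_ref AS c2 ON c1.term_id = c2.term_id
WHERE c1.art_id = ? AND c2.art_id <> c1.art_id
GROUP BY c2.art_id
ORDER BY score DESC
LIMIT 4
"""

rows = conn.execute(QUERY, (1,)).fetchall()
for art_id, score in rows:
    print(art_id, score)
```

Article 4 never appears in the result because it shares no term with article 1; the SCS factor is left out to keep the sketch short.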

hermanbanken commented 9 years ago

@wrvangeest Can you try to get it to work on a machine where you do have Postgres?

wrvangeest commented 9 years ago

@hermanbanken I will.

An addition to this theory would be to add the SCS Connector score. This score indicates how related two entities are, based on the relations found between the two. Scores can be retrieved through a simple API call. Example: http://lod2.inf.puc-rio.br/scs/similarities.json?entity1=db:Barack_Obama&entity2=db:Michelle_Obama. We could use this to find the score between a selected entity and the rest of the entities in the article. This score is inversely related to the selection multiplier s: if a selected term is barely related to any other entity, s should be very high, because articles containing that entity are much more important than articles containing any of the other entities. Conversely, if the entity is closely related to the others, chances are that an article about a related entity is helpful, so s should be lower.
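A small sketch of how the request URLs for a selected entity could be put together in Python. Only the endpoint and the `entity1` / `entity2` parameters come from the example above; the helper names are made up, and since the response format is not described in this thread, the sketch stops at building the URLs and does not call the service.

```python
from urllib.parse import urlencode

SCS_ENDPOINT = "http://lod2.inf.puc-rio.br/scs/similarities.json"


def scs_url(entity1, entity2):
    """Build the similarity URL for one pair of entities."""
    # safe=":" keeps the "db:" prefix readable instead of encoding it as %3A.
    query = urlencode({"entity1": entity1, "entity2": entity2}, safe=":")
    return f"{SCS_ENDPOINT}?{query}"


def scs_urls_for_selection(selected, others):
    """One URL per non-selected entity, pairing it with the selected one."""
    return {other: scs_url(selected, other) for other in others if other != selected}


url = scs_url("db:Barack_Obama", "db:Michelle_Obama")
print(url)
```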

wrvangeest commented 9 years ago

Another effect we could add: whether an entity is in the same paragraph as selected words or not. Currently, we only identify whether an entity is selected. The rest of the entities considered are those that are in the full article.

hermanbanken commented 9 years ago

Good that you're exploring options. Please also focus on getting the first one to actually work!

wrvangeest commented 9 years ago

@rubenverboon has a hacked version working. That's good enough for now. We have hardly any IR theory in our solution, so I'd rather be able to present a solution that isn't implemented yet, than a proper working simple version.

hermanbanken commented 9 years ago

I partly agree, but we need a simple version because we rely on the demo during our presentation a lot I believe.

wrvangeest commented 9 years ago

Another process we should apply is Maximal Marginal Relevance (from his slides # 68). It enables you to shuffle a result set in order to not only get the most relevant, but also the most diverse results. Example: say a search term is 'oil'. Alchemy might not be able to properly disambiguate and it is possible that our result set shows both 'petroleum' and 'salad dressing' results. And let's say all 'petroleum' results have relevance .98 and 'dressing' results .97. The automation process does not know which of the two it should be, but just shows the 4 highest results: all 'petroleum'. This might be completely off the mark. Instead, MMR should diversify and put both 'petroleum' and 'dressing' results in the top list, increasing the chance that the results are relevant.

Alchemy's disambiguation might solve this before there's even a problem, but a fail-safe, especially one that is from the theory, is a good idea imo.
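To make the 'oil' example concrete, here is a minimal sketch of greedy MMR selection in Python. It uses the usual form of the criterion, lam * relevance minus (1 - lam) * the highest similarity to anything already picked. The function name, the lam value of 0.5, the document ids, and the crude similarity (1 for the same sense, 0 otherwise) are assumptions for the illustration, not something we have implemented.

```python
def mmr(candidates, relevance, similarity, k=4, lam=0.5):
    """Greedily pick k items, trading relevance against redundancy."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:

        def mmr_score(doc):
            # Redundancy: highest similarity to anything already selected.
            redundancy = max((similarity(doc, s) for s in selected), default=0.0)
            return lam * relevance[doc] - (1 - lam) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical result set for the search term 'oil'.
topic = {"p1": "petroleum", "p2": "petroleum", "p3": "petroleum", "p4": "petroleum",
         "d1": "dressing", "d2": "dressing", "d3": "dressing", "d4": "dressing"}
relevance = {doc: 0.98 if t == "petroleum" else 0.97 for doc, t in topic.items()}


def same_topic(a, b):
    # Stand-in similarity: 1.0 when both results are about the same sense.
    return 1.0 if topic[a] == topic[b] else 0.0


top = mmr(list(topic), relevance, same_topic)
print(top)
```

With these numbers a plain sort by relevance would return four 'petroleum' results, while the MMR pass pulls a 'dressing' result into the top four as its second pick.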

hermanbanken commented 9 years ago

@wrvangeest, to minimise the open issues I would like to close this issue. Can you move this content to a markdown file? That way something remains after we close the issue. And you documented these things in the process, so win-win :smile:

wrvangeest commented 9 years ago

Done.