ipeirotis / WikiSynonyms

Extracts synonyms for various terms, exploiting the redirects between terms in Wikipedia
http://wikisynonyms.ipeirotis.com/
12 stars, 3 forks

Suggested algorithm to implement #20

ipeirotis closed 11 years ago

ipeirotis commented 11 years ago
  1. Issue a case-insensitive query.

    1a. If 1 entry is returned, keep it and proceed to Step 3.

    1b. If n>1 entries are returned:

      1b.a. If only one entry is a non-redirect, keep it and proceed to Step 3.

      1b.b. If more than one entry is a non-redirect, proceed to Step 2.

    1c. If there are no matching entries, return a 204 HTTP code (No Content) and a corresponding message.

  2. Issue a case-sensitive query, and fetch the matching term. We follow this step only if there is more than one match in Step 1. We expect to have either one exact match or no matches.

    2a. If there is no matching term, return a 300 HTTP code (Multiple Choices) and the appropriate JSON, with the multiple entries from Step 1 and a warning message: "Multiple matches found because of ambiguous capitalization of the query. Please query again with one of the returned terms"

    2b. If there is a matching term, keep it and proceed to Step 3.

  3. At this step, we have a single candidate term to use.

    3a. If the term is a disambiguation page, return a 300 HTTP code (Multiple Choices) and the appropriate JSON with the entries that appear in the disambiguation page, and a warning message: "The entry is a disambiguation page in Wikipedia. Please query again with one of the returned terms."

    3b. If the term is a redirect, then replace term with the redirect term, and repeat Step 3

    3c. If the term is not a redirect, find all the redirect terms that lead to it and return them as synonyms. Return the term itself as the canonical form in the JSON.

georgegg commented 11 years ago

Clarification:

When we reach Step 3 we have one result (row) --> (sid, stitle, tid, ttitle), where "s" stands for source and "t" for target. A target page is a primary page, so we know which is the redirect (source) and which is the non-redirect (target).

Is it necessary to perform the 3b, 3c check?

ipeirotis commented 11 years ago

3b will be useful in cases where a redirect term leads to another redirect term. Not sure if this is ever the case. If this is never the case, then we just perform 3c.

georgegg commented 11 years ago

I think we concluded that MediaWiki itself prevents redirect-to-redirect relations. Also, when we construct the relation table I take precautions against that (I think!?).

So I'll skip 3b for good since there's no need for it at all.
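
Before dropping 3b, the "no redirect-to-redirect" assumption could be checked directly against the relation table. Here is a minimal sketch, assuming the rows have the (sid, stitle, tid, ttitle) shape described above and that a primary page never appears as a source row. The function name `redirect_to_redirect_rows` is hypothetical.

```python
from typing import Iterable, List, Tuple

Row = Tuple[int, str, int, str]  # (sid, stitle, tid, ttitle)


def redirect_to_redirect_rows(rows: Iterable[Row]) -> List[Row]:
    """Return rows whose target page is itself the source of some row."""
    rows = list(rows)
    source_ids = {sid for sid, _, _, _ in rows}
    # A target id that also shows up as a source id means the target is a redirect.
    return [row for row in rows if row[2] in source_ids]
```

An empty result means every target is a primary page, so 3c alone is enough. Any returned row is a redirect that points at another redirect.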

georgegg commented 11 years ago

Given that MediaWiki doesn't allow redirect-to-redirect relations, I skip all the clauses that check whether a page is or is not a redirect.

So I rewrote my initial queries (case-sensitive or not) to do a GROUP BY tid, thus fetching ONLY base pages. Then I do the count check and the disambiguation check (according to the algorithm), and now I think our results are more user-friendly and conclusive.
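
The GROUP BY tid idea might look like the following sketch. It uses Python's built-in `sqlite3` as a stand-in, since the thread does not show the actual database or queries. The `page_relation` table and its (sid, stitle, tid, ttitle) columns come from this thread; the sample rows, the function names `make_demo_db` and `base_pages`, and the use of SQLite's `NOCASE` and `BINARY` collations are assumptions. How a query for a base page's own title is matched is not covered here.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Build an in-memory page_relation table with illustrative rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE page_relation (sid INTEGER, stitle TEXT, tid INTEGER, ttitle TEXT)"
    )
    conn.executemany(
        "INSERT INTO page_relation VALUES (?, ?, ?, ?)",
        [
            (1, "AJAX", 100, "Ajax (programming)"),
            (2, "Ajax", 100, "Ajax (programming)"),
            (3, "ajax", 200, "Ajax (mythology)"),
            (4, "C plus plus", 300, "C++"),
        ],
    )
    return conn


def base_pages(conn: sqlite3.Connection, term: str, case_sensitive: bool = False):
    """Return one (tid, ttitle) row per base page whose source title matches."""
    collation = "BINARY" if case_sensitive else "NOCASE"  # fixed choices, not user input
    sql = (
        "SELECT tid, ttitle FROM page_relation "
        f"WHERE stitle = ? COLLATE {collation} "
        "GROUP BY tid ORDER BY tid"
    )
    return conn.execute(sql, (term,)).fetchall()
```

Grouping by tid collapses several matching source titles that point at the same base page into one row, so the count check only sees distinct base pages. If the case-insensitive call returns more than one row, the case-sensitive call plays the role of Step 2.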

You can check it in wikisyno/?action=ajax_v2term=YOUR TERM

Some case studies for you are: Ajax, ajax, aJaX, c plus plus, c pound, settlement.

I have also added the 2 CS (case-sensitive) columns, and I'm going to test speed now --> pending update!!!

georgegg commented 11 years ago

Update and indexing of the CS columns in the page_relation table have improved speed significantly!
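
One way to see why indexing helps is to compare the query plan before and after adding the index. The sketch below again uses `sqlite3` as a stand-in for whatever database the project runs on. The column name `stitle_cs`, the index name, and the lookup query are hypothetical, since the thread names only the `page_relation` table and its "CS columns".

```python
import sqlite3

INDEX_NAME = "idx_page_relation_stitle_cs"  # hypothetical index name
LOOKUP_SQL = "SELECT tid, ttitle FROM page_relation WHERE stitle_cs = ?"


def query_plan(conn: sqlite3.Connection, term: str) -> str:
    """Return SQLite's plan description for the case-sensitive lookup."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + LOOKUP_SQL, (term,)).fetchall()
    return " | ".join(row[-1] for row in rows)  # the last column holds the detail text


def demo():
    """Return the plan text before and after indexing the CS column."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE page_relation ("
        " sid INTEGER, stitle TEXT COLLATE NOCASE, stitle_cs TEXT,"
        " tid INTEGER, ttitle TEXT COLLATE NOCASE, ttitle_cs TEXT)"
    )
    before = query_plan(conn, "AJAX")
    conn.execute(f"CREATE INDEX {INDEX_NAME} ON page_relation (stitle_cs)")
    after = query_plan(conn, "AJAX")
    return before, after
```

Before the index exists the plan is a scan of the whole table. Afterwards the plan names the index, which is the kind of change that would account for a speedup like the one reported here.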