Entity matching using wikidata

sumit-agrwl commented 2 years ago

I don't want to use Wikipedia for any processing. I just want to use Wikidata for entity matching in different languages. Can you guide me through the steps? I am assuming I only need to work with Wikidata.

ringgaard commented 2 years ago

I have added a new --wikidata_only flag that you can use for the fuse_items and extract_aliases tasks. This excludes inputs from Wikipedia. Please note that this means you will not get entity popularity counts in the alias table.

sumit-agrwl commented 2 years ago

So which steps do I need to run? I have run everything up to import_wikidata.

Also, my ultimate aim is this: given a piece of text like “Who is the president of United States?”, extract the Wikidata entity mentioned in it, such as “president of United States”. If you could just tell me what needs to be done, that would be helpful. I can see the parse for other stores, but I cannot find any documentation for Wikidata as such.

ringgaard commented 2 years ago

You need to run the following tasks in addition to import_wikidata:

sling compute_fanin fuse_items build_kb extract_aliases build_phrasetab --wikidata_only

This will produce a knowledge base (kb.sling) and a phrase table (phrase-table.repo). You can use the phrase table to look up matching phrases, see https://github.com/ringgaard/sling/blob/master/doc/guide/pyapi.md#phrase-tables.

Since both the knowledge base and the phrase table are in memory, lookups are pretty fast. You should be able to look up all subphrases up to a certain length (e.g. 10).
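To illustrate the subphrase idea, here is a minimal sketch in plain Python. It does not use SLING itself: `toy_lookup` and `TOY_TABLE` are hypothetical stand-ins with made-up entries, and the whitespace tokenization is an assumption. With a real phrase table you would presumably pass its lookup method (see the linked Python API guide) in place of `toy_lookup`.

```python
def candidate_spans(tokens, max_len=10):
    """Yield every contiguous span of at most max_len tokens as a phrase string."""
    for start in range(len(tokens)):
        last = min(start + max_len, len(tokens))
        for end in range(start + 1, last + 1):
            yield " ".join(tokens[start:end])


def match_phrases(text, lookup, max_len=10):
    """Return (phrase, entities) for each subphrase that the lookup recognizes."""
    # Naive tokenization: strip trailing punctuation, split on whitespace.
    tokens = text.rstrip("?.!").split()
    matches = []
    for phrase in candidate_spans(tokens, max_len):
        entities = lookup(phrase)
        if entities:
            matches.append((phrase, list(entities)))
    return matches


# Hypothetical toy table standing in for a real phrase table.
TOY_TABLE = {
    "president of united states": ["Q11696"],
    "united states": ["Q30"],
}


def toy_lookup(phrase):
    """Case-insensitive lookup in the toy table; returns [] when nothing matches."""
    return TOY_TABLE.get(phrase.lower(), [])


matches = match_phrases("Who is the president of United States?", toy_lookup)
for phrase, ids in matches:
    print(phrase, ids)
```

Enumerating all spans is quadratic in the worst case, but capping the span length keeps the number of lookups per sentence small, which is why the in-memory tables matter here.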

sumit-agrwl commented 2 years ago

Since you have made changes for the flag in the source code, I am assuming I need to build from source.

I cloned the repo and ran setup.sh, but it's giving me the error

ln: failed to create symbolic link '/usr/lib/python3.7/site-packages/sling': No such file or directory

After that I tried the command below. I am assuming I need to do this:

If you haven't run the setup.sh script already, you then need to link the sling Python module directly to the Python source directory to use it in "developer mode":

sudo ln -s $(realpath python) /usr/lib/python3/dist-packages/sling

For which I ran

sudo ln -s /usr/bin/python3 /usr/lib/python3/dist-packages/sling

But the sling command is still not working.

ringgaard commented 2 years ago

If your sling directory is /home/bob/sling, I think the ln command should be something like:

sudo ln -s /home/bob/sling/python /usr/lib/python3/dist-packages/sling

You can also just wait until tomorrow, when the changes will have been included in the nightly build.
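The difference between the two ln commands above is the link target: the package link has to resolve to the python source directory of the checkout, not to the interpreter binary. A small sketch of how one might check that is below. The helper name `describe_module_link` is made up for illustration, and the demo uses throwaway temporary paths rather than the real /usr/lib location.

```python
import os
import tempfile


def describe_module_link(link_path):
    """Report whether link_path is a symlink that resolves to a directory."""
    if not os.path.islink(link_path):
        return "not a symlink"
    target = os.path.realpath(link_path)
    if not os.path.isdir(target):
        return "bad target (not a directory): " + target
    return "ok: " + target


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for <checkout>/python, the directory the package link should target.
    package_dir = os.path.join(tmp, "sling_checkout", "python")
    os.makedirs(package_dir)

    # Stand-in for an interpreter binary such as /usr/bin/python3.
    interpreter = os.path.join(tmp, "python3")
    open(interpreter, "w").close()

    good_link = os.path.join(tmp, "good_sling")
    bad_link = os.path.join(tmp, "bad_sling")
    os.symlink(package_dir, good_link)
    os.symlink(interpreter, bad_link)

    good_report = describe_module_link(good_link)
    bad_report = describe_module_link(bad_link)

print(good_report)
print(bad_report)
```

On a real machine the same function could be pointed at the dist-packages link to see where it actually resolves.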

sumit-agrwl commented 2 years ago

Thank you for your prompt responses. I was able to run the command!

sumit-agrwl commented 2 years ago

[2022-04-30 16:37:09.474628: F sling/task/task.cc:215] Input config is missing for task fused-items/item-reconciler

ringgaard commented 2 years ago

Seems like the config is not optional for the item reconciler. Could I get you to try adding the auxin parameter in kb.py:

      return self.wf.mapreduce(input=items,
                               output=output,
                               mapper="item-reconciler",
                               reducer="item-merger",
                               format="message/frame",
                               params={"indexed": True},
                               auxin={"config": self.recon_config()})

sumit-agrwl commented 2 years ago

Thanks for your prompt reply. It's running.

sumit-agrwl commented 2 years ago

I am not sure I understood this. My question remains: given a query like "Who is the president of United States?", can it extract "president of United States" as an entity that matches "President of the United States" (https://ringgaard.com/kb/Q11696)? I am hoping there is some kind of SLING parser that can do that, but I cannot find any documentation or process for it. It would also be helpful if I could do the entity linking in different languages. I think this project supports that, but I have not been able to figure out how. One more question:

I need to change "Who is the president of United States?" to:

Who is the {entity in different language}? For example: Who is the [Presidente de los Estados Unidos] ?

(This would use the aliases in different languages.) Currently, after running the steps that you suggested, I only get the English name and no aliases.

ringgaard commented 2 years ago

If you want to match entity names in other languages, you can use the --language flag when generating the phrase tables (extract_aliases and build_phrasetab). I should note that Wikidata is not "language-dependent" in the same sense as Wikipedia.
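For the substitution step asked about above (swapping a matched phrase for the entity's name in another language), a rough sketch could look like the following. This is plain Python, not a SLING API: `TOY_LABELS` is a hypothetical hand-written mapping using the labels quoted in this thread, and in practice the labels would come from the knowledge base.

```python
import re

# Hypothetical per-language labels for one entity, copied from this thread.
TOY_LABELS = {
    "Q11696": {
        "en": "President of the United States",
        "es": "Presidente de los Estados Unidos",
    },
}


def substitute_entity(text, phrase, label):
    """Replace the first case-insensitive occurrence of phrase with [label]."""
    pattern = re.compile(re.escape(phrase), re.IGNORECASE)
    # A function replacement keeps backslashes in the label from being
    # interpreted as regex escapes.
    return pattern.sub(lambda match: "[" + label + "]", text, count=1)


question = "Who is the president of United States?"
rewritten = substitute_entity(
    question, "president of United States", TOY_LABELS["Q11696"]["es"]
)
print(rewritten)  # Who is the [Presidente de los Estados Unidos]?
```

This only covers the string rewrite once a phrase has already been matched to an entity. Finding the right phrase and entity is the hard part.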

While there is a semantic parser and some entity resolution components in SLING, this is not really going to solve your problem. What you are asking for is really a question-answering system. This is a difficult research problem which many researchers and companies are actively working on. There are no simple solutions, but if you search for this, you will find references to many articles describing different approaches to this problem, each with its own strengths and weaknesses.

Entity matching using wikidata #15