usc-isi-i2 / kgtk-similarity

MIT License
Run similarity offline #5

Open matteomedioli opened 2 years ago

matteomedioli commented 2 years ago

Hi all, thanks for this work. I need to run a large number of requests to compute similarities between Wikidata entities. I'm trying to run a simple code snippet without calling the API. Is that possible? I need to compute entity similarities over the whole Wikipedia corpus as fast as possible for a Word Sense Disambiguation problem.

from semantic_similarity import SemanticSimilarity

S = SemanticSimilarity()
print(S.semantic_similarity("Q312", "Q19837", "text"))

I'm receiving FileNotFoundError: [Errno 2] No such file or directory: './kgtk-similarity/classcounts' or FileNotFoundError: [Errno 2] No such file or directory: './kgtk-similarity/labels'

For now I'm using the API approach with async requests, but I need to generate too many HTTP requests. Although the API response time is fast, going through the network takes too long. Thanks in advance!
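
While the network path is the only option, two generic measures reduce its cost: skip duplicate entity pairs and cap the number of requests in flight. The following is a minimal sketch under stated assumptions, not the project's client: `score_pairs` and `fake_fetch` are hypothetical names, and `fake_fetch` is only a stand-in for whatever coroutine actually calls the similarity API.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Iterable, Tuple

Pair = Tuple[str, str]


async def score_pairs(
    pairs: Iterable[Pair],
    fetch: Callable[[str, str], Awaitable[float]],
    max_in_flight: int = 8,
) -> Dict[Pair, float]:
    """Score each distinct pair once, with at most max_in_flight calls running."""
    unique_pairs = list(dict.fromkeys(pairs))  # drop duplicates, keep order
    semaphore = asyncio.Semaphore(max_in_flight)

    async def score_one(pair: Pair) -> Tuple[Pair, float]:
        async with semaphore:  # wait here while the cap is reached
            return pair, await fetch(*pair)

    results = await asyncio.gather(*(score_one(p) for p in unique_pairs))
    return dict(results)


async def fake_fetch(q1: str, q2: str) -> float:
    """Hypothetical placeholder for a real API call; always returns 0.0."""
    await asyncio.sleep(0)
    return 0.0
```

In a Word Sense Disambiguation corpus the same entity pairs tend to recur, so the deduplication step alone may remove a large share of the requests.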

saggu commented 2 years ago

Hi @matteomedioli ,

You can set up KGTK Similarity locally. You have to update this config file in your local repo:

https://github.com/usc-isi-i2/kgtk-similarity/blob/main/semantic_similarity/config.json

The following parameters should be updated with a local copy of the files mentioned.

NODE2VEC_EMBEDDINGS: word2vec-model-full-128d-mincnt100.dat
COMPLEX_EMBEDDINGS: wikidata-20210215-dwd-v2-similarity-embed.2021-10-03T12:14.complex.np.mmap
TRANSE_EMBEDDINGS: wikidata-20210215-dwd-v2-similarity-embed.2021-10-03T12:14.transe.np.mmap
TEXT_EMBEDDINGS: resources/wikidata-20210215-dwd-v2-similarity-embed.2021-10-03T12:14.text.np.mmap
COMPLEX_EMB_FAISS_INDEX: wikidata-20210215-dwd-v2-similarity-embed.2021-10-03T12:14.complexemb.faiss.index.nlist=8192.train=10M.idx
GRAPH_CACHE: wikidata-20210215-dwd-v2-similarity-main.2021-10-03T12:02.sqlite3.db

You can download the required files from this location: https://drive.google.com/drive/folders/1SCEHDESe3Jap1Z72dy_CwE1Soc66NqeK?usp=sharing

This will require a substantial server to run, as some of these files are huge.

Hope it helps.
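
Before importing the package, it can help to confirm that every parameter listed above points at a file that exists. The sketch below assumes that config.json is a flat JSON object mapping these six keys to path strings; that layout is an assumption, and `missing_resources` is a hypothetical helper, not part of kgtk-similarity.

```python
import json
from pathlib import Path
from typing import Dict

# Keys taken from the list in this comment.
REQUIRED_KEYS = [
    "NODE2VEC_EMBEDDINGS",
    "COMPLEX_EMBEDDINGS",
    "TRANSE_EMBEDDINGS",
    "TEXT_EMBEDDINGS",
    "COMPLEX_EMB_FAISS_INDEX",
    "GRAPH_CACHE",
]


def missing_resources(config_path: str) -> Dict[str, str]:
    """Return {key: problem} for each required key that is unset or not a file."""
    # Assumption: the config is a flat JSON object of key -> path string.
    config = json.loads(Path(config_path).read_text())
    problems = {}
    for key in REQUIRED_KEYS:
        value = config.get(key)
        if not value:
            problems[key] = "not set"
        elif not Path(value).is_file():
            # Relative paths are resolved against the current working directory.
            problems[key] = f"file not found: {value}"
    return problems
```

Calling `missing_resources("semantic_similarity/config.json")` from the repository root should return an empty dict once all six files are in place.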

matteomedioli commented 2 years ago

Many thanks. I'm trying to download the whole folder to my server, but I keep receiving this:

Google Drive - Quota exceeded

Sorry, you can't view or download this file at this time.

Too many users have viewed or downloaded this file recently. Please try accessing the file again later. If the file you are trying to access is particularly large or is shared with many people, it may take up to 24 hours to be able to view or download the file. If you still can't access a file after 24 hours, contact your domain administrator.

As you can see, the error is Google Drive - Quota exceeded. I tried to work around it using a Python script and a shell script, with wget or the Google Drive API v3, but I'm still receiving the error. Could you provide another way to access these resources? Many thanks!!
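
Because Google Drive answered with an HTML error page rather than the file, a scripted download may leave that page on disk under the expected file name. The sketch below is one way to spot such files before they are passed to the library; `looks_like_html_error` is a hypothetical helper, and the check is a heuristic on the first bytes only.

```python
from pathlib import Path
from typing import List


def looks_like_html_error(path: str, probe_bytes: int = 2048) -> bool:
    """Heuristic: True if the file starts like an HTML page instead of binary data."""
    with open(path, "rb") as handle:
        head = handle.read(probe_bytes).lstrip().lower()
    return head.startswith(b"<!doctype html") or head.startswith(b"<html")


def suspicious_downloads(folder: str) -> List[str]:
    """List the names of files in folder that look like saved HTML error pages."""
    return sorted(
        entry.name
        for entry in Path(folder).iterdir()
        if entry.is_file() and looks_like_html_error(str(entry))
    )
```

Only the first 2 KB of each file is read, so the check stays cheap even for the multi-gigabyte embedding files.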

saggu commented 2 years ago

Hi @matteomedioli, I have contacted our IT support to resolve this.

Just curious: how many files could you download before this error started popping up? Is this a daily quota, meaning you can try again after 24 hours?

matteomedioli commented 2 years ago

Thanks @saggu, I really appreciate that!

Yes, I tried again today. I received the error message yesterday, but nothing has changed. I read somewhere that shared resources may have a single quota for all users who have access... but I'm not sure if that's true :)

UPDATE: @saggu I retried today. It seems to be working! :) Again, thanks for your time!

UPDATE2: I'm now receiving Quota exceeded only for these 3 files:

wikidata-20210215-dwd-v2-similarity-embed.2021-10-03T12:14.text.np.mmap
wikidata-20210215-dwd-v2-similarity-embed.2021-10-03T12:14.transe.np.mmap
wikidata-20210215-dwd-v2-similarity-main.2021-10-03T12:02.sqlite3.db.gz

I probably have a limit on the GB I can download, since these 3 are the biggest ones...

matteomedioli commented 2 years ago

I tried without the 3 missing files and got a missing-file exception:

Traceback (most recent call last):
  File "semantic_similarity/test.py", line 1, in <module>
    from semantic_similarity import SemanticSimilarity
  File "/home/m.medioli/kgtk-similarity/semantic_similarity/semantic_similarity.py", line 11, in <module>
    class SemanticSimilarity(object):
  File "/home/m.medioli/kgtk-similarity/semantic_similarity/semantic_similarity.py", line 17, in SemanticSimilarity
    'class':   sm.ClassSimilarity(),
  File "/home/m.medioli/kgtk-similarity/semantic_similarity/similarity_measures.py", line 125, in __init__
    self.N = self.backend.get_max_class_count()
  File "/home/m.medioli/kgtk-similarity/semantic_similarity/kypher.py", line 515, in get_max_class_count
    return backend.get_max_class_count(*args, **kwargs)
  File "/home/m.medioli/kgtk-similarity/semantic_similarity/kypher.py", line 181, in get_max_class_count
    return self.get_class_count(self.WD_ENTITY_CLASS_NODE)
  File "/home/m.medioli/kgtk-similarity/semantic_similarity/kypher.py", line 168, in get_class_count
    self.get_query(name=query_name,
  File "/home/m.medioli/.local/lib/python3.8/site-packages/kgtk/kypher/api.py", line 720, in get_query
    kypher_query = KypherQuery(
  File "/home/m.medioli/.local/lib/python3.8/site-packages/kgtk/kypher/api.py", line 128, in __init__
    self._define(**kwargs)
  File "/home/m.medioli/.local/lib/python3.8/site-packages/kgtk/kypher/api.py", line 191, in _define
    self.kgtk_query = kyquery.KgtkQuery (
  File "/home/m.medioli/.local/lib/python3.8/site-packages/kgtk/kypher/query.py", line 217, in __init__
    store.add_graph(file, alias=alias)
  File "/home/m.medioli/.local/lib/python3.8/site-packages/kgtk/kypher/sqlstore.py", line 833, in add_graph
    self.import_graph_data_via_csv(table, file)
  File "/home/m.medioli/.local/lib/python3.8/site-packages/kgtk/kypher/sqlstore.py", line 902, in import_graph_data_via_csv
    with open_to_read(file) as inp:
  File "/home/m.medioli/.local/lib/python3.8/site-packages/kgtk/kypher/sqlstore.py", line 83, in open_to_read
    return open(file, mode)
FileNotFoundError: [Errno 2] No such file or directory: '/home/m.medioli/kgtk-similarity/classcounts'

saggu commented 2 years ago

Hi @matteomedioli, those three files are required. I have no update on a quota extension from Google Drive. I'll keep you updated.

hucara commented 1 year ago

@saggu I'm getting the same problem as @matteomedioli, even with all the files in the correct folder. Any idea?

kgtk-similarity  | Traceback (most recent call last):
kgtk-similarity  |   File "/src/application.py", line 5, in <module>
kgtk-similarity  |     from semantic_similarity.main import QnodeSimilarity
kgtk-similarity  |   File "/src/semantic_similarity/main.py", line 3, in <module>
kgtk-similarity  |     from semantic_similarity.semantic_similarity import SemanticSimilarity
kgtk-similarity  |   File "/src/semantic_similarity/semantic_similarity.py", line 15, in <module>
kgtk-similarity  |     class SemanticSimilarity(object):
kgtk-similarity  |   File "/src/semantic_similarity/semantic_similarity.py", line 20, in SemanticSimilarity
kgtk-similarity  |     'class': sm.ClassSimilarity(),
kgtk-similarity  |   File "/src/semantic_similarity/similarity_measures.py", line 125, in __init__
kgtk-similarity  |     self.N = self.backend.get_max_class_count()
kgtk-similarity  |   File "/src/semantic_similarity/kypher.py", line 515, in get_max_class_count
kgtk-similarity  |     return backend.get_max_class_count(*args, **kwargs)
kgtk-similarity  |   File "/src/semantic_similarity/kypher.py", line 181, in get_max_class_count
kgtk-similarity  |     return self.get_class_count(self.WD_ENTITY_CLASS_NODE)
kgtk-similarity  |   File "/src/semantic_similarity/kypher.py", line 168, in get_class_count
kgtk-similarity  |     self.get_query(name=query_name,
kgtk-similarity  |   File "/usr/local/lib/python3.9/site-packages/kgtk/kypher/api.py", line 749, in get_query
kgtk-similarity  |     kypher_query = KypherQuery(
kgtk-similarity  |   File "/usr/local/lib/python3.9/site-packages/kgtk/kypher/api.py", line 128, in __init__
kgtk-similarity  |     self._define(**kwargs)
kgtk-similarity  |   File "/usr/local/lib/python3.9/site-packages/kgtk/kypher/api.py", line 193, in _define
kgtk-similarity  |     self.kgtk_query = kyquery.KgtkQuery(
kgtk-similarity  |   File "/usr/local/lib/python3.9/site-packages/kgtk/kypher/query.py", line 225, in __init__
kgtk-similarity  |     store.add_graph(file, alias=alias, index_specs=index_specs, append=append_files)
kgtk-similarity  |   File "/usr/local/lib/python3.9/site-packages/kgtk/kypher/sqlstore.py", line 1038, in add_graph
kgtk-similarity  |     self.add_graph_data(table, file, index_specs=index_specs, append=False)
kgtk-similarity  |   File "/usr/local/lib/python3.9/site-packages/kgtk/kypher/sqlstore.py", line 1113, in add_graph_data
kgtk-similarity  |     self.import_graph_data_via_csv(table, file, append=append)
kgtk-similarity  |   File "/usr/local/lib/python3.9/site-packages/kgtk/kypher/sqlstore.py", line 1188, in import_graph_data_via_csv
kgtk-similarity  |     with open_to_read(file) as inp:
kgtk-similarity  |   File "/usr/local/lib/python3.9/site-packages/kgtk/kypher/utils.py", line 43, in open_to_read
kgtk-similarity  |     return open(file, mode)
kgtk-similarity  | FileNotFoundError: [Errno 2] No such file or directory: '/src/classcounts'
kgtk-similarity exited with code 1

saggu commented 1 year ago

@hucara It looks like you are running the Docker version. Did you update the docker-compose.yaml file as mentioned here: https://github.com/usc-isi-i2/kgtk-similarity#docker-installation ?

hucara commented 1 year ago

Thanks for the fast reply! Yes, I'm trying to run the Docker version on AWS. The volume path in docker-compose.yaml is updated and I'm still getting the error.

Still, the error comes from a different path: FileNotFoundError: [Errno 2] No such file or directory: '/src/classcounts'

Checking the traceback I posted earlier, I see the line where this crashes: kgtk-similarity | File "/usr/local/lib/python3.9/site-packages/kgtk/kypher/sqlstore.py", line 1188, in import_graph_data_via_csv

But I don't see any csv file in the Google Drive folder. Could this be the problem?

saggu commented 1 year ago

Hi @hucara, yes, it looks like a file required by the DB cache is missing. Can you post a list of the files you are using?

Also, identify the DB_CACHE file (ends with sqlite3.db) and post the output of the following command:

kgtk --debug query --gc DB_CACHE --show-cache
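
As a coarser check that does not depend on KGTK, the cache file can be opened directly with Python's standard sqlite3 module to confirm it is a readable, non-empty database. This is only a sketch: `list_tables` is a hypothetical helper, it makes no assumption about which tables Kypher creates, and it does not replace the `--show-cache` output requested above.

```python
import sqlite3
from pathlib import Path
from typing import List


def list_tables(db_path: str) -> List[str]:
    """Return the table names in an SQLite file, opened read-only."""
    # mode=ro avoids silently creating an empty database when the path is wrong;
    # as_uri() percent-encodes characters such as ':' in the file name.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    connection = sqlite3.connect(uri, uri=True)
    try:
        rows = connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        connection.close()
    return [row[0] for row in rows]
```

A wrong path raises `sqlite3.OperationalError`, and a file that is not an SQLite database, such as one that is still compressed, raises `sqlite3.DatabaseError`. The download list earlier in this thread shows the cache with a `.gz` suffix, so it may need to be decompressed first.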

littleflow3r commented 1 month ago

Hi @saggu, thank you for the great work. I am currently trying to run the PATH API by setting up KGTK locally. Looking at the code, I would need this KGTK edge file: "input_kgtk_edge_file": "". Is it possible for you to share this file as well? Thanks!