neulab / knn-transformers

PyTorch + HuggingFace code for RetoMaton: "Neuro-Symbolic Language Modeling with Automaton-augmented Retrieval" (ICML 2022), including an implementation of kNN-LM and kNN-MT
MIT License

The size of the dstore #6

YsylviaUC closed this issue 1 year ago

YsylviaUC commented 2 years ago

Hi, I'm wondering how to set knn_args.dstore_size if I use my own data to construct the datastore?

urialon commented 2 years ago

Hi @YsylviaUC ! Thank you for your interest in our work.

I just pushed a commit that sets dstore_size automatically, based on your training set size, if you don't pass a value for this flag.

You will need to know this number when you load the saved datastore later. You can read the saved size from the datastore's file name. For example, if the saved file is called dstore_gpt2_116988150_768_vals.npy, the size is 116988150. This number is also printed when you save the datastore:

09/14/2022 11:01:00 - INFO - __main__ - [train] Total eval tokens: 116988150

Let me know if you have any questions or problems, Best, Uri
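
For readers who want to recover the size programmatically, here is a minimal sketch that parses it from a file name shaped like the example above. It is not part of the repository: the helper names (`parse_dstore_filename`, `estimated_keys_bytes`) are hypothetical, and it assumes the layout is `dstore_<model>_<size>_<dimension>_<keys|vals>.npy`, where the `768` field is the vector dimension and a matching `_keys` file exists. The byte estimate further assumes 2 bytes per stored value (float16), which may not match your setup.

```python
import re
from typing import NamedTuple


class DstoreInfo(NamedTuple):
    model_type: str
    size: int        # number of entries in the datastore (the dstore_size value)
    dimension: int   # ASSUMPTION: the field after the size is the vector dimension


# ASSUMPTION: file names follow dstore_<model>_<size>_<dim>_<keys|vals>.npy,
# generalized from the single example given in this thread.
_PATTERN = re.compile(
    r"^dstore_(?P<model>.+)_(?P<size>\d+)_(?P<dim>\d+)_(?:keys|vals)\.npy$"
)


def parse_dstore_filename(filename: str) -> DstoreInfo:
    """Extract the model type, datastore size, and dimension from a file name."""
    match = _PATTERN.match(filename)
    if match is None:
        raise ValueError(f"Unrecognized datastore file name: {filename!r}")
    return DstoreInfo(match["model"], int(match["size"]), int(match["dim"]))


def estimated_keys_bytes(info: DstoreInfo, bytes_per_value: int = 2) -> int:
    """Rough size of the key vectors; float16 (2 bytes) is an assumption."""
    return info.size * info.dimension * bytes_per_value


info = parse_dstore_filename("dstore_gpt2_116988150_768_vals.npy")
print(info.size)
print(f"{estimated_keys_bytes(info) / 1e9:.1f} GB of keys (if float16)")
```

The parsed `size` is the value to pass as dstore_size when loading the saved datastore later.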

YsylviaUC commented 1 year ago

Hello, how long does it take to build the FAISS index from such a large set of training vectors (>100G)?

urialon commented 1 year ago

It depends on the number of CPU cores. If you can use more, the code will use them.

I think a few hours.