lucy3 / grounding-embeddings

Comparing human-based and machine-based meaning.

Vocab.txt #2

Open SharmisthaJat opened 5 years ago

SharmisthaJat commented 5 years ago

Hi,

Thanks for sharing your code. I was not sure what format the vocab.txt should be, as the file was not in the repo, so I tested the code with a vocab.txt with single words on each line. Example:

yellow
woods
hanging
regularize

But even with a very small vocab file of 53 words, the script ends up using too much memory (on the order of 200 GB of RAM) and eventually gets killed during the '$(datadir)/cooccurrence.filtered.bin: filter_glove' step in the Makefile. Am I running the script with the right input? (There are no other errors reported.)

lucy3 commented 5 years ago

Which script are you trying to run?

SharmisthaJat commented 5 years ago

Hi, Following are my steps for execution:

1) Run setup.sh to get the data
2) Make a small vocab.txt file in the folder 'grounding-embeddings/causal'
3) Run the grounding-embeddings/causal/Makefile command: make all

Is this correct?

lucy3 commented 5 years ago

Ah, the causal folder is still under development. Its contents weren't the focus of our paper, but rather a "casual" exploration of future research directions. Let me know if there is something specific that you're looking to do and I'll try to help you out.

SharmisthaJat commented 5 years ago

I was trying to replicate your paper's results for feature fit. Which scripts should I run to get results similar to the one shown in Table 3,4, Figure 1 etc. ? (http://aclweb.org/anthology/W17-2810).

lucy3 commented 5 years ago

Ok! Feature fit scores are calculated by feature_fit.py. You will want to set "PIVOT" to the word representation you want to use and "SOURCE" to "mcrae" or "cslb".
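For someone following along, those edits would look something like the snippet below at the top of feature_fit.py. The value strings here are hypothetical placeholders; the exact accepted values depend on the repo's code, not on anything stated in this thread.

```python
# Hypothetical values, for illustration only: check feature_fit.py
# for the actual option strings the script accepts.
PIVOT = "glove"    # the word representation to evaluate
SOURCE = "mcrae"   # feature norm dataset: "mcrae" or "cslb"
```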

SharmisthaJat commented 5 years ago

Hi,

Thanks. In feature_fit.py there is again a vocab.txt required for each of the vector representation folders, e.g. at line 52 of feature_fit.py. Is that the same vocab file? What should this file contain?

SharmisthaJat commented 5 years ago

Hi Lucy,

Can you please clarify the input required for line 52 in feature_fit.py (vocab.txt)?

lucy3 commented 5 years ago

Hi, I'm still in the process of finding this file. I did a bit of cleaning on my personal computer, so I do not have it on here. Once I find it, I will let you know.

It should contain words and their frequencies. It seems like it's usually created when someone runs the original GloVe C scripts with a certain flag on a custom dataset, so the public GloVe download does not include this file. I believe my co-author created this file, so I will check with him to see if he has more information.

I think this file is just used to load GloVe word embeddings using word2vec tools, and though there are alternative methods for turning GloVe formatting into word2vec formatting, I will try to find this file soon.
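For reference, a "words and their frequencies" vocab file in the GloVe toolkit's style is one space-separated "word count" pair per line. A minimal parsing sketch, with invented example counts (not real corpus statistics):

```python
# Parse vocab.txt lines of the form "word count" into a dict.
# The counts below are made-up example values for illustration.
def parse_vocab(lines):
    vocab = {}
    for line in lines:
        word, count = line.rsplit(" ", 1)  # split off the trailing count
        vocab[word] = int(count)
    return vocab

example = ["the 1000", "yellow 42", "woods 7"]
print(parse_vocab(example))  # → {'the': 1000, 'yellow': 42, 'woods': 7}
```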

Thank you for your patience.

SharmisthaJat commented 5 years ago

Thanks, I appreciate your help :)

lucy3 commented 5 years ago

Could you try removing the fvocab input from load_all_embeddings? You don't actually need it to reproduce production results. I am updating the code and the readme with more information on how to set things up.

Do let me know if you have any other questions.

SharmisthaJat commented 5 years ago

Hi,

I updated the GloVe embedding loading code to read from a text file and tried running subgraphs/feature_fit.py. But the code breaks at line 940 with the following output:

File "feature_fit.py", line 940, in main
    clfs = pickle.load(clf_f)
EOFError: Ran out of input

The classifier pickle seems to have an issue.
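For context, "EOFError: Ran out of input" is what pickle raises when the stream is empty or truncated, e.g. a classifier file that was never fully written. A minimal reproduction:

```python
import io
import pickle

# Unpickling an empty stream reproduces the error from the traceback above.
try:
    pickle.load(io.BytesIO(b""))  # an empty "pickle file"
except EOFError as exc:
    print(exc)  # → Ran out of input
```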

lucy3 commented 5 years ago

I managed to run feature_fit.py all the way through (after cloning this repo, downloading the data from scratch, following the ReadMe, etc). Are you using Python 3?

SharmisthaJat commented 5 years ago

Update: I was using Python 3.5, to be precise. I reran the file with Python 3 and now the error is gensim-related:

File "feature_fit.py", line 410, in analyze_classifiers
    all_embeddings.init_sims()

This may be due to me loading the GloVe vectors directly, not using gensim. Let me try converting the txt to bin and loading it with gensim using this repo: https://github.com/marekrei/convertvec

lucy3 commented 5 years ago

Hmmm, so I am using the code currently in the repo (with KeyedVectors.load_word2vec_format(INPUT, binary=False)). You should make sure the top of your GloVe input has the extra line indicated in the ReadMe. I think I also used the latest versions of gensim and other packages when I ran it.
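The "extra line" mentioned here is presumably the word2vec text header. Under that assumption, converting GloVe's plain text output to the format KeyedVectors.load_word2vec_format expects just means prepending a "num_vectors dimensions" line; a sketch:

```python
# Assumption: the word2vec text format differs from GloVe's plain text
# output only by a leading "<num_vectors> <dimensions>" header line.
def glove_to_word2vec_lines(glove_lines):
    n_vectors = len(glove_lines)
    dim = len(glove_lines[0].split()) - 1  # first token is the word itself
    return [f"{n_vectors} {dim}"] + list(glove_lines)

print(glove_to_word2vec_lines(["cat 0.1 0.2 0.3"])[0])  # → 1 3
```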

SharmisthaJat commented 5 years ago

Oh, I see, you have updated the code. I have been playing with the old one. Let me update and check.

SharmisthaJat commented 5 years ago

Hi Lucy,

It worked :), thanks for all the help and the interesting paper.

Best, Sharmistha