rajicon opened this issue 1 year ago
Sorry for the confusion. spin_evaluate.py is for GPT and GPT-2, and Bert_spin_evaluate.py is for both BERT and RoBERTa. After you process the results with that code, the aggregate code looks at the whole picture, breaking the results down more clearly by word frequency. IIRC I may have run different parts of the test set in parallel, creating several intermediate results, so aggregate composes all of those files into a single concrete results file. Another thing to note: while the evaluation is on the test set, the code distinguishes between words that were seen only during training, only during testing, or in both sets, referring to the last group as 'shared terms'.
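For illustration, here is a minimal sketch of what that "compose intermediate results into one file" step could look like. The shard file names and the pickled `(total, correct)` Counter format are assumptions for the example, not the repo's actual format:

```python
# Hypothetical sketch: merging per-shard evaluation outputs before running aggregate.
# File names and the pickled (total, correct) format are assumptions, not the repo's layout.
import glob
import pickle
from collections import Counter

total, correct = Counter(), Counter()
for path in glob.glob("results_shard_*.pkl"):   # intermediate files from parallel runs
    with open(path, "rb") as f:
        shard_total, shard_correct = pickle.load(f)
    total.update(shard_total)                   # occurrences of each token in this shard
    correct.update(shard_correct)               # correct predictions of each token in this shard

with open("results_all.pkl", "wb") as f:
    pickle.dump((total, correct), f)            # single combined results file
```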
My goal is to replicate the experiment in Figure 1/2, and I have some more questions:
(1) For spin_evaluate, is the input supposed to be Wiki-103? Is there any special preprocessing/formatting needed based on that corpus?
(2) aggregate.py requires 4 files: 'high'+args.sett, 'mid1'+args.sett, 'mid2'+args.sett, and 'low'+args.sett. What exactly are these files, and how do I generate or obtain them? Also, what is mid1 vs mid2 exactly?
If I'm understanding correctly, we call spin_evaluate on Wiki-103 to generate results, and then aggregate with the frequency files to break it down. Is that right? Am I missing anything in this process?
(1) Yes. Wiki-103 has both a train and a test set. Make sure you first fine-tune with the train set, and then use this code to evaluate the results on the test set. No special pre-processing; I used each model's own tokenizer. Notice that Wiki-103 is odd in that the lowest frequency bin has a very low number of tokens and types (Figures 1 and 2; n=3,419 and n=6.5e3). This defies Zipf's law and is attributed to the heavy preprocessing applied to the corpus.
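A minimal sketch of loading the WikiText-103 test split and tokenizing it with each model's own tokenizer (the HuggingFace dataset/config names and any further handling inside spin_evaluate.py are assumptions here):

```python
# Sketch only: load the WikiText-103 test split and tokenize with the model's own tokenizer.
from datasets import load_dataset
from transformers import AutoTokenizer

test = load_dataset("wikitext", "wikitext-103-raw-v1", split="test")
tokenizer = AutoTokenizer.from_pretrained("gpt2")   # or "bert-base-cased", "roberta-base", ...

for example in test:
    if example["text"].strip():                      # skip empty lines in the raw dump
        token_ids = tokenizer(example["text"])["input_ids"]
        # ... feed token_ids to the fine-tuned model and record per-token predictions ...
```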
(2) args.sett likely refers to the test/train set. I needed to count the number of events in the train set -- this number represents how many times a token was seen during training, in order to associate it with the correct bin (i.e., the right label). Then, for the test set, I needed to know how many times a token was correctly predicted, out of how many instances occurred in the test set overall, to compute the percentage. Unfortunately I can't find the code that produces these mid1, mid2, etc. files, but initially I had 4 bins (10, 100, 1000, and 10k); later, due to the very limited number of 10k word types, it didn't make sense, so I collapsed them to 3 (as described in the paper).
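To make that concrete, here is a sketch of the binning logic described above, not the original script: count how often each (lower-cased) token appears in training, assign it a frequency bin, then report per-bin accuracy as correct / total over the test instances. The exact thresholds and the bin names are assumptions:

```python
# Sketch of the described binning + per-bin accuracy; thresholds and names are assumed.
from collections import Counter

def train_counts(train_tokens):
    """Occurrences of each token in the training set (used to pick its bin)."""
    return Counter(t.lower() for t in train_tokens)

def bin_of(count):
    # Originally four bins (10, 100, 1000, 10k); the 10k bin was later merged away.
    if count < 10:
        return "low"
    if count < 100:
        return "mid1"
    if count < 1000:
        return "mid2"
    return "high"

def per_bin_accuracy(test_results, counts):
    """test_results: iterable of (token, was_predicted_correctly) pairs from evaluation."""
    total, correct = Counter(), Counter()
    for token, ok in test_results:
        b = bin_of(counts[token.lower()])
        total[b] += 1
        correct[b] += int(ok)
    return {b: correct[b] / total[b] for b in total}
```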
You are right! But there might be another post-processing step after evaluate and before aggregate that allocates the test set instances to their appropriate bin (based on training frequency, as explained above).
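Since that script could not be found, here is a hypothetical reconstruction of what that missing post-processing step might do: split the evaluation output into the 'high'/'mid1'/'mid2'/'low' + args.sett files that aggregate.py reads, based on each token's training-frequency bin. The one-instance-per-line file layout is an assumption:

```python
# Hypothetical post-processing: write per-bin files named e.g. "hightest", "mid1test", ...
# Uses bin_of() and the training counts from the sketch above.
def write_bin_files(test_results, counts, sett="test"):
    files = {b: open(b + sett, "w") for b in ("high", "mid1", "mid2", "low")}
    try:
        for token, ok in test_results:
            b = bin_of(counts[token.lower()])
            files[b].write(f"{token}\t{int(ok)}\n")   # assumed format: token <tab> 0/1
    finally:
        for f in files.values():
            f.close()
```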
One more question: how is case handled in frequencies? For example, are "Bird" and "bird" considered different words, or the same word when looking at frequencies? (This is for the cased models, of course)
There was a tokenization step in which all tokens were lowercased.
I'm a little confused on how to use this. Should I be running aggregate or spin_evaluate? What does each do?