stanford-futuredata / Baleen

Baleen: Robust Multi-Hop Reasoning at Scale via Condensed Retrieval (NeurIPS'21)
MIT License

Training Script and Model Checkpoint for HotpotQA #5

Closed hyukkyukang closed 1 year ago

hyukkyukang commented 1 year ago

Hello,

I'm currently interested in testing Baleen using the HotpotQA dataset, specifically with the aim of reproducing the results outlined in Table 2 from your published paper.

Could you kindly share the training script that was used for training the model on the HotpotQA dataset? Or could you share the model checkpoint that was used to generate the results shown in Table 2?

Thank you in advance!

okhat commented 1 year ago

Of course! Which would be better? I can probably find the checkpoint more quickly, but I'm happy to provide either.

hyukkyukang commented 1 year ago

The checkpoint would be best! That said, I'm also quite interested in understanding how latent hop ordering is implemented in code, so it would be awesome if the training script could be shared as well :)

Thanks for the quick answer!

okhat commented 1 year ago

In a little bit (after the upload is completed), you should be able to download the HotPotQA checkpoints from:

wget https://downloads.cs.stanford.edu/nlp/data/colbert/baleen/unchecked.hotpotqa.checkpoints-v1.0.tar.gz

Notice I kept "unchecked" in the name. I didn't try to test these 25-month-old files, but I'm >90% sure these are the right HotPotQA checkpoints.

The HotPotQA corpus is the same as HoVer's. But you'll need the HotPotQA dev queries, which I assume you have.

If you run the indexing and retrieval pipelines (with the compression-based ColBERTv2) and observe the results from the paper (modulo the effect of compression), I'm happy to put this release next to the official HoVer one in the README (and remove the "unchecked" label).

Let me know if you can confirm this. If so, I'll be happy to gather the training scripts that produced these checkpoints.

okhat commented 1 year ago

By the way, I'd be happy to check it myself if you prefer. But I didn't want to block you until I get a chance to do this.

okhat commented 1 year ago

Oh, btw, when running the pipeline, don't forget to set the number of hops for HotPotQA to 2, not 4.

hyukkyukang commented 1 year ago

Thank you so much! I'll check it today and share the result when it's ready!

hyukkyukang commented 1 year ago

I've conducted an evaluation using the provided checkpoint; however, the accuracy appears to be quite low. I'd appreciate any help determining whether I made an error somewhere in my process.

Here's a detailed outline of the steps I've taken:

  1. Indexing and Inference:

    I used the following scripts for indexing and inference:

  2. Evaluation:

    I proceeded with the evaluation as follows:

    python evaluation/eval.py --pred_file ./experiments/default/hotpotqa_inference/2023-05/18/11.28.51/hotpotqa_output.json --dev_file ./data/hotpotqa/dev/qas.json --eval_type doc

    The script and data used for evaluation include:

  3. Results:

    The results of the evaluation are as follows:

    {
        "total": 7405,
        "exact": 50.182309250506414,
        "f1": 82.4313687662804,
        "hit5": 87.508440243079,
        "hit8": 87.73801485482782,
        "hit10": 87.88656313301823,
        "hit20": 88.6698176907495
    }

While I'm re-evaluating my steps for potential errors, I would greatly appreciate it if you could also take a look.

okhat commented 1 year ago

So the scripts in the repo are more directly useful for evaluating Psg-EM from Table 2, which for Baleen is 86.7% in the paper. Getting the correct top-20 for hit20 is a bit different, because you don't just want a "bag" of passages, but you want the right split of passages from the first hop and the second hop.
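
(For contrast, a plain "bag"-style hit@k over the predicted PIDs, roughly what a generic doc-level eval computes, would look something like the sketch below. It's for illustration only, not the per-hop metric the paper reports, which additionally requires the right split of passages between the two hops.)

def bag_hit_at_k(prediction, gold_pids, k):
    # Illustration only: a naive "bag"-style hit@k, NOT the per-hop hit20 described above.
    # prediction is [[score, [pid, sid]], ...]; rank unique PIDs by their best
    # sentence score and check whether every gold PID appears among the top k.
    ranked = sorted(prediction, reverse=True)
    pids = []
    for _, (pid, _) in ranked:
        if pid not in pids:
            pids.append(pid)
    return set(gold_pids) <= set(pids[:k])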

Let's first check Psg-EM. Here's my evaluation logic:


import ujson

# Assumes K (max condensed sentences, 5 here) and Dev (qid -> gold dev example
# with its 'support_facts') are defined elsewhere; see the note just below.

def f7(seq):
    # Order-preserving deduplication.
    seen = set()
    seen_add = seen.add
    return [x for x in seq if not (x in seen or seen_add(x))]

path = '/future/u/okhattab/experiments.Jan26/HotPotQA.Baleen/Condensers.L2/C2.Hn.cv1/inference/b10000.HotPotQA.Baleen.C2.Hn.dev.H2.new10k.cv0/condensed.json'

PsgEM = []
with open(path) as f:
    for line in f:
        example = ujson.loads(line)

        preds = example['prediction']

        # NOTE: your files have probably done some version of this filtering, but it may be geared toward HoVer.
        preds = sorted(preds, reverse=True)  # sort by score (descending)
        preds = [x for _, x in preds]  # drop the scores, keep the [pid, sid] pairs
        preds = list(map(tuple, preds[:K]))  # at most 5 sentences

        # Keep at most (or rather exactly) two PIDs.
        if len(set([pid for pid, _ in preds])) > 2:
            first_two_pids = f7([pid for pid, _ in preds])[:2]
            preds = [(pid, sid) for pid, sid in preds if pid in first_two_pids]

        gold = Dev[example['qid']]['support_facts']
        # ceil = list(map(tuple, gold))  # sentence-level gold; unused in this passage-level check

        psg_em = set([pid for pid, _ in preds]) == set([pid for pid, _ in gold])
        PsgEM.append(psg_em)

sum(PsgEM) / len(PsgEM)
# should be 86.7%
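
(K and Dev aren't defined in the snippet above: K is 5 here, and Dev maps each qid to its gold dev entry. Assuming your dev qas file is JSON-lines with qid and support_facts fields, something roughly like this should work; adjust the path and parsing to whatever format you actually have.)

import ujson

K = 5  # keep at most 5 condensed sentences, matching the comment above

# Assumption: one JSON object per line, each with at least 'qid' and 'support_facts'.
Dev = {}
with open('./data/hotpotqa/dev/qas.json') as f:  # path taken from your eval command
    for line in f:
        ex = ujson.loads(line)
        Dev[ex['qid']] = ex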

I have also checked my file above, and it seems to match yours overall in its highest-scoring sentences.

Here is the top of my file (notice it's a different format):

{"qid":0,"question":"Were Scott Derrickson and Ed Wood of the same nationality?","prediction":[[8.234375,[536450,0]],[5.8203125,[967320,0]],[-3.794921875,[2373782,0]],[-9.2421875,[2398774,0]],[-9.3359375,[5032022,1]],[-9.3671875,[1883474,0]],[-9.234375,[1480374,0]],[-9.2265625,[1153807,0]],[-9.375,[430108,0]]]}
{"qid":1,"question":"What government position was held by the woman who portrayed Corliss Archer in the film Kiss and Tell?","prediction":[[8.09375,[2120546,0]],[4.66796875,[4683086,1]],[-9.1875,[967281,0]],[-0.2236328125,[4683086,0]],[-9.2734375,[3195441,1]],[-9.2734375,[1678729,1]],[-9.1171875,[3190546,1]],[-9.046875,[968157,0]],[-9.1171875,[260050,0]]]}
{"qid":2,"question":"What science fantasy young adult series, told in first person, has a set of companion books narrating the stories of enslaved worlds and alien species?","prediction":[[3.802734375,[1249454,1]],[2.923828125,[1249454,0]],[0.56298828125,[126385,0]],[-2.505859375,[157352,0]],[1.6318359375,[176789,2]],[-0.10064697265625,[157352,2]],[-2.232421875,[157352,3]],[-0.270263671875,[176789,0]]]}
{"qid":3,"question":"Are the Laleli Mosque and Esma Sultan Mansion located in the same neighborhood?","prediction":[[7.93359375,[1353673,0]],[8.296875,[2687076,0]],[-6.89453125,[607206,0]],[-8.3671875,[504261,0]],[-9.1015625,[3991928,0]],[-8.625,[3051166,0]],[-9.359375,[1723248,0]],[-8.6875,[4127983,0]]]}
{"qid":4,"question":"The director of the romantic comedy \"Big Stone Gap\" is based in what New York city?","prediction":[[8.1171875,[986521,0]],[8.296875,[5114725,0]],[-9.390625,[4705503,0]],[-9.3359375,[693237,0]],[-9.421875,[3530945,0]],[-8.15625,[5114725,1]],[-9.328125,[4768609,0]],[-9.3359375,[4767627,0]],[-9.3125,[1578479,0]]]}
{"qid":5,"question":"2014 S\/S is the debut album of a South Korean boy group that was formed by who?","prediction":[[6.421875,[3184586,0]],[7.48828125,[3333904,0]],[-6.64453125,[4651129,1]],[-8.671875,[4933676,0]],[-9.03125,[2325846,1]],[-8.6640625,[298793,1]],[-7.8359375,[298793,0]],[-8.2734375,[439851,0]],[-8.7890625,[4933676,3]]]}
{"qid":6,"question":"Who was known by his stage name Aladin and helped organizations improve their performance as a consultant?","prediction":[[7.41796875,[2374219,0]],[6.375,[261087,0]],[-8.9375,[4807944,0]],[-7.8671875,[1919128,0]],[-6.671875,[3192216,0]],[-8.875,[4289908,1]],[-8.203125,[1919128,3]],[-8.2421875,[3191341,0]],[-7.93359375,[3192216,2]]]}
{"qid":7,"question":"The arena where the Lewiston Maineiacs played their home games can seat how many people?","prediction":[[7.30859375,[3533981,0]],[3.328125,[3533979,1]],[0.1922607421875,[3533979,0]],[-9.2265625,[3601971,3]],[-9.2734375,[3532376,0]],[-9.3125,[3532378,0]],[-9.265625,[3532846,0]],[-9.28125,[5214971,0]],[-9.1953125,[608235,1]]]}
{"qid":8,"question":"Who is older, Annie Morton or Terry Richardson?","prediction":[[8.4140625,[3509753,0]],[8.546875,[330695,0]],[-9.0390625,[2204182,0]],[-9.2578125,[4436742,0]],[-8.875,[1829529,2]],[-9.2109375,[3530174,0]],[-9.2578125,[3266617,0]],[-9.265625,[1165865,0]],[-9.0859375,[4336666,0]]]}
{"qid":9,"question":"Are Local H and For Against both from the United States?","prediction":[[8.3984375,[4161666,0]],[8.359375,[4699440,0]],[-9.3125,[537904,0]],[-8.96875,[2744539,0]],[-9.2109375,[4444221,0]],[-9.140625,[4707143,0]],[-9.3671875,[1771660,0]],[-9.1953125,[3407789,0]],[-9.1015625,[2205580,0]]]}
{"qid":10,"question":"What is the name of the fight song of the university whose main campus is in Lawrence, Kansas and whose branch campuses are in the Kansas City metropolitan area?","prediction":[[3.8515625,[2202433,2]],[4.02734375,[2202433,1]],[0.83056640625,[2202433,0]],[-0.72998046875,[1277641,0]],[-4.56640625,[955584,0]],[-5.640625,[955584,3]],[-3.55078125,[4869057,0]],[-7.32421875,[839762,0]],[-6.15625,[1412468,0]]]}
{"qid":11,"question":"What screenwriter with credits for \"Evolution\" co-wrote a film starring Nicolas Cage and Te\u0301a Leoni?","prediction":[[7.45703125,[957780,0]],[5.078125,[1360245,1]],[0.55126953125,[1360245,0]],[-8.46875,[1360239,0]],[-8.84375,[3191607,0]],[-9.046875,[2209894,1]],[-9.2734375,[3556249,0]],[-8.5546875,[2957999,0]],[-8.4765625,[2209894,0]]]}
{"qid":12,"question":"What year did Guns N Roses perform a promo for a movie starring Arnold Schwarzenegger as a former New York Police detective?","prediction":[[4.3359375,[1418521,1]],[4.98828125,[541179,1]],[3.794921875,[1418521,0]],[2.240234375,[541179,0]],[-9.0078125,[3264084,1]],[-8.328125,[4314384,1]],[-8.15625,[2198244,0]],[-7.328125,[537904,0]],[-7.14453125,[537904,1]]]}
{"qid":13,"question":"Are Random House Tower and 888 7th Avenue both used for real estate?","prediction":[[8.0703125,[1453219,0]],[2.814453125,[5197046,0]],[0.4072265625,[5197046,2]],[-7.98046875,[937414,0]],[-7.04296875,[4378635,1]],[-4.37109375,[4378635,0]],[-6.39453125,[937414,1]],[-9.2265625,[3407328,1]],[-9.1484375,[3516718,1]]]}
{"qid":14,"question":"The football manager who recruited David Beckham managed Manchester United during what timeframe?","prediction":[[3.947265625,[363674,0]],[2.0546875,[2795318,3]],[-0.419677734375,[2795318,2]],[-3.294921875,[5196977,0]],[-0.296142578125,[2200838,0]],[-8.375,[341602,2]],[-7.34375,[1857379,0]],[-8.7734375,[4159347,2]],[-8.78125,[4693408,3]]]}
{"qid":15,"question":"Brown State Fishing Lake is in a country that has a population of how many inhabitants ?","prediction":[[6.125,[2627373,0]],[-5.7265625,[2198809,0]],[-7.97265625,[967436,1]],[-4.62109375,[2093873,0]],[-7.81640625,[1684547,2]],[-7.0078125,[3267142,6]],[-7.75390625,[968217,2]],[-8.0859375,[2200173,1]],[-9.203125,[5182786,2]]]}
{"qid":16,"question":"The Vermont Catamounts men's soccer team currently competes in a conference that was formerly known as what from 1988 to 1996?","prediction":[[3.873046875,[2039110,1]],[2.529296875,[2039110,0]],[6.3984375,[2377118,1]],[2.361328125,[2377118,0]],[-9.1484375,[2476154,0]],[-9.09375,[1513000,4]],[-9.078125,[375798,1]],[-9.171875,[4766903,1]],[-9.21875,[3917026,2]]]}

Let's get the Psg-EM evaluation to be the same and then I'll share how to evaluate hit20.

okhat commented 1 year ago

Btw, I quickly simplified the eval logic when copying it here. Just a note to self that the full notebook is at:

/dfs/scratch0/okhattab/Jupyter/Work/2020-Dec/HotPotQA/2021-Apr-Eval.ipynb

okhat commented 1 year ago

Btw, Psg-EM may be equivalent to hit2 in your evaluation logic; I didn't check, but it seems so.

Based on that, I think your hit2 will be very close to 86.7%, since most of the examples in your outputs have only 2 PIDs. But it's worth checking that explicitly.

hyukkyukang commented 1 year ago

I apologize for the late reply. I've been away due to a health issue.

I've just had the chance to re-evaluate the Psg-EM using the evaluation logic you kindly provided. I'm pleased to report that I achieved a score of 86.2%, which aligns closely with the 86.7% reported in the original paper.

Thank you for your help!

hyukkyukang commented 1 year ago

Btw, I would greatly appreciate it if you could share the training script when you have some time!