tagoyal / factuality-datasets


Unable to reproduce generation data #5

Closed zerocstaker closed 3 years ago

zerocstaker commented 3 years ago

Hi Tanya,

I was also trying to reproduce the generation data on XSum, i.e. running the best_ckpt model on XSum and getting the corresponding labels, but I am getting a mismatch in the number of labels deemed unfactual. Any pointers on how to generate such datasets would be helpful!

I am running a modified script based on evaluate_generated_outputs.py, which I have attached below. The only change is that I am using Stanza's CoreNLP client with parser 3.6.0 instead of the pycorenlp package, since I was getting errors with the pycorenlp part. I have checked that I am using the same parser version. I have marked my changes with comments of the form # CHANGES!
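
For reference, the swap amounts to roughly this (a minimal sketch; the endpoint, port, and example text are illustrative, not copied from the repo):

text = "Gareth Davies became WRU chairman in October."

# pycorenlp (original script): thin client for an already-running CoreNLP server.
from pycorenlp import StanfordCoreNLP
nlp = StanfordCoreNLP("http://localhost:9000")
ann = nlp.annotate(
    text,
    properties={"annotators": "tokenize,ssplit,pos,depparse", "outputFormat": "json"},
)

# stanza (my script): client that launches and manages the CoreNLP server itself.
from stanza.server import CoreNLPClient
with CoreNLPClient(
    annotators=["tokenize", "ssplit", "pos", "depparse"], output_format="json"
) as client:
    ann = client.annotate(text)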

For the model, I am using DAE_xsum_human_best_ckpt, and I run the evaluation with: python eval_gen_out.py --model_type electra_dae --model_dir DAE_xsum_human_best_ckpt --input_file test.txt

The test file I tried contains both the first data line of train.tsv and the untokenized version from the original XSum dataset:

Original text:

Recent reports have linked some France-based players with returns to Wales."I've always felt - and this is with my rugby hat on now; this is not region or WRU - I'd rather spend that money on keeping players in Wales," said Davies.The WRU provides £2m to the fund and £1.3m comes from the regions.Former Wales and British and Irish Lions fly-half Davies became WRU chairman on Tuesday 21 October, succeeding deposed David Pickering following governing body elections.He is now serving a notice period to leave his role as Newport Gwent Dragons chief executive after being voted on to the WRU board in September.Davies was among the leading figures among Dragons, Ospreys, Scarlets and Cardiff Blues officials who were embroiled in a protracted dispute with the WRU that ended in a £60m deal in August this year.In the wake of that deal being done, Davies said the £3.3m should be spent on ensuring current Wales-based stars remain there.In recent weeks, Racing Metro flanker Dan Lydiate was linked with returning to Wales.Likewise the Paris club's scrum-half Mike Phillips and centre Jamie Roberts were also touted for possible returns.Wales coach Warren Gatland has said: "We haven't instigated contact with the players."But we are aware that one or two of them are keen to return to Wales sooner rather than later."Speaking to Scrum V on BBC Radio Wales, Davies re-iterated his stance, saying keeping players such as Scarlets full-back Liam Williams and Ospreys flanker Justin Tipuric in Wales should take precedence."It's obviously a limited amount of money [available]. The union are contributing 60% of that contract and the regions are putting £1.3m in."So it's a total pot of just over £3m and if you look at the sorts of salaries that the... guys... have been tempted to go overseas for [are] significant amounts of money."So if we were to bring the players back, we'd probably get five or six players."And I've always felt - and this is with my rugby hat on now; this is not region or WRU - I'd rather spend that money on keeping players in Wales."There are players coming out of contract, perhaps in the next year or so… you're looking at your Liam Williams' of the world; Justin Tipuric for example - we need to keep these guys in Wales."We actually want them there. They are the ones who are going to impress the young kids, for example."They are the sort of heroes that our young kids want to emulate."So I would start off [by saying] with the limited pot of money, we have to retain players in Wales."Now, if that can be done and there's some spare monies available at the end, yes, let's look to bring players back."But it's a cruel world, isn't it?"It's fine to take the buck and go, but great if you can get them back as well, provided there's enough money."British and Irish Lions centre Roberts has insisted he will see out his Racing Metro contract.He and Phillips also earlier dismissed the idea of leaving Paris.Roberts also admitted being hurt by comments in French Newspaper L'Equipe attributed to Racing Coach Laurent Labit questioning their effectiveness.Centre Roberts and flanker Lydiate joined Racing ahead of the 2013-14 season while scrum-half Phillips moved there in December 2013 after being dismissed for disciplinary reasons by former club Bayonne.
New Welsh Rugby Union chairman Gareth Davies believes a joint £3.3m WRU-regions fund should be used to retain home-based talent such as Liam Williams, not bring back exiled stars.

Output:

Arc:    [CLS] compound [SEP] new [SEP] welsh [SEP]
Pred:   1
Probs:  0=0.36521634459495544   1=0.6347836852073669
Arc:    [CLS] compound [SEP] welsh [SEP] union [SEP]
Pred:   1
Probs:  0=0.406429260969162     1=0.5935707092285156
Arc:    [CLS] compound [SEP] rugby [SEP] union [SEP]
Pred:   1
Probs:  0=0.2734803855419159    1=0.7265195846557617
Arc:    [CLS] compound [SEP] union [SEP] chairman [SEP]
Pred:   1
Probs:  0=0.30534377694129944   1=0.6946561932563782
Arc:    [CLS] compound [SEP] chairman [SEP] davies [SEP]
Pred:   1
Probs:  0=0.3313113749027252    1=0.6686886548995972
Arc:    [CLS] compound [SEP] gareth [SEP] davies [SEP]
Pred:   0
Probs:  0=0.8658909797668457    1=0.1341089904308319
Arc:    [CLS] nsubj [SEP] davies [SEP] believes [SEP]
Pred:   1
Probs:  0=0.0752885490655899    1=0.9247114658355713
Arc:    [CLS] amod [SEP] joint [SEP] fund [SEP]
Pred:   1
Probs:  0=0.13915422558784485   1=0.8608457446098328
Arc:    [CLS] nummod [SEP] £ [SEP] fund [SEP]
Pred:   1
Probs:  0=0.06259439885616302   1=0.9374055862426758
Arc:    [CLS] compound [SEP] 3. 3 [SEP] m [SEP]
Pred:   1
Probs:  0=0.1014985516667366    1=0.898501455783844
Arc:    [CLS] nummod [SEP] m [SEP] £ [SEP]
Pred:   1
Probs:  0=0.09426180273294449   1=0.9057382345199585
Arc:    [CLS] compound [SEP] wru [SEP] regions [SEP]
Pred:   1
Probs:  0=0.08483617752790451   1=0.9151638746261597
Arc:    [CLS] compound [SEP] regions [SEP] fund [SEP]
Pred:   1
Probs:  0=0.07002974301576614   1=0.9299702644348145
Arc:    [CLS] obj [SEP] fund [SEP] believes [SEP]
Pred:   1
Probs:  0=0.052373096346855164  1=0.9476269483566284
Arc:    [CLS] nsubj : pass : xsubj [SEP] fund [SEP] used [SEP]
Pred:   1
Probs:  0=0.04616169631481171   1=0.9538382887840271
Arc:    [CLS] aux : pass [SEP] be [SEP] used [SEP]
Pred:   1
Probs:  0=0.05535976216197014   1=0.9446402192115784
Arc:    [CLS] xcomp [SEP] used [SEP] believes [SEP]
Pred:   1
Probs:  0=0.08044032007455826   1=0.9195597171783447
Arc:    [CLS] xcomp [SEP] retain [SEP] used [SEP]
Pred:   1
Probs:  0=0.06058729439973831   1=0.9394127130508423
Arc:    [CLS] obl [SEP] home [SEP] based [SEP]
Pred:   1
Probs:  0=0.32555949687957764   1=0.6744405627250671
Arc:    [CLS] amod [SEP] based [SEP] talent [SEP]
Pred:   1
Probs:  0=0.22164691984653473   1=0.7783530950546265

Sent-level pred:        0

Tokenized text:

recent reports have linked some france - based players with returns to wales . ` ` i ' ve always felt - and this is with my rugby hat on now ; this is not region or wru - i ' d rather spend that money on keeping players in wales , ' ' said davies . the wru provides # 2m to the fund and # 1 . 3 m comes from the regions . former wales and british and irish lions fly - half davies became wru chairman on tuesday 21 october , succeeding deposed david pickering following governing body elections . he is now serving a notice period to leave his role as newport gwent dragons chief executive after being voted on to the wru board in september . davies was among the leading figures among dragons , ospreys , scarlets and cardiff blues officials who were embroiled in a protracted dispute with the wru that ended in a # 60m deal in august this year . in the wake of that deal being done , davies said the # 3 . 3 m should be spent on ensuring current wales - based stars remain there . in recent weeks , racing metro flanker dan lydiate was linked with returning to wales . likewise the paris club ' s scrum - half mike phillips and centre jamie roberts were also touted for possible returns . wales coach warren gatland has said : ` ` we have n ' t instigated contact with the players . ` ` but we are aware that one or two of them are keen to return to wales sooner rather than later . ' ' speaking to scrum v on bbc radio wales , davies re - iterated his stance , saying keeping players such as scarlets full - back liam williams and ospreys flanker justin tipuric in wales should take precedence . ` ` it ' s obviously a limited amount of money - lsb - available - rsb - . the union are contributing 60 % of that contract and the regions are putting # 1 . 3 m in . ` ` so it ' s a total pot of just over # 3m and if you look at the sorts of salaries that the . . . guys . . . have been tempted to go overseas for - lsb - are - rsb - significant amounts of money . ` ` so if we were to bring the players back , we ' d probably get five or six players . `
new welsh rugby union chairman gareth davies believes a joint # 3 . 3 m wru - regions fund should be used to retain home - based talent such as liam williams , not bring back exiled stars .

Output:

Arc:    [CLS] amod [SEP] new [SEP] chairman [SEP]
Pred:   1
Probs:  0=0.48605602979660034   1=0.5139439702033997
Arc:    [CLS] compound [SEP] welsh [SEP] rugby [SEP]
Pred:   1
Probs:  0=0.394680917263031     1=0.6053191423416138
Arc:    [CLS] compound [SEP] rugby [SEP] union [SEP]
Pred:   1
Probs:  0=0.26655247807502747   1=0.7334474921226501
Arc:    [CLS] compound [SEP] union [SEP] chairman [SEP]
Pred:   1
Probs:  0=0.3756413459777832    1=0.6243586540222168
Arc:    [CLS] compound [SEP] chairman [SEP] davies [SEP]
Pred:   1
Probs:  0=0.42923763394355774   1=0.5707623362541199
Arc:    [CLS] compound [SEP] gareth [SEP] davies [SEP]
Pred:   0
Probs:  0=0.9057561159133911    1=0.09424389898777008
Arc:    [CLS] nsubj [SEP] davies [SEP] believes [SEP]
Pred:   1
Probs:  0=0.12402770668268204   1=0.8759722113609314
Arc:    [CLS] amod [SEP] joint [SEP] # [SEP]
Pred:   1
Probs:  0=0.2475324124097824    1=0.7524675726890564
Arc:    [CLS] obj [SEP] # [SEP] believes [SEP]
Pred:   1
Probs:  0=0.09619461745023727   1=0.9038053750991821
Arc:    [CLS] nsubj : pass : xsubj [SEP] # [SEP] used [SEP]
Pred:   1
Probs:  0=0.05678100138902664   1=0.943219006061554
Arc:    [CLS] nummod [SEP] 3 [SEP] # [SEP]
Pred:   1
Probs:  0=0.1271098256111145    1=0.8728901743888855
Arc:    [CLS] nummod [SEP] 3 [SEP] m [SEP]
Pred:   1
Probs:  0=0.09759896993637085   1=0.9024010300636292
Arc:    [CLS] compound [SEP] m [SEP] fund [SEP]
Pred:   1
Probs:  0=0.07375533878803253   1=0.9262446761131287
Arc:    [CLS] compound [SEP] wru [SEP] regions [SEP]
Pred:   1
Probs:  0=0.08375474065542221   1=0.9162452220916748
Arc:    [CLS] compound [SEP] regions [SEP] fund [SEP]
Pred:   1
Probs:  0=0.062137406319379807  1=0.9378626346588135
Arc:    [CLS] appos [SEP] fund [SEP] # [SEP]
Pred:   1
Probs:  0=0.10268472135066986   1=0.8973153233528137
Arc:    [CLS] aux : pass [SEP] be [SEP] used [SEP]
Pred:   1
Probs:  0=0.04920430853962898   1=0.9507957100868225
Arc:    [CLS] xcomp [SEP] used [SEP] believes [SEP]
Pred:   1
Probs:  0=0.10511793941259384   1=0.8948820233345032
Arc:    [CLS] xcomp [SEP] retain [SEP] used [SEP]
Pred:   1
Probs:  0=0.057261254638433456  1=0.9427387118339539
Arc:    [CLS] obl [SEP] home [SEP] based [SEP]
Pred:   1
Probs:  0=0.3305603563785553    1=0.6694396138191223

Sent-level pred:        0

As you can tell, both runs flag only one dependency as unfactual, and I don't think this matches the output_ids 3 2 5 4 1 0 6

The modified script:

from stanza.server import CoreNLPClient
from train import MODEL_CLASSES
from torch.utils.data import DataLoader, SequentialSampler, TensorDataset
import torch
import numpy as np
from train_utils import convert_examples_to_features
import argparse
from sklearn.utils.extmath import softmax

parser = argparse.ArgumentParser()
parser.add_argument("--model_type", type=str, required=True)
parser.add_argument("--model_dir", type=str, required=True)
parser.add_argument("--max_seq_length", default=512)
parser.add_argument(
    "--input_file",
    type=str,
    required=False,
)
parser.add_argument("--gpu_device", type=int, default=0, help="gpu device")

def clean_phrase(phrase):
    phrase = phrase.replace("\\n", "")
    phrase = phrase.replace("\\'s", "'s")
    phrase = phrase.lower()
    return phrase
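
# NOTE: get_tokens and get_token_indices below are helpers carried over from
# evaluate_generated_outputs.py; nothing in this modified script calls them.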

def get_tokens(sent):
    parse = nlp.annotate(
        sent,
        properties={
            "annotators": "tokenize",
            "outputFormat": "json",
            "ssplit.isOneSentence": True,
        },
    )
    tokens = [
        (tok["word"], tok["characterOffsetBegin"], tok["characterOffsetEnd"])
        for tok in parse["tokens"]
    ]
    return tokens

def get_token_indices(tokens, start_idx, end_idx):
    for i, (word, s_idx, e_idx) in enumerate(tokens):
        if s_idx <= start_idx < e_idx:
            tok_start_idx = i
        if s_idx <= end_idx <= e_idx:
            tok_end_idx = i + 1
            break

    return tok_start_idx, tok_end_idx
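
# Runs the DAE model on a single (article, summary) pair and prints the
# per-arc predictions plus an any-arc-negative sentence-level prediction.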

def evaluate_summary(article_data, summary, tokenizer, model, nlp, args):
    # CHANGES! Calling my own function. Only change is this line
    eval_dataset = get_single_features_from_deps_and_context(
        article_data, summary, tokenizer, nlp, args
    )
    eval_sampler = SequentialSampler(eval_dataset)
    eval_dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=1)
    batch = [t for t in eval_dataloader][0]
    device = args.device
    batch = tuple(t.to(device) for t in batch)

    with torch.no_grad():
        input_ids, attention, child, head = batch[0], batch[1], batch[2], batch[3]
        mask_entail, mask_cont, num_dependency, arcs = (
            batch[4],
            batch[5],
            batch[6],
            batch[7],
        )
        sent_labels = batch[8]

        inputs = {
            "input_ids": input_ids,
            "attention": attention,
            "child": child,
            "head": head,
            "mask_entail": mask_entail,
            "mask_cont": mask_cont,
            "num_dependency": num_dependency,
            "sent_label": sent_labels,
            "device": device,
        }

        outputs = model(**inputs)
        dep_outputs = outputs[1].detach()
        dep_outputs = dep_outputs.squeeze(0)
        dep_outputs = dep_outputs[:num_dependency, :].cpu().numpy()

        input_full = tokenizer.convert_ids_to_tokens(
            input_ids[0], skip_special_tokens=False
        )
        input_full = " ".join(input_full).replace("[PAD]", "").strip()

        summary = input_full.split("[SEP]")[1].strip()

        print(f"Input Article:\t{input_full}")
        print(f"Generated summary:\t{summary}")

        num_negative = 0.0
        for j, arc in enumerate(arcs[0]):
            arc_text = tokenizer.decode(arc)
            arc_text = arc_text.replace(tokenizer.pad_token, "").strip()

            if arc_text == "":  # for bert
                break

            softmax_probs = softmax([dep_outputs[j]])
            pred = np.argmax(softmax_probs[0])
            if pred == 0:
                num_negative += 1
            print(f"Arc:\t{arc_text}")
            print(f"Pred:\t{pred}")
            print(f"Probs:\t0={softmax_probs[0][0]}\t1={softmax_probs[0][1]}")

        print("\n")
        if num_negative > 0:
            print(f"Sent-level pred:\t0\n\n")
        else:
            print(f"Sent-level pred:\t1\n\n")

# CHANGES! Instead of using the function from train_utils, annotate with Stanza's CoreNLP client. Everything from ex = {...} onward should be the same.
def get_single_features_from_deps_and_context(document, summary, tokenizer, nlp, args):

    summary_ann = nlp.annotate(
        summary, annotators=["tokenize", "ssplit", "pos", "depparse"]
    )
    document_ann = nlp.annotate(document, annotators=["tokenize"])

    summary_tok, summary_pos, summary_dep = get_relevant_deps_and_context_from_ann(
        summary_ann
    )

    doc_tok = []
    for tok in document_ann["tokens"]:
        doc_tok.append(tok["word"])
    # ABOVE are the changes
    ex = {
        "input": " ".join(doc_tok),
        "deps": [],
        "context": " ".join(summary_tok),
        "sentlabel": 1,
    }
    for dep in summary_dep:
        ex["deps"].append(
            {
                "dep": dep["dep"],
                "label": 1,
                "head_idx": dep["head_idx"] - 1,
                "child_idx": dep["child_idx"] - 1,
                "child": dep["child"],
                "head": dep["head"],
            }
        )

    dict_temp = {
        "id": 0,
        "input": ex["input"],
        "sentlabel": ex["sentlabel"],
        "context": ex["context"],
    }
    for i in range(20):
        if i < len(ex["deps"]):
            dep = ex["deps"][i]
            dict_temp["dep_idx" + str(i)] = (
                str(dep["child_idx"]) + " " + str(dep["head_idx"])
            )
            dict_temp["dep_words" + str(i)] = str(dep["child"]) + " " + str(dep["head"])
            dict_temp["dep" + str(i)] = dep["dep"]
            dict_temp["dep_label" + str(i)] = dep["label"]
        else:
            dict_temp["dep_idx" + str(i)] = ""
            dict_temp["dep_words" + str(i)] = ""
            dict_temp["dep" + str(i)] = ""
            dict_temp["dep_label" + str(i)] = ""

    features = convert_examples_to_features(
        [dict_temp],
        tokenizer,
        max_length=args.max_seq_length,
        pad_token=tokenizer.convert_tokens_to_ids([tokenizer.pad_token])[0],
    )

    all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
    input_attention_mask = torch.tensor(
        [f.input_attention_mask for f in features], dtype=torch.long
    )

    child_indices = torch.tensor([f.child_indices for f in features], dtype=torch.long)
    head_indices = torch.tensor([f.head_indices for f in features], dtype=torch.long)

    mask_entail = torch.tensor([f.mask_entail for f in features], dtype=torch.long)
    mask_cont = torch.tensor([f.mask_cont for f in features], dtype=torch.long)
    num_dependencies = torch.tensor(
        [f.num_dependencies for f in features], dtype=torch.long
    )
    arcs = torch.tensor([f.arcs for f in features], dtype=torch.long)

    sentence_label = torch.tensor(
        [f.sentence_label for f in features], dtype=torch.long
    )

    dataset = TensorDataset(
        all_input_ids,
        input_attention_mask,
        child_indices,
        head_indices,
        mask_entail,
        mask_cont,
        num_dependencies,
        arcs,
        sentence_label,
    )

    return dataset

# CHANGES! My own version instead of the one in train_utils. This function does not annotate, but the rest should be the same.
def get_relevant_deps_and_context_from_ann(ann):
    dep_type = "enhancedDependencies"
    ignore_dep = [
        "punct",
        "ROOT",
        "root",
        "det",
        "case",
        "aux",
        "auxpass",
        "dep",
        "cop",
        "mark",
    ]

    deps = []
    tokens = ann["sentences"][0]["tokens"]
    pos = [tok["pos"] for tok in tokens]
    tokens = [tok["word"] for tok in tokens]

    for dep_dict in ann["sentences"][0][dep_type]:

        if dep_dict["dep"] not in ignore_dep:
            dep_temp = {"dep": dep_dict["dep"]}
            dep_temp.update(
                {
                    "child": dep_dict["dependentGloss"],
                    "child_idx": dep_dict["dependent"],
                }
            )
            dep_temp.update(
                {"head": dep_dict["governorGloss"], "head_idx": dep_dict["governor"]}
            )
            deps.append(dep_temp)
    return tokens, pos, deps

if __name__ == "__main__":
    args = parser.parse_args()

    model_dir = args.model_dir
    model_type = args.model_type

    # CHANGES! Using Stanza's CoreNLPClient instead of pycorenlp
    nlp = CoreNLPClient(
        output_format="json",
        properties={
            "parse.model": "stanford-parser-full-2015-12-09.zip",
            "ssplit.isOneSentence": True,
        },
        endpoint="http://localhost:10086",
        be_quiet=True,
    )

    config_class, model_class, tokenizer_class = MODEL_CLASSES[model_type]
    tokenizer = tokenizer_class.from_pretrained(model_dir)
    model = model_class.from_pretrained(model_dir)
    device = torch.device("cuda", args.gpu_device)
    args.device = device

    model.to(device)
    model.eval()

    input_file = open(args.input_file)
    input_data = [line.strip() for line in input_file.readlines()]

    for idx in range(0, len(input_data), 3):
        article_text = input_data[idx]
        summary = input_data[idx + 1]
        print(article_text)
        print(summary)
        evaluate_summary(article_text, summary, tokenizer, model, nlp, args)

Best, David

tagoyal commented 3 years ago

Hi David,

Here is the result I get from running on the same example at my end.

 recent reports have linked some france - based players with returns to wales . ` ` i ' ve always felt - and this is with my rugby hat on now ; this is not region or wru - i ' d rather spend that money on keeping players in wales , ' ' said davies . the wru provides # 2m to the fund and # 1 . 3 m comes from the regions . former wales and british and irish lions fly - half davies became wru chairman on tuesday 21 october , succeeding deposed david pickering following governing body elections . he is now serving a notice period to leave his role as newport gwent dragons chief executive after being voted on to the wru board in september . davies was among the leading figures among dragons , ospreys , scarlets and cardiff blues officials who were embroiled in a protracted dispute with the wru that ended in a # 60m deal in august this year . in the wake of that deal being done , davies said the # 3 . 3 m should be spent on ensuring current wales - based stars remain there . in recent weeks , racing metro flanker dan lydiate was linked with returning to wales . likewise the paris club ' s scrum - half mike phillips and centre jamie roberts were also touted for possible returns . wales coach warren gatland has said : ` ` we have n ' t instigated contact with the players . ` ` but we are aware that one or two of them are keen to return to wales sooner rather than later . ' ' speaking to scrum v on bbc radio wales , davies re - iterated his stance , saying keeping players such as scarlets full - back liam williams and ospreys flanker justin tipuric in wales should take precedence . ` ` it ' s obviously a limited amount of money - lsb - available - rsb - . the union are contributing 60 % of that contract and the regions are putting # 1 . 3 m in . ` ` so it ' s a total pot of just over # 3m and if you look at the sorts of salaries that the . . . guys . . . have been tempted to go overseas for - lsb - are - rsb - significant amounts of money . ` ` so if we were to bring the players back , we ' d probably get five or six players . `
new welsh rugby union chairman gareth davies believes a joint # 3 . 3 m wru - regions fund should be used to retain home - based talent such as liam williams , not bring back exiled stars .
[CLS] amod [SEP] new [SEP] gareth [SEP]
0   5
gold:   0
pred:   0
0.7826827   0.21731725

[CLS] amod [SEP] welsh [SEP] gareth [SEP]
1   5
gold:   0
pred:   0
0.74857944  0.2514206

[CLS] amod [SEP] rugby [SEP] gareth [SEP]
2   5
gold:   0
pred:   0
0.58793193  0.4120681

[CLS] compound [SEP] union [SEP] gareth [SEP]
3   5
gold:   0
pred:   0
0.696159    0.303841

[CLS] compound [SEP] chairman [SEP] gareth [SEP]
4   5
gold:   0
pred:   0
0.6965902   0.30340976

[CLS] nsubj [SEP] gareth [SEP] davies [SEP]
5   6
gold:   0
pred:   0
0.90575594  0.094244026

[CLS] ccomp [SEP] believes [SEP] davies [SEP]
7   6
gold:   0
pred:   1
0.3829971   0.61700296

[CLS] amod [SEP] joint [SEP] fund [SEP]
9   18
gold:   0
pred:   1
0.12907948  0.87092054

[CLS] compound [SEP] # [SEP] fund [SEP]
10  18
gold:   0
pred:   1
0.07523679  0.92476314

[CLS] nummod [SEP] 3. 3 [SEP] fund [SEP]
13  18
gold:   0
pred:   1
0.07547788  0.9245221

[CLS] compound [SEP] m [SEP] fund [SEP]
14  18
gold:   0
pred:   1
0.0737554   0.92624456

[CLS] compound [SEP] wru - regions [SEP] fund [SEP]
17  18
gold:   0
pred:   1
0.06213749  0.9378625

[CLS] nsubjpass [SEP] fund [SEP] used [SEP]
18  21
gold:   0
pred:   1
0.036747348 0.96325266

[CLS] nsubj : xsubj [SEP] fund [SEP] retain [SEP]
18  23
gold:   0
pred:   1
0.038346358 0.9616537

[CLS] ccomp [SEP] used [SEP] believes [SEP]
21  7
gold:   0
pred:   1
0.105117984 0.894882

[CLS] xcomp [SEP] retain [SEP] used [SEP]
23  21
gold:   0
pred:   1
0.057261348 0.94273865

[CLS] amod [SEP] home - based [SEP] talent [SEP]
26  27
gold:   0
pred:   1
0.19978425  0.8002157

[CLS] dobj [SEP] talent [SEP] retain [SEP]
27  23
gold:   0
pred:   1
0.12622716  0.87377286

[CLS] mwe [SEP] as [SEP] such [SEP]
29  28
gold:   0
pred:   1
0.17707714  0.8229228

[CLS] compound [SEP] liam [SEP] williams [SEP]
30  31
gold:   0
pred:   1
0.21542485  0.78457516

sent gold:  -1
sent_pred:  0

The parsing is definitely different: the first arc on my end is '[CLS] amod [SEP] new [SEP] gareth [SEP]', whereas it is '[CLS] compound [SEP] new [SEP] welsh [SEP]' on your end. My expectation is that this would change the predictions significantly.

Another thing that may differ (I don't see corresponding code for it above) is how the dependency-level labels are converted to token-level ones. This is something I will share in the code release.
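
Roughly, it is something of this shape (a simplified sketch, not the exact code): flag every summary token that sits at either end of an arc judged non-factual.

# Simplified sketch: project arc-level predictions onto token-level labels.
def arcs_to_token_labels(num_tokens, deps, arc_preds):
    # deps: list of {"head_idx": int, "child_idx": int} with 0-based token indices
    # arc_preds: 0/1 predictions aligned with deps (0 = non-factual)
    labels = [1] * num_tokens  # start with every token assumed factual
    for dep, pred in zip(deps, arc_preds):
        if pred == 0:  # arc judged non-factual: flag both endpoints
            labels[dep["head_idx"]] = 0
            labels[dep["child_idx"]] = 0
    return labels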

Thanks!

Tanya

zerocstaker commented 3 years ago

Hi,

After digging around, I realized I was using the wrong CoreNLP version. Now I am getting the same output. Thanks!
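
In case it helps others: the stanza client picks up the CoreNLP distribution from the CORENLP_HOME environment variable, so pointing it at the matching release pins the version (the path below is a placeholder for wherever you unpacked it):

# Point stanza at a specific CoreNLP install before creating the client.
import os
os.environ["CORENLP_HOME"] = "/path/to/stanford-corenlp-full-2015-12-09"

from stanza.server import CoreNLPClient
nlp = CoreNLPClient(output_format="json", be_quiet=True)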