haonan-li / MultiSpanQA

MultiSpanQA: A Dataset for Multi-Span Question Answering

How to get multi-span squad format data? #3

Closed liulizuel closed 1 year ago

liulizuel commented 1 year ago

I have read your single-span SQuAD-format code, but it doesn't generate what I want. I wrote a function like this:

def get_squad_format_data(multi_span_data):
    squad_format_train = []

    for data in multi_span_data:
        obj = {}
        obj['question'] = " ".join(data['question'])
        obj['id'] = data['id']
        obj['is_impossible'] = False
        answers = []
        if "label" in data.keys():
            assert len(data['context']) == len(data['label'])
            temp_ans = ''
            temp_index = -1
            for index in range(len(data['context'])):
                if data['label'][index] == 'B':
                    temp_ans = data['context'][index]
                    temp_index = index
                elif data['label'][index] == 'I':
                    temp_ans += " " + data['context'][index]
                else:
                    if temp_ans != '':
                        answers.append({"text": temp_ans, "answer_start": temp_index})
                        temp_ans = ''
                        temp_index = -1

            if temp_ans != '':
                answers.append({"text": temp_ans, "answer_start": temp_index})

        obj['answers'] = answers
        context = " ".join(data['context'])
        print(context)
        squad_format_train.append({"context":context, "qas":[obj], })
    return squad_format_train

and I got multi-span SQuAD-format data like this:

{
  "data": [
    {
      "paragraphs": [
        {
          "context": "In 1981 , a remake by British artists Dave Stewart and Barbara Gaskin was a UK number one hit single for four weeks and was also a major hit in Au      stria ( # 3 ) , Germany ( # 3 ) , the Netherlands ( # 20 ) , New Zealand ( # 1 ) , South Africa ( # 3 ) and Switzerland ( # 6 ) . The track reached # 72 in the US . Thi      s was the first version of the song to reach # 1 in the UK . The video for the Stewart / Gaskin version contained a cameo by Thomas Dolby as Johnny , Judy being played       by Gaskin in a blond wig .",
          "qas": [
            {
              "answers": [
                {
                  "answer_start": 8,
                  "text": "Dave Stewart"
                },
                {
                  "answer_start": 11,
                  "text": "Barbara Gaskin"
                }
              ],
              "id": "zbij8e4070dp55kvnbgm",
              "is_impossible": false,
              "question": "who sang it's my party and i'll cry if i want to in the eighties"
            }

But when I start training, I get these errors:

Could not find answer: 'photography began on' vs. 'Manitoba'
Could not find answer: 'case ä ) is a character that represents either a' vs. 'a letter from several extended Latin alphabets'
Could not find answer: '( NAFTA' vs. 'Canada'
Could not find answer: 'of the NBA , who served' vs. 'Wilt Chamberlain'
Could not find answer: 'stage of World' vs. 'Hiroshima'
Could not find answer: 'British colonies , which became' vs. 'the executive branch'
Could not find answer: 'by American country music' vs. 'John and T.J. Osborne'
Could not find answer: 'younger years , Alan Turner' vs. 'Stephen Marchant'
Could not find answer: 'Gagan Narang .' vs. 'P.V. Sindhu'
Could not find answer: 'the 2017 -- 18 FA Cup' vs. 'Manchester United'
Could not find answer: 'exchanged for' vs. 'casino cage'
Could not find answer: 'Me , I 'm Falling' vs. 'Don Robertson'
Could not find answer: 'a pyramid' vs. 'esteem'
Could not find answer: 'doctors and employees' vs. 'amyloidosis'
Could not find answer: 'The Dock of' vs. 'Otis Redding'
Could not find answer: 'song written' vs. 'Vanity 6'
Could not find answer: 'Again '' is a' vs. 'Lady Gaga'
Could not find answer: 'paise coins' vs. '1966'
Could not find answer: 'iceberg during its maiden voyage from Southampton' vs. 'some of the wealthiest people in the world'
Could not find answer: 'called an `` IRA Plus '' , the idea' vs. 'Senator Bob Packwood of Oregon'
Could not find answer: 'government debt is the' vs. 'the general public'
Could not find answer: 'woman , played' vs. 'Janet Gaynor'
Could not find answer: 'the Aubrey Holes' vs. 'Preseli Hills'
Could not find answer: 'a Bowl of Cherries' vs. 'Ray Henderson'
Could not find answer: 'feel sometimes' vs. 'Drake'
Could not find answer: 'Jungle Movie , has' vs. 'Toran Caudell'
Could not find answer: 'up to an object' vs. 'Kelsey Grammer'
Could not find answer: 'animated Pixar' vs. 'Miguel'
Could not find answer: 'level programming language is a programming' vs. 'use natural language elements'
Could not find answer: 'replacements' vs. 'milk'
Could not find answer: 'nerve cell , is an electrically' vs. 'the central nervous system'
Could not find answer: 'Platinum - certified' vs. 'Joe Cocker'
Could not find answer: 'the NBC soap opera' vs. 'Peter Reckell'
Could not find answer: 'U.S. House elections' vs. 'Andrew Gillum'
Could not find answer: 'the Same '' is a' vs. 'Camila Cabello'
Could not find answer: 'physicians Axel' vs. '1928 to 1932'
Could not find answer: 'improvisational' vs. 'Peter Grosz'
Could not find answer: 'the Brazilian' vs. 'Chad & Jeremy'
Could not find answer: 'and third seasons' vs. 'The Wellingtons'
Could not find answer: 'Comin ' to Town '' is' vs. 'John Frederick Coots'
Could not find answer: 'apparently still alive as' vs. 'deposition by Odoacer'
Could not find answer: 'to enforce integration and' vs. '101st Airborne Division'
Could not find answer: 'adopted a' vs. 'Sarangi'
Could not find answer: 'nuclear deal framework was a preliminary' vs. 'Islamic Republic of Iran'
Could not find answer: 'Amendment I ) to the United States' vs. 'free exercise of religion'
Could not find answer: 'two lines of this rhyme can be found in The Little Mother' vs. 'The Little Mother Goose , published in the US in 1912'
Could not find answer: 'due to be a 2005' vs. 'David Hasselhoff'
Could not find answer: 'Chosen Land ;' vs. 'Julián Felipe'
Could not find answer: 'Entertainment ,' vs. 'Alyson Court'
Could not find answer: ', allowing expeditionary' vs. 'Puerto Rico'
Could not find answer: 'accidentally caught an 80 kg ( 180 lb ) , 93 cm ( 37 in ) long' vs. 'shallow , coastal waters with lush seagrass beds'
Could not find answer: 'gained them a cult following . During' vs. 'firebombing of Dresden , Germany'
Could not find answer: ''' is a song' vs. 'Zac Barnett'
Could not find answer: 'used in Christendom' vs. 'clergy'
Could not find answer: 'Cinematographer Roger Deakins stated' vs. 'Canton , Mississippi'
Could not find answer: 'Washington Movement (' vs. 'A. Philip Randolph'
Could not find answer: 'used many simple' vs. 'surveying'
Could not find answer: 'money for equipment and salaries . In' vs. 'The National Institutes of Health'
Could not find answer: 'the television broadcast' vs. 'the Green Bay Packers'
Could not find answer: 'Brothers ,' vs. 'a father'
Could not find answer: 'summer of 1959' vs. 'Rizzo'
Could not find answer: 'U.S. naval power' vs. 'Puerto Rico'
Could not find answer: 'Cricket World' vs. 'England'
Could not find answer: 'their departed female singer , Signe' vs. 'Alice 's Adventures in Wonderland'
Could not find answer: 'Morty is an American' vs. 'Justin Roiland'
Could not find answer: 'the Abbey Victoria' vs. 'Mary Jane Girls'
Could not find answer: 'less' vs. '1978'
Could not find answer: 'finance , Net' vs. 'Sales'
Could not find answer: '( père' vs. 'France'
Could not find answer: 'largest airline alliance' vs. 'Dallas / Fort Worth'
Could not find answer: '`` Friday' vs. 'Hardwick'
Could not find answer: 'Jewell ( born Richard' vs. 'police officer'
Could not find answer: 'Fault ,' vs. 'New'
Could not find answer: 'American late' vs. 'Tom Snyder'
Could not find answer: 'Made Me Do ''' vs. 'Taylor Swift'
Could not find answer: 'and Lillian' vs. 'Larry King'
Could not find answer: '1993 , the city of Fort Myers in Lee County , Florida' vs. 'Fort Myers in Lee County , Florida , United States'
Could not find answer: 'the television broadcast' vs. 'the Green Bay Packers'
Could not find answer: 'baseball bats ,' vs. 'Pat Fraley'
Could not find answer: 'District . It is the 69th' vs. 'Vornado Realty Trust'
Could not find answer: 'rights and entitlements' vs. 'nineteenth century'
Could not find answer: 'beneficial to humans' vs. 'December'
Could not find answer: 'same name by Stephen King . The screenplay' vs. 'Riverdale neighborhood of Toronto'
Could not find answer: ', becoming' vs. 'Will Smith'
Could not find answer: 'du monde de la' vs. 'United States'
Could not find answer: 'intermediate' vs. '1969'
Could not find answer: 'The mountains' vs. 'Whernside'
Could not find answer: 'First , the Last' vs. 'Barry White'
Could not find answer: 'off award created' vs. 'Diego Maradona'
Could not find answer: 'on a Jet Plane' vs. 'John Denver'
Could not find answer: 'Joshua Rudoy , Lainie Kazan , and Kevin' vs. 'Cascade Range of Washington state'
Could not find answer: 'Rollin ' Stone '' is' vs. 'Norman Whitfield'
Could not find answer: 'magnification , invade humans ,' vs. 'Girolamo Fracastoro'
Could not find answer: 'Gonna Take a Miracle' vs. 'Teddy Randazzo'
Could not find answer: 'Mexico' vs. 'Olivia'
Could not find answer: 'musical tragedy film' vs. 'Natalie Wood'
Could not find answer: 'is the `` Tactical' vs. 'Erebonian Empire'
Could not find answer: 'song by the' vs. 'Phil Lesh'
Could not find answer: 'reading the script' vs. 'Iceland'
Could not find answer: 'Geegland hosted' vs. 'George'
Could not find answer: 'du monde de la' vs. 'United States'
Could not find answer: 'is the seventh self' vs. 'Daniel Johnston'
Could not find answer: 'took place on location , in' vs. 'Buckhead area of Atlanta'
Could not find answer: 'added Middle Eastern' vs. 'the Beach Boys'
Could not find answer: 'April 23' vs. '1971'
Could not find answer: 'Port ( born March 4 , 1985' vs. 'television personality'
Could not find answer: 'Barry Fitzgerald' vs. 'County Mayo'
Could not find answer: 'Ancient' vs. 'Babylon'
Could not find answer: 'United States have been' vs. 'Abraham Lincoln'
Could not find answer: 'member' vs. 'Denmark'
Could not find answer: 'Feel Like Christmas' vs. 'Gwen Stefani'
Could not find answer: 'Picture Soundtrack' vs. 'Elton John'
Could not find answer: 'of the circulatory' vs. 'arteries'
Could not find answer: 'fictional character' vs. 'Idina Menzel'
Could not find answer: 'two were marketed as the' vs. 'New England Patriots'
Could not find answer: 'statements drafted' vs. 'states ' rights'
Could not find answer: 'fictional character' vs. 'Blair Redford'
Could not find answer: 'thin and bony' vs. 'Diana Rigg'
Could not find answer: 'history can be largely' vs. 'Thomas Carlyle'
Could not find answer: 'Hollywood Walk of' vs. 'terrazzo'
Could not find answer: 'learn the idea had previously been used . Harry' vs. 'Calabasas , Los Angeles County , California'
Could not find answer: 'sequel to The Curse' vs. 'Palos Verdes'
Could not find answer: 'propaganda tool and' vs. 'The Plain'
Could not find answer: 'to appear' vs. 'Kid Cudi'
Could not find answer: 'rift running' vs. 'basalt'
Could not find answer: 'role . She said that Jane' vs. 'Hancock Park , Los Angeles'
Could not find answer: ''s Watching' vs. 'Rockwell'
Could not find answer: 'Jackson ) is an American feature film series based on the' vs. 'Percy Jackson & the Olympians : The Lightning Thief'
Could not find answer: 'Eastern Division or AFC' vs. 'the Buffalo Bills'
Could not find answer: 'Take It with You is a' vs. 'George S. Kaufman'
Could not find answer: 'Your Hand '' is' vs. 'John Lennon'
Could not find answer: 'romantic comedy' vs. 'Johnny Depp'
Could not find answer: 'Recording Academy ( formerly the National Academy' vs. 'National Academy of Recording Arts and Sciences'
Could not find answer: '24 July' vs. 'Russia'
Could not find answer: 'Santa Claus' vs. 'Dasher'
Could not find answer: 'Manuel de Salcedo' vs. 'James Monroe'
Could not find answer: 'Thorne , Kate Dickie' vs. 'Northern Ireland'
Could not find answer: 'fictional superhero' vs. 'Stan Lee'
Could not find answer: 'afflicted' vs. 'Isaiah 58'
Could not find answer: 'subtitled ``' vs. 'Bob Wells'
Could not find answer: 'as M *' vs. '1970'
Could not find answer: 'recorded by Russian' vs. 'Jordan K. Johnson'
Could not find answer: 'Olympic Winter' vs. '2006'
Could not find answer: 'track on Pink Floyd' vs. 'David Gilmour'
Could not find answer: 'recording artist' vs. 'Toni Braxton'
Could not find answer: 'is a high' vs. 'Avignon'
Could not find answer: 'in celebration . It is culturally accepted' vs. 'random death and injury from stray bullets'
Could not find answer: 'the Timon' vs. 'Shenzi'
Could not find answer: 'an American football' vs. 'Braxton Miller'
Could not find answer: 'next Governor' vs. 'Mike DeWine'
Could not find answer: 'abdicated and' vs. 'Germany'
Could not find answer: 'abbreviated as WDW ) is an American' vs. 'Jonah Marais Roth Frantzich'
Could not find answer: 'several different climate' vs. 'equatorial climate'

I don't know whether the answer text and answer start generated by my code are correct. Could you help me, please?

haonan-li commented 1 year ago

Hi,

There are two issues:

  1. In SQuAD, the start position of an answer is a character-level offset into the context, but it seems you used the token-level position.
  2. In your generated context there are some unexpected long runs of spaces (e.g., "Au stria"); I don't think these should be there.
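
To illustrate the first point, here is a minimal sketch of a token-to-character conversion. It is not the repository's own conversion code. The function name `tokens_to_squad` is hypothetical, and it assumes each example has `id`, `question`, and `context` (lists of tokens) plus an optional `label` list of `B`/`I`/`O` tags, as in the function posted above. It also assumes the context is rebuilt by joining tokens with single spaces, so each token's character offset can be accumulated as it goes.

```python
def tokens_to_squad(example):
    """Convert one token/label example to a SQuAD-style paragraph dict.

    Assumption: `answer_start` must be a character offset into the context
    string produced by joining the tokens with single spaces.
    """
    tokens = example["context"]
    labels = example.get("label", ["O"] * len(tokens))
    assert len(tokens) == len(labels)

    # Character offset of each token in " ".join(tokens).
    offsets, pos = [], 0
    for tok in tokens:
        offsets.append(pos)
        pos += len(tok) + 1  # +1 for the separating space

    answers, span = [], []  # `span` holds the token indices of the open span
    # The extra "O" is a sentinel that closes a span ending at the last token.
    for i, tag in enumerate(list(labels) + ["O"]):
        if tag == "I" and span:
            span.append(i)
            continue
        if span:  # any other tag ends the open span, including a new "B"
            answers.append({
                "text": " ".join(tokens[j] for j in span),
                "answer_start": offsets[span[0]],
            })
            span = []
        if tag == "B":
            span = [i]

    qa = {
        "id": example["id"],
        "question": " ".join(example["question"]),
        "is_impossible": False,
        "answers": answers,
    }
    return {"context": " ".join(tokens), "qas": [qa]}


# Usage with a shortened version of the context shown above.
example = {
    "id": "demo",
    "question": "who sang it".split(),
    "context": "In 1981 , a remake by British artists Dave Stewart and Barbara Gaskin".split(),
    "label": ["O"] * 8 + ["B", "I", "O", "B", "I"],
}
paragraph = tokens_to_squad(example)
for ans in paragraph["qas"][0]["answers"]:
    start = ans["answer_start"]
    # Sanity check: slicing the context at the offset must give back the text.
    assert paragraph["context"][start:start + len(ans["text"])] == ans["text"]
```

The slice check at the end is a cheap way to confirm the offsets before training: if it fails for any answer, the offsets are not character-level positions in that context string.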

I suggest modifying our code for your purpose. Anyway, I hope this helps.

liulizuel commented 1 year ago

Thanks for your reply. The odd spacing in the generated text came from copying it out of a terminal console, which is what confused you. Anyway, you reminded me of the "character-level position", and I think that is the problem. Thank you very much again~