google-research / text-to-text-transfer-transformer

Code for the paper "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer"
https://arxiv.org/abs/1910.10683
Apache License 2.0

SQuAD 1.1 numbers #112

Closed danyaljj closed 4 years ago

danyaljj commented 4 years ago

TL;DR: I've been trying to train a model for SQuAD 1.1, but I am not able to get numbers comparable to those reported in the paper.

Here is my code, if you want to see more details: https://gist.github.com/danyaljj/7c5de89460a116af857b340a2467da2f

In particular, I used the following sequence lengths: sequence_length={"inputs": 512, "targets": 97}, and fine-tuned for 120000 steps.

My encoding is as follows:

  • input: question paragraph
  • output: answer-substring

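For reference, a minimal sketch of the text preprocessor that produces this encoding from {"question": ..., "answer": ...} features (the feature names follow my gist; the helper name itself is just illustrative):

```python
import tensorflow.compat.v1 as tf

def to_inputs_and_targets(ex):
  # Map the {"question", "answer"} features to T5's {"inputs", "targets"}.
  # The "question" field already has the paragraph appended, as described
  # above, so only the "question: " prefix is added here.
  return {
      "inputs": tf.strings.join(["question: ", ex["question"]]),
      "targets": ex["answer"],
  }

# Used as a Task text_preprocessor, e.g.:
#   dataset.map(to_inputs_and_targets,
#               num_parallel_calls=tf.data.experimental.AUTOTUNE)
```
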
The numbers that I get from this experiment are as follows:

my_task_1000000_predictions
{'exact_match': 52.57332071901608, 'f1': 70.03263477449936}
---------
my_task_1005100_predictions
{'exact_match': 62.12866603595081, 'f1': 77.83681787205431}
---------
my_task_1010200_predictions
{'exact_match': 62.05298013245033, 'f1': 77.71365975153053}
---------
my_task_1015300_predictions
{'exact_match': 62.64900662251656, 'f1': 78.0070055831365}
---------
my_task_1020400_predictions
{'exact_match': 62.72469252601703, 'f1': 77.95798979912198}
---------
my_task_1025500_predictions
{'exact_match': 62.989593188268685, 'f1': 78.00095497152454}
---------
my_task_1030600_predictions
{'exact_match': 62.87606433301798, 'f1': 78.034643304285}
---------
my_task_1035700_predictions
{'exact_match': 62.81929990539262, 'f1': 77.96424887837792}
---------
my_task_1040800_predictions
{'exact_match': 62.86660359508041, 'f1': 77.80454277598548}
---------
my_task_1045900_predictions
{'exact_match': 62.50709555345317, 'f1': 77.80790423925912}
---------
my_task_1051000_predictions
{'exact_match': 62.39356669820246, 'f1': 77.78279734457561}
---------
my_task_1056100_predictions
{'exact_match': 62.39356669820246, 'f1': 77.77221921217445}
---------
my_task_1061200_predictions
{'exact_match': 62.677388836329236, 'f1': 77.77069628452828}
---------
my_task_1066300_predictions
{'exact_match': 62.3368022705771, 'f1': 77.69398497118814}
---------
my_task_1071400_predictions
{'exact_match': 62.090823084200565, 'f1': 77.46260008886401}
---------
my_task_1076500_predictions
{'exact_match': 62.18543046357616, 'f1': 77.50450421781471}
---------
my_task_1081600_predictions
{'exact_match': 61.929990539262064, 'f1': 77.38196759038684}
---------
my_task_1086700_predictions
{'exact_match': 62.005676442762535, 'f1': 77.41855694816749}
---------
my_task_1091800_predictions
{'exact_match': 62.02459791863765, 'f1': 77.48568818610663}
---------
my_task_1096900_predictions
{'exact_match': 61.91106906338695, 'f1': 77.28149551239633}
---------
my_task_1102000_predictions
{'exact_match': 62.015137180700094, 'f1': 77.44828796460594}
---------
my_task_1107100_predictions
{'exact_match': 61.98675496688742, 'f1': 77.4350255728719}
---------
my_task_1112200_predictions
{'exact_match': 61.65562913907285, 'f1': 77.26328109990796}
---------
my_task_1117300_predictions
{'exact_match': 61.97729422894986, 'f1': 77.38007709934928}
---------
my_task_1120000_predictions
{'exact_match': 61.854304635761586, 'f1': 77.4255854814208}

The best numbers are about 10% lower than those reported in your paper (Table 14):

[Screenshot: Table 14 from the paper, https://user-images.githubusercontent.com/2441454/75589295-3f4a0600-5a2f-11ea-97c5-b6a71a273435.png]

adarob commented 4 years ago

Which version of t5 are you using? We just fixed a bug in transformer decoding (https://github.com/tensorflow/mesh/pull/55) that could cause this. I still need to update the pinned version, however.

danyaljj commented 4 years ago

I ran the evaluation two days ago, so probably an older version.

adarob commented 4 years ago

Can you try rerunning your eval after doing pip install -U mesh-tensorflow==0.1.11?

craffel commented 4 years ago

Hey Daniel, apart from re-running with the fixed decoding in Mesh TF, here are some questions/comments about your code:

  1. What is in the train.tsv and dev.tsv files? Specifically, do your questions also have the context somewhere? It looks like you are just grabbing the question from the TSV file; I don't see where you are getting the context for the question. https://gist.github.com/danyaljj/7c5de89460a116af857b340a2467da2f#file-t5-squad-training-py-L86
  2. Any reason not to just use the SQuAD task? https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/tasks.py#L283
  3. The hparams you are using don't quite match ours, e.g. we used a learning rate of 0.001. You may also need to save checkpoints more often than every 5k steps, depending on the batch size (see the sketch below).
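
To be concrete, here is roughly how those settings map onto the t5.models.MtfModel API (a sketch, not exactly what we ran; the bucket path, TPU settings, and batch size are placeholders):

```python
import t5

model = t5.models.MtfModel(
    model_dir="gs://your-bucket/squad_model",  # placeholder
    tpu="grpc://your-tpu-address",             # placeholder
    tpu_topology="v3-8",                       # placeholder
    batch_size=128,
    sequence_length={"inputs": 512, "targets": 97},
    learning_rate_schedule=0.001,  # constant LR we used for fine-tuning
    save_checkpoints_steps=1000,   # save more often than every 5k steps
)
model.finetune(
    mixture_or_task_name="my_task",  # your registered Task name
    pretrained_model_dir="gs://t5-data/pretrained_models/base",
    finetune_steps=120000,
)
```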

I can look into it more if the above discussion doesn't resolve the discrepancy.

danyaljj commented 4 years ago

@adarob I tried the evaluation with mesh-tensorflow==0.1.12 and the numbers are slightly (1-2%) higher, but still not as good as the numbers reported in your paper.

For example, before the fix I got:

my_task_1000000_predictions
{'exact_match': 52.57332071901608, 'f1': 70.03263477449936}

But now I get:

my_task_1000000_predictions
{'exact_match': 55.8561967833491, 'f1': 71.82246865813674}

Or, I used to get:

my_task_1005100_predictions
{'exact_match': 62.12866603595081, 'f1': 77.83681787205431}

Now I get:

my_task_1005100_predictions
{'exact_match': 64.76821192052981, 'f1': 79.10829393799928}

@craffel

  1. I have my train/dev data in pre-processed TSV format; here is a subset: https://gist.github.com/danyaljj/f51388cc4305735474f7c87b6fd5cf35 The data is read through dataset_fn into a {"question": ..., "answer": ...} format, although the question field also contains the paragraph (see the sketch after this list). Probably not the most efficient design.

  2. I'm not using your pre-defined tasks since I'm planning to modify the input/output data slightly in future experiments.

  3. Got it, will try a smaller learning rate.
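
For reference, my dataset_fn looks roughly like this (a sketch modeled on the dataset_fn example in the t5 README; the TSV paths are placeholders):

```python
import functools
import tensorflow.compat.v1 as tf

squad_tsv_path = {
    "train": "/path/to/train.tsv",      # placeholder
    "validation": "/path/to/dev.tsv",   # placeholder
}

def dataset_fn(split, shuffle_files=False):
  # Each TSV row is "question<TAB>answer"; the question field already
  # includes the paragraph.
  ds = tf.data.TextLineDataset(squad_tsv_path[split])
  ds = ds.map(
      functools.partial(tf.io.decode_csv, record_defaults=["", ""],
                        field_delim="\t", use_quote_delim=False),
      num_parallel_calls=tf.data.experimental.AUTOTUNE)
  ds = ds.map(lambda *ex: dict(zip(["question", "answer"], ex)))
  return ds
```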

danyaljj commented 4 years ago

Related to Q#1, here are several examples printed by the program just before the main processing:

{'inputs_plaintext': b'question: when did luther write a german mass? (martin_luther) in response to demands for a german liturgy, luther wrote a german mass, which he published in early 1526. he did not intend it as a replacement for his 1523 adaptation of the latin mass but as an alternative for the "simple people", a "public stimulation for people to believe and become christians." luther based his order on the catholic service but omitted "everything that smacks of sacrifice"; and the mass became a celebration where everyone received the wine as well as the bread. he retained the elevation of the host and chalice, while trappings such as the mass vestments, altar, and candles were made optional, allowing freedom of ceremony. some reformers, including followers of huldrych zwingli, considered luthers service too papistic; and modern scholars note the conservatism of his alternative to the catholic mass. luther\'s service, however, included congregational singing of hymns and psalms in german, as well as of parts of the liturgy, including luthers unison setting of the creed. to reach the simple people and the young, luther incorporated religious instruction into the weekday services in the form of the catechism. he also provided simplified versions of the baptism and marriage services.', 'inputs': array([  822,    10,   116,   410,     3,    40,    76,   189,    49,
        1431,     3,     9, 13692,  3294,    58,    41,  1635,    17,
          77,   834,    40,    76,   189,    49,    61,    16,  1773,
          12,  7328,    21,     3,     9, 13692,  4996,   450,   122,
          63,     6,     3,    40,    76,   189,    49,  2832,     3,
           9, 13692,  3294,     6,    84,     3,    88,  1790,    16,
         778,   627,  2688,     5,     3,    88,   410,    59,  8286,
          34,    38,     3,     9,  3709,    21,   112,   627,  2773,
       14340,    13,     8,     3, 14098,  3294,    68,    38,    46,
        2433,    21,     8,    96,     7, 10296,    15,   151,  1686,
           3,     9,    96, 15727, 22935,    21,   151,    12,   857,
          11,   582,     3, 15294,  7137,   535,     3,    40,    76,
         189,    49,     3,   390,   112,   455,    30,     8,  1712,
       26641,   313,    68,     3,    32, 16030,    96,    15,  8461,
        8052,    24,     3,     7, 20072,     7,    13, 10811,   121,
         117,    11,     8,  3294,  1632,     3,     9,  5216,   213,
         921,  1204,     8,  2013,    38,   168,    38,     8,  4109,
           5,     3,    88, 19346,     8, 16417,    13,     8,  2290,
          11,     3, 12654,   867,     6,   298,  9684,  2462,     7,
         224,    38,     8,  3294, 12646,  4128,     6, 23509,     6,
          11, 19490,   130,   263,  9042,     6,     3,  3232,  4333,
          13,  7252,     5,   128,  5139,   277,     6,   379, 10076,
          13,     3,   107,    83, 16502,   524,     3,   172,  3757,
        4707,     6,  1702,     3,    40,    76,   189,   277,   313,
         396,     3, 16281,  3040,   117,    11,   941, 15120,  2232,
           8, 12205, 27803,    13,   112,  2433,    12,     8,  1712,
       26641,  3294,     5,     3,    40,    76,   189,    49,    31,
           7,   313,     6,   983,     6,  1285, 17368,   138,  8782,
          13, 27770,     7,    11,     3,   102,     7,   138,    51,
           7,    16, 13692,     6,    38,   168,    38,    13,  1467,
          13,     8,  4996,   450,   122,    63,     6,   379,     3,
          40,    76,   189,   277,    73,    23,   739,  1898,    13,
           8,  3935,    15,    26,     5,    12,  1535,     8,   650,
         151,    11,     8,  1021,     6,     3,    40,    76,   189,
          49,     3, 10975,  4761,  8033,   139,     8,   471,  1135,
         364,    16,     8,   607,    13,     8,  9624,  1436,     7,
          51,     5,     3,    88,    92,   937, 24687,  5204,    13,
           8, 27843,    11,  5281,   364,     5,     1]), 'targets_plaintext': b'early 1526', 'targets': array([ 778,  627, 2688,    1])}
{'inputs_plaintext': b"question: the art deco style of glassware is represented by which artist? (victoria_and_albert_museum) the glass collection covers 4000 years of glass making, and has over 6000 items from africa, britain, europe, america and asia. the earliest glassware on display comes from ancient egypt and continues through the ancient roman, medieval, renaissance covering areas such as venetian glass and bohemian glass and more recent periods, including art nouveau glass by louis comfort tiffany and \xc3\x89mile gall\xc3\xa9, the art deco style is represented by several examples by ren\xc3\xa9 lalique. there are many examples of crystal chandeliers both english, displayed in the british galleries and foreign for example venetian (attributed to giuseppe briati) dated c1750 are in the collection. the stained glass collection is possibly the finest in the world, covering the medieval to modern periods, and covering europe as well as britain. several examples of english 16th-century heraldic glass is displayed in the british galleries. many well-known designers of stained glass are represented in the collection including, from the 19th century: dante gabriel rossetti, edward burne-jones and william morris. there is also an example of frank lloyd wright's work in the collection. 20th-century designers include harry clarke, john piper, patrick reyntiens, veronica whall and brian clarke.", 'inputs': array([  822,    10,     8,   768,    20,   509,   869,    13,  1905,
        3404,    19,  7283,    57,    84,  2377,    58,    41,  7287,
        3600,     9,   834,   232,   834,   138,  7041,   834, 25581,
          61,     8,  1905,  1232,  3792,   314,  2313,   203,    13,
        1905,   492,     6,    11,    65,   147,     3, 21987,  1173,
          45, 24040,     6,     3,   115, 10694,    77,     6,     3,
       28188,     6,     3, 23064,    11,     3, 15974,     5,     8,
           3, 16454,  1905,  3404,    30,  1831,   639,    45,  4913,
           3,    15,   122,    63,   102,    17,    11,  3256,   190,
           8,  4913,  3408,     6, 16493,     6,     3,  1536,     9,
         159,     7,   663,  6013,   844,   224,    38, 23082, 12572,
        1905,    11,  3005,  6015,    23,   152,  1905,    11,    72,
        1100,  8811,     6,   379,   768,  4825,  1905,    57, 16585,
         159,  2115,     3,    17,  5982,  6820,    11,  7983,  8770,
       12486,   154,     6,     8,   768,    20,   509,   869,    19,
        7283,    57,   633,  4062,    57,     3,  1536,   154,    50,
       11036,     5,   132,    33,   186,  4062,    13,  6884, 28003,
           7,   321, 22269,     6,  6099,    16,     8,     3,  2160,
          17,  1273, 18035,    11,  2959,    21,   677, 23082, 12572,
          41, 20923,    12,     3, 24930,     7,    15,  6811,     3,
        2160,   144,    23,    61,     3, 14134,     3,    75,  2517,
        1752,    33,    16,     8,  1232,     5,     8, 21815,  1905,
        1232,    19,  3673,     8,  6842,    16,     8,   296,     6,
        6013,     8, 16493,    12,   941,  8811,     6,    11,  6013,
           3, 28188,    38,   168,    38,     3,   115, 10694,    77,
           5,   633,  4062,    13, 22269,   898,   189,    18, 14006,
         160,   138,  4370,  1905,    19,  6099,    16,     8,     3,
        2160,    17,  1273, 18035,     5,   186,   168,    18,  5661,
        6553,    13, 21815,  1905,    33,  7283,    16,     8,  1232,
         379,     6,    45,     8,   957,   189,  2646,    10,     3,
          26,  1841,  7852, 14018,     3,  1859,     7, 10652,     6,
           3,    15,    26,  2239,  5958,    15,    18,  1927,  1496,
          11,    56,    23,   265,  8030,    52,   159,     5,   132,
          19,    92,    46,   677,    13,     3,    89,  6254,     3,
         195,    32,    63,    26,     3,   210,  3535,    31,     7,
         161,    16,     8,  1232,     5,   460,   189,    18, 14006,
        6553,   560,     3,  3272,   651,  6860,  1050,     6,     3,
       27341,  7119,    52,     6,  6234,  5206,     3,    60,    63,
           29, 15945,     7,     6,   548,  4554,     9,     3,   210, ...])}

craffel commented 4 years ago

The first thing I'd try would be to match the format of our task, i.e. question: Some question? context: Some context. I would be kind of surprised if T5 didn't manage to learn to do your variant of the task pretty quickly, though. For reference, the t5-small checkpoint should get a score of 76.3/85.0 on our variant of SQuAD without further fine-tuning. FWIW, we got the highest score (79.1/87.2) after fine-tuning for only 4000 steps with 131072 tokens per batch, so you shouldn't need to fine-tune for long.
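
One quick way to see the canonical format is to print an example from the predefined Task (a sketch; double-check the registered task name against t5/data/tasks.py):

```python
import t5
import tensorflow.compat.v1 as tf

tf.enable_eager_execution()

task = t5.data.TaskRegistry.get("squad_v010_allanswers")  # verify name
ds = task.get_dataset(
    sequence_length={"inputs": 512, "targets": 97}, split="validation")
for ex in ds.take(1):
  print(ex["inputs_plaintext"].numpy())  # b'question: ... context: ...'
```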

The other thing that might make a difference is that the fine-tuning runs reported in the final table of the paper were run with the gin param decoder/Unitransformer.loss_denominator = 116736.0. This was added to account for differences in the loss-function denominator due to different batch/TPU sizes in pre-training vs. fine-tuning. We ultimately concluded it didn't make any difference, but off the top of my head this may be one other difference between your script and what we ran, so you could try that too.
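
If you are launching from Python rather than through gin flags, binding it before the model is built should be equivalent (a sketch; the binding string is exactly the one quoted above):

```python
import gin

# Bind the fine-tuning loss denominator before constructing the model.
gin.parse_config("decoder/Unitransformer.loss_denominator = 116736.0")
```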

danyaljj commented 4 years ago

@craffel I did update the encoding to the format you suggested: question: Some question? context: some context.

Examples:

 'targets_plaintext': b'triumphing by a brave defence', 'targets': array([20020,    53,    57,     3,     9, 13414, 13613,     1])}
{'inputs_plaintext': b'question: whose portrait by fran\xc3\xa7ois clouet was included in the jones bequest of 1882?  context: (victoria_and_albert_museum) several french paintings entered the collection as part of the 260 paintings and miniatures (not all the works were french, for example carlo crivellis virgin and child) that formed part of the jones bequest of 1882 and as such are displayed in the galleries of continental art 1600\xe2\x80\x931800, including the portrait of fran\xc3\xa7ois, duc dalen\xc3\xa7on by fran\xc3\xa7ois clouet, gaspard dughet and works by fran\xc3\xa7ois boucher including his portrait of madame de pompadour dated 1758, jean fran\xc3\xa7ois de troy, jean-baptiste pater and their contemporaries.', 'inputs': array([  822,    10,   822,    10,     3,  2544,  7956,    57,     3,
        6296, 24065,   159,     3,  3903,    76,    15,    17,    47,
        1285,    16,     8,     3,  1927,  1496,    36, 10952,    13,
         507,  4613,    58,  2625,    10,    41,  7287,  3600,     9,
         834,   232,   834,   138,  7041,   834, 25581,    61,   633,
       20609,  9843,  5136,     8,  1232,    38,   294,    13,     8,
           3, 18365,  9843,    11, 20955,     7,    41,  2264,    66,
           8,   930,   130, 20609,     6,    21,   677,   443,    40,
          32,     3,    75,  5927,  7999,     7, 24556,    11,   861,
          61,    24,  5147,   294,    13,     8,     3,  1927,  1496,
          36, 10952,    13,   507,  4613,    11,    38,   224,    33,
        6099,    16,     8, 18035,    13, 23639,   768, 24046,   104,
        2606,  1206,     6,   379,     8,  7956,    13,     3,  6296,
       24065,   159,     6,     3,  4817,     3,  5437,    29, 14163,
          57,     3,  6296, 24065,   159,     3,  3903,    76,    15,
          17,     6,  1807,  1893,    26,   146,   122,    88,    17,
          11,   930,    57,     3,  6296, 24065,   159, 23462,    52,
         379,   112,  7956,    13, 11454,   265,    15,    20, 13092,
           9,    26,  1211,     3, 14134,  1003,  3449,     6,     3,
       26459,     3,  6296, 24065,   159,    20, 10968,    63,     6,
           3, 26459,    18,   115,  6789,   343,    15,  6234,    49,
          11,    70,   975, 13089,    52,  5414,     5,     1]), 'targets_plaintext': b"fran\xc3\xa7ois, duc d'alen\xc3\xa7on", 'targets': array([    3,  6296, 24065,   159,     6,     3,  4817,     3,    26,
          31,   138,    35, 14163,     1])}
{'inputs_plaintext': b'question: who invented the first nuclear reactor?  context: (university_of_chicago) notable faculty in physics have included the speed of light calculator a. a. michelson, elementary charge calculator robert a. millikan, discoverer of the compton effect arthur h. compton, the creator of the first nuclear reactor enrico fermi, "the father of the hydrogen bomb" edward teller, "one of the most brilliant and productive experimental physicists of the twentieth century" luis walter alvarez, murray gell-mann who introduced the quark, second female nobel laureate maria goeppert-mayer, the youngest american winner of the nobel prize tsung-dao lee, and astrophysicist subrahmanyan chandrasekhar.', 'inputs': array([  822,    10,   822,    10,   113, 20897,     8,   166,  6414,
       24715,    58,  2625,    10,    41,  7846,   485,   834,   858,
         834,  1436,   658,   839,    61, 14538,  6040,    16,     3,
       11599,    43,  1285,     8,  1634,    13,   659,  9019,     3,
           9,     5,     3,     9,     5,  2278,  3573,   106,     6,
       15468,  1567,  9019,     3,  5840,    49,    17,     3,     9,
           5,  3293,    23,  3304,     6,  2928,    49,    13,     8,
       20248,   106,  1504,   768, 10666,     3,   107,     5, 20248,
         106,     6,     8,  9931,    13,     8,   166,  6414, 24715,
           3,    35,  2234,    32, 10881,    23,     6,    96,   532,
        2353,    13,     8, 20913,  6417,   121,     3,    15,    26,
        2239,     3, 13069,     6,    96,   782,    13,     8,   167,
        6077,    11,  8946, 11082,     3,  6941,     7,   447,   343,
           7,    13,     8, 28985,  2646,   121,   759,     7,     3,
         210,  8818,   491,  4331,   457,     6,  9593,  2866,   873,
         195,    18,  2434,   113,  3665,     8,   546,  6604,     6,
         511,  3955,   150,  2370,    50,  1462,   342,  2774,     9,
         281,    15,  8153,    17,    18, 13726,    49,     6,     8,
       19147, 10211,  4668,    13,     8,   150,  2370,  6441,     3,
          17,     7,   425,    18,    26,     9,    32,    90,    15,
           6,    11,    38,    17, 29006,     7,   447,   343,   769,
        17475,  6820,   152,     3,   524,   232, 15447,   157,  3272, ...])}

Still doesn't seem to match your scores:

my_task_1000000_predictions
{'exact_match': 58.8647114474929, 'f1': 74.25659762192635}
---------
my_task_1001100_predictions
{'exact_match': 62.96121097445601, 'f1': 77.24437498115165}
---------
my_task_1002200_predictions
{'exact_match': 63.32071901608325, 'f1': 77.75098096807234}
---------
my_task_1003300_predictions
{'exact_match': 63.90728476821192, 'f1': 78.01636939866796}
---------
my_task_1004400_predictions
{'exact_match': 64.01135288552507, 'f1': 78.36244052573069}
---------
my_task_1005500_predictions
{'exact_match': 63.94512771996216, 'f1': 78.33869388096184}
---------
my_task_1006600_predictions
{'exact_match': 64.00189214758751, 'f1': 78.60742661800172}
---------
my_task_1007700_predictions
{'exact_match': 64.11542100283822, 'f1': 78.59747568759413}
---------
my_task_1008800_predictions
{'exact_match': 64.42762535477767, 'f1': 78.92698051211497}
craffel commented 4 years ago

Hm, so if you feed in the SQuAD validation set in the question: ... context: ... format without doing any updates to the pre-trained checkpoint, you should get a score of 76.3/85.0. You can verify this by using the SQuAD Task directly before trying it with your TSV version. If your validation set does not produce those scores, there may be an issue with the encoding or data processing that is feeding the data to the model in a non-standard way.
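
With the t5 API, that sanity check looks roughly like this (a sketch, reusing an MtfModel pointed at the pre-trained checkpoint directory; verify the task name against tasks.py):

```python
# Evaluate the pre-trained checkpoint with no fine-tuning;
# checkpoint_steps=-1 means "use the latest checkpoint in model_dir".
model.eval(
    mixture_or_task_name="squad_v010_allanswers",  # verify name
    checkpoint_steps=-1,
)
```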

craffel commented 4 years ago

Two other thoughts:

  1. It looks like your code is using t5.evaluation.metrics.accuracy. How are you computing the exact_match/f1 scores? You should use the squad metric: https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/evaluation/metrics.py#L148
  2. Are you evaluating against multiple answer candidates? The SQuAD validation set (https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json) has multiple allowable candidate answers; if you are only comparing against one of those answers you will get a lower score (see the sketch below).
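
For illustration, the squad metric takes a list of answer lists, so each prediction is scored against all acceptable gold answers (a sketch; the strings are made up):

```python
from t5.evaluation import metrics

# Each target is the list of *all* acceptable gold answers for one example.
targets = [["early 1526", "1526"], ["enrico fermi"]]
predictions = ["1526", "fermi"]
print(metrics.squad(targets, predictions))
# The first prediction matches a candidate exactly; the second only earns
# partial F1 credit, so the aggregate em/f1 reflects the best match per
# example.
```
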
danyaljj commented 4 years ago

You're right! My evaluation was slightly different (I was evaluating against only one gold answer, while the original evaluation uses three different gold answers). Apologies for the inconvenience.

And now I see numbers close to what was reported!

---------
my_task_1006600_predictions
{'exact_match': 77.12393566698202, 'f1': 85.86658766596672}
---------
my_task_1007700_predictions
{'exact_match': 77.11447492904446, 'f1': 85.86113645678287}
---------
my_task_1008800_predictions
{'exact_match': 77.53074739829707, 'f1': 86.16631633427632}
---------
my_task_1009900_predictions
{'exact_match': 77.53074739829707, 'f1': 86.14874692081575}
---------
my_task_1010000_predictions
{'exact_match': 77.51182592242195, 'f1': 85.98866765537501}