google-research / text-to-text-transfer-transformer

Code for the paper "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer"
https://arxiv.org/abs/1910.10683
Apache License 2.0
6.17k stars 756 forks

Trying to use BioASQ dataset causes error during fine-tuning #90

Closed ctmckee closed 4 years ago

ctmckee commented 4 years ago

I am following your notebook on context-free QA (thanks for setting that up). I am substituting your "natural_question" dataset with BioASQ-training7b. I am only using "exact_answers" entries from BioASQ that have a length of one in the TSV files I create for training and validation. Visually, these TSV files seem correct (attached PNG). Also, all of the cells in your notebook work with my TSV files (and I create a data mixture of BioASQ and TriviaQA). However, when I run the model.finetune cell, I get the following error immediately after the "Enqueue next (100) batch(es)" and "Dequeue next (100) batch(es)" log messages:

  INFO:tensorflow:Enqueue next (100) batch(es) of data to infeed.
  INFO:tensorflow:Dequeue next (100) batch(es) of data from outfeed.
  ERROR:tensorflow:Error recorded from infeed: From /job:worker/replica:0/task:0: {{function_node __inference_Datasetmap<class 'functools.partial'>_1622}} Expect 2 fields but have 1 in record 0 [[{{node DecodeCSV}}]] [[while/IteratorGetNext]]

Any suggestions?

[Attached image: exampleOfTSV]

ctmckee commented 4 years ago

Also, I ran through your full notebook using your defined data mixture, and everything worked great.

craffel commented 4 years ago

That error makes it sound like it is looking for a "\t" but not finding it in one of the lines (possibly the first). If you run something like

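  import functools

  import tensorflow as tf

  tf.enable_eager_execution()  # TF 1.x only; eager execution is the default in TF 2.x
  ds = tf.data.TextLineDataset(path_to_your_tsv_file)
  # Parse every line as two tab-separated string fields.
  ds = ds.map(
      functools.partial(tf.io.decode_csv, record_defaults=["", ""],
                        field_delim="\t", use_quote_delim=False),
      num_parallel_calls=tf.data.experimental.AUTOTUNE)
  # Print only the first parsed example.
  for ex in ds:
    print(ex)
    break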

does it produce the same error? If so, you may need to check the format of the TSV and the arguments of tf.io.decode_csv.

ctmckee commented 4 years ago

Hi, and thanks for the quick response. The code you provided does not produce the same error.

It produced:

(<tf.Tensor: id=1249, shape=(), dtype=string, numpy=b'What type of genome, (RNA or DNA, double stranded single stranded) is found in the the virus that causes blue tongue disease?'>, <tf.Tensor: id=1250, shape=(), dtype=string, numpy=b'double stranded, segmented RNA'>)

craffel commented 4 years ago

Hm, can you replace the

  print(ex)
  break

with

  pass

and see if it iterates through the full dataset without seeing the error you originally posted?
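Iterating over the whole dataset shows that a bad line exists, but not necessarily which one. As a minimal sketch, the same check can be done in plain Python so that every malformed line is reported with its line number. This assumes a UTF-8 file that should contain exactly two tab-separated fields per line; `find_bad_lines` is a hypothetical helper written for illustration, not part of T5 or TensorFlow.

```python
def find_bad_lines(path, expected_fields=2):
    """Return (line_number, text) pairs for lines without the expected field count."""
    bad = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            text = line.rstrip("\r\n")
            # Split on tabs only, mirroring field_delim="\t" in the snippet above.
            if len(text.split("\t")) != expected_fields:
                bad.append((lineno, text))
    return bad
```

Calling `find_bad_lines(path_to_your_tsv_file)` on the train and validation files separately would show which file and which lines need fixing. Blank lines are also reported, since they contain a single empty field.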

ctmckee commented 4 years ago

AHA! Thank you. My validation set is OK, but the train set produced the error when it hit this line:

  Which histone modifications are correlated with transcription elongation? ['H3K36me3']

I will run a replace to take out "['" and "']".
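The replacement described above could be sketched as follows. This assumes the answer is the last tab-separated field on each line and holds a single answer written like `['H3K36me3']`; multi-element lists are not handled, and a line with no tab is treated as one field, so it still needs a manual check. `strip_list_wrapper` and `clean_line` are hypothetical names used only for illustration.

```python
def strip_list_wrapper(text):
    """Remove a leading "['" and a trailing "']" from a single-answer string."""
    text = text.strip()
    if text.startswith("['") and text.endswith("']"):
        return text[2:-2]
    return text


def clean_line(line):
    """Clean the last tab-separated field of one TSV line; other fields are kept as-is."""
    fields = line.rstrip("\r\n").split("\t")
    fields[-1] = strip_list_wrapper(fields[-1])
    return "\t".join(fields)
```

Applying `clean_line` to each line and writing the results to a new file, rather than editing in place, keeps the original TSV available for comparison.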

craffel commented 4 years ago

👍