google-research / text-to-text-transfer-transformer

Code for the paper "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer"
https://arxiv.org/abs/1910.10683
Apache License 2.0

Training T5 from scratch on a new language? #269

Closed ritvik1512 closed 4 years ago

ritvik1512 commented 4 years ago

Hi I was wondering if there are any guidelines or documentation as to pre-training T5 from scratch (not just to any particular downstream task) in a new language?

Also is it possible to do the same with PyTorch under the current framework?

Please let me know if this is not the right place to discuss this, thank you!

ritvik1512 commented 4 years ago

I did refer to issue #172 for this, but that just seems to be initializing it for fine-tuning on a specific task?

agemagician commented 4 years ago

Check this: https://github.com/google-research/google-research/tree/master/t5_closed_book_qa

ritvik1512 commented 4 years ago

I'm sorry if I am missing something, but isn't this training specifically for the QA task?

agemagician commented 4 years ago

Check also this: https://github.com/google-research/text-to-text-transfer-transformer/issues/253

ritvik1512 commented 4 years ago

Hi, thanks for the link. Did you guys perform unsupervised pre-training of these models from scratch on this dataset?

Any ideas on how this would shift to a new language? (Also if there is any chance I could make it work with Pytorch)

Sorry about the series of questions, but thanks for the help!

huseinzol05 commented 4 years ago

I pretrained T5 base and small on the Malay language (Malaysia); all the steps are here: https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/t5

The steps to generate the SentencePiece vocabulary for this T5 model are here (see step no. 4): https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/preprocess
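
Roughly, the SentencePiece step looks like this (a minimal sketch, not my exact script; the corpus path, output prefix, and sizes below are placeholders, and the keyword form needs a reasonably recent sentencepiece release):

      import sentencepiece as spm

      # Train a unigram SentencePiece model for T5-style pre-training.
      # File names and vocab size here are placeholders, not Malaya's real settings.
      spm.SentencePieceTrainer.train(
          input="corpus.txt",          # plain-text corpus, one sentence per line
          model_prefix="sp.cased",     # writes sp.cased.model and sp.cased.vocab
          vocab_size=32000,            # T5's default vocabulary size
          model_type="unigram",
          character_coverage=0.99995,
      )

The t5 library then wraps the resulting .model file as its vocabulary and adds the 100 sentinel tokens used for span corruption (extra_ids=100), so you don't need to put the sentinels into the SentencePiece model yourself.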

I increased the input and output lengths to 1024 because our use case is summarizing long texts (https://malaya.readthedocs.io/en/latest/Abstractive.html#load-t5) and generating long texts given important contexts (https://malaya.readthedocs.io/en/latest/Generator.html#load-t5).
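
If you are wondering where that length is set: a rough sketch with the t5 model API, assuming made-up paths and hardware settings, would be something like

      import t5

      # Sketch only: model_dir, batch size, and parallelism are placeholders.
      model = t5.models.MtfModel(
          model_dir="gs://my-bucket/t5-base-new-language",
          tpu=None,  # or your TPU address
          model_parallelism=1,
          batch_size=16,
          sequence_length={"inputs": 1024, "targets": 1024},  # the longer lengths mentioned above
      )

The same lengths can also be set through gin if you drive training with the command-line tools instead.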

You can find our T5 model on Hugging Face: https://huggingface.co/huseinzol05/t5-base-bahasa-cased

I have never seen a seq2seq model as powerful as T5.
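
For reference, loading that checkpoint with the transformers library looks roughly like this (a sketch only; the "ringkasan:" prefix and generation settings are assumptions, so check the Malaya docs for the exact prompts):

      from transformers import T5Tokenizer, T5ForConditionalGeneration

      # Model id taken from the Hugging Face link above.
      tokenizer = T5Tokenizer.from_pretrained("huseinzol05/t5-base-bahasa-cased")
      model = T5ForConditionalGeneration.from_pretrained("huseinzol05/t5-base-bahasa-cased")

      # Hypothetical summarization-style prompt; replace with the prefix the model was trained on.
      inputs = tokenizer("ringkasan: teks panjang di sini", return_tensors="pt")
      outputs = model.generate(**inputs, max_length=128)
      print(tokenizer.decode(outputs[0], skip_special_tokens=True))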

ritvik1512 commented 4 years ago

@huseinzol05 I understand it is in TensorFlow, but it is still extremely helpful and very close to what I am looking for. Thanks for taking the time to share the details!

traumasv commented 4 years ago

@ritvik1512 I was able to get the PyTorch model going, and here's my team's notebook:

https://colab.research.google.com/github/jameschartouni/arabic_translation/blob/google-t5/Model_1.ipynb#scrollTo=UL5yLXs4YJw7

but I'm still trying to figure out how the operative_config.gin should be adjusted when pre-training the PyTorch model.

I know that you have put out the API for fine-tuning purposes only, but is there a way to correctly set up operative_config.gin for pre-training the HfPyTorch model?

adarob commented 4 years ago

Hey folks. I'm going to work on setting up the unsupervised task to not require the use of gin.

adarob commented 4 years ago

PTAL at #274 and see if it helps.

ritvik1512 commented 4 years ago

@traumasv thanks for sharing the notebook! If I get this correctly, you guys are trying to implement translation for Arabic from scratch?

ritvik1512 commented 4 years ago

Thanks @adarob! If it works well with @traumasv's task, I will be trying to implement the same.

traumasv commented 4 years ago

> @traumasv thanks for sharing the notebook! If I get this correctly, you guys are trying to implement translation for Arabic from scratch?

Yes, that's right.

traumasv commented 4 years ago

> PTAL at #274 and see if it helps.

@adarob Thank you for such a quick reply and solution!

I tried adding the token_preprocessor functions to my tasks and ran training with the API without the gin file, but it looks like there's a binding missing for 'denoise'?

Could this be specified in the TaskRegistry.add() call or in model.train()?

adarob commented 4 years ago

Which binding is missing? Can you share the error message?

ritvik1512 commented 4 years ago

@traumasv ah right, I was aiming for a slightly different approach of first pre-training the model on one particular language and then later fine-tuning for downstream tasks, but thanks nonetheless!

traumasv commented 4 years ago

Here's the link to the cell with the error: https://colab.research.google.com/drive/1eOjdqErmzxOED4tbyNddzyCwuqonSvqd#scrollTo=f6f5uUWXWUKw&line=4&uniqifier=1

ashispapu commented 4 years ago

> Hi, thanks for the link. Did you guys perform unsupervised pre-training of these models from scratch on this dataset?
>
> Any ideas on how this would shift to a new language? (Also if there is any chance I could make it work with PyTorch)
>
> Sorry about the series of questions, but thanks for the help!

@ritvik1512 Hey, did you try pre-training on one language and then fine-tuning for downstream tasks? I'm also exploring the same but haven't come across any useful resources. Please let me know if you have made any progress.

ashispapu commented 4 years ago

@adarob @traumasv Is there any fix available for this binding issue? RuntimeError: Required bindings for 'denoise' not provided in config: ['noise_mask_fn']

craffel commented 4 years ago

denoise has no argument noise_function; it should use the argument noise_mask_fn (see https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1749). The functools.partial call for denoise should be changed to:

      functools.partial(
          preprocessors.denoise,
          inputs_fn=preprocessors.noise_span_to_unique_sentinel,
          targets_fn=preprocessors.nonnoise_span_to_unique_sentinel,
          noise_density=0.15,
          noise_mask_fn=preprocessors.iid_noise_mask
      )
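
For anyone else setting this up, a rough sketch of where that partial goes when registering an unsupervised task (the task name, dataset_fn, and vocabulary path are placeholders, and argument names can differ slightly between t5 versions):

      import functools
      import t5
      from t5.data import preprocessors

      t5.data.TaskRegistry.add(
          "span_corruption_my_language",                # hypothetical task name
          dataset_fn=my_dataset_fn,                     # your function returning a tf.data.Dataset with a "targets" text field
          splits=["train"],
          text_preprocessor=None,
          sentencepiece_model_path="gs://my-bucket/sp.model",  # hypothetical vocab path
          token_preprocessor=functools.partial(
              preprocessors.denoise,
              inputs_fn=preprocessors.noise_span_to_unique_sentinel,
              targets_fn=preprocessors.nonnoise_span_to_unique_sentinel,
              noise_density=0.15,
              noise_mask_fn=preprocessors.iid_noise_mask,
          ),
          metric_fns=[],
      )
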
RTKno1 commented 4 years ago

Hi @craffel, I am continuing the work that @traumasv was doing last month. I tried implementing that function in our code and got training to work. However, I then encountered a KeyError: input_plaintext when running eval, similar to #173. I also tried that solution of removing the postprocessor and metrics from the tasks, but then I get a KeyError: translation_en_msa (which is a task name) at line 438 in hf_model.py. I saw that batches there is only populated if the task has metric_fns, so I uncommented that line, but I still get the same error. I can't seem to figure this out; could you please take a look? Here is the Colab link to the current code; the relevant sections are under the English to Arabic Task, Levantine to MSA Task, and Maghrib to MSA Task, plus the train and eval cells.

Thank you!

Stellakats commented 4 years ago

> Hi @craffel I am continuing the work that @traumasv was doing last month, I tried implementing that function into our code, and got training to work. However I then encountered a keyError: input_plaintext in running eval similar to #173. [...] I can't seem to figure this out, could you please take a look?

Hi! Did you manage to find a solution for this?

RTKno1 commented 3 years ago

Hi @Stellakats! Yes, I did, though it may not be what you are looking for. I just followed @huseinzol05's task registry setup here for the arguments: https://github.com/huseinzol05/Malaya/blob/master/pretrained-model/t5/prepare/finetune-summarization.ipynb. So I removed the token_preprocessor argument. You can also view the Colab link I posted and navigate to the "Arabic to English Task" section to see how we add the task to the TaskRegistry.
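
In other words, the registration ended up looking roughly like this (a sketch following the pattern in the notebook above rather than our exact code; the task name, dataset function, preprocessor, and metric choice are placeholders):

      import t5

      t5.data.TaskRegistry.add(
          "translation_ar_en",                           # hypothetical task name
          dataset_fn=translation_dataset_fn,             # yields examples with "inputs" and "targets" text
          splits=["train"],
          text_preprocessor=[translation_preprocessor],  # e.g. adds a task prefix to "inputs"
          sentencepiece_model_path="gs://my-bucket/sp.model",  # hypothetical vocab path
          postprocess_fn=t5.data.postprocessors.lower_text,
          metric_fns=[t5.evaluation.metrics.bleu],
          # no token_preprocessor here, which is what avoided the 'denoise' binding error
      )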

hiiamsid commented 3 years ago

@ritvik1512 Were you able to get T5 working on a non-English language?

PiotrNawrot commented 1 year ago

We've released nanoT5, which reproduces T5 (a BART-like model) pre-training in PyTorch (not Flax).

You can take a look!

Any suggestions are more than welcome.