hyunwoongko / transformer

Transformer: PyTorch Implementation of "Attention Is All You Need"

how to get dataset #12

Open Sun-Happy-YKX opened 1 year ago

Sun-Happy-YKX commented 1 year ago

I'm new to transformers and don't know how to get the dataset used in this project. Could you please provide a Linux script for downloading it, if possible?

Gi-gigi commented 1 year ago

Did you manage to solve this? Could we discuss it further?

Luoxiaofan666 commented 10 months ago

I have the same question. Please provide a Linux script if you can.

Exiurs commented 8 months ago

https://blog.csdn.net/xunan003/article/details/130110232

Shengqi-Kong commented 5 months ago

https://blog.csdn.net/xunan003/article/details/130110232

The link is dead: it returns 403 Forbidden right away. No wonder running the code fails too; the server is simply down.

JaceJu-frog commented 3 months ago

First, download the training, validation, and test archives to your own computer:

wget "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/training.tar.gz"
wget "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/validation.tar.gz"
wget "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/mmt_task1_test2016.tar.gz"

Then extract them into any directory (for example "~/Python/DATASETS/Multi30k/"). After that, you can use the TranslationDataset class to load the data and split it:

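# Legacy torchtext API: works as-is on torchtext <= 0.8;
# on 0.9-0.11 import from torchtext.legacy.data / torchtext.legacy.datasets instead.
from torchtext.data import Field
from torchtext.datasets import TranslationDataset, Multi30k

# Directory that holds the extracted files (expected: train.*, val.*, test2016.*).
ROOT = '~/Python/DATASETS/Multi30k/'
# Multi30k.download(ROOT)  # left commented out: the archives were already fetched manually

# Placeholder fields (an assumption); swap in the tokenizers your project uses.
srcfield = Field(lower=True, init_token='<sos>', eos_token='<eos>')
tgtfield = Field(lower=True, init_token='<sos>', eos_token='<eos>')

(trnset, valset, testset) = TranslationDataset.splits(
                                      path       = ROOT,
                                      exts       = ['.en', '.de'],
                                      fields     = [('src', srcfield), ('trg', tgtfield)],
                                      test       = 'test2016'
                                      )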

ref: https://github.com/pytorch/text/issues/312#issuecomment-406092660
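
For anyone who would rather script the download and extraction steps than run wget by hand, here is a minimal sketch using only the Python standard library. The URLs are the ones quoted above; the helper names `extract_archive` and `fetch_multi30k` are made up for this example, and I have not checked that the mirror will stay available.

```python
import tarfile
import urllib.request
from pathlib import Path

# Mirror URLs as quoted in this thread (availability is not guaranteed).
BASE = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/"
ARCHIVES = ["training.tar.gz", "validation.tar.gz", "mmt_task1_test2016.tar.gz"]


def extract_archive(archive: Path, dest: Path) -> list:
    """Extract a .tar.gz archive into dest and return its member names."""
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(dest)
    return names


def fetch_multi30k(dest: Path) -> None:
    """Download each archive into dest (skipping files already present) and unpack it."""
    dest.mkdir(parents=True, exist_ok=True)
    for name in ARCHIVES:
        target = dest / name
        if not target.exists():
            urllib.request.urlretrieve(BASE + name, target)
        extract_archive(target, dest)
```

To use it, call `fetch_multi30k(Path("~/Python/DATASETS/Multi30k").expanduser())` and then point ROOT at the same directory.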