ohmeow / blurr

A library that integrates huggingface transformers with the world of fastai, giving fastai devs everything they need to train, evaluate, and deploy transformer-specific models.
https://ohmeow.github.io/blurr
Apache License 2.0

Causal Language Modelling from files #47

Closed: tgalery closed this issue 2 years ago

tgalery commented 3 years ago

Hi @ohmeow , this is probably a basic question, but I'm having some issues doing causal language modelling from a set of wikitext-100-style files. My data folder is split as follows:

data/
       - train/
           - file1.txt
           - file2.txt
           - file3.txt
           - ..... 
       - valid/
           - file1.txt
           - file2.txt
           - file3.txt
           - .....

So far I have been following your gpt2 tutorial for LM:

# assumes the usual wildcard imports from the tutorial, e.g.:
#   from fastai.text.all import *
#   from blurr.data.all import *    (blurr's module layout at the time)
pretrained_model_name = "gpt2"
hf_arch, hf_config, hf_tokenizer, hf_model = BLURR.get_hf_objects(pretrained_model_name, model_cls=AutoModelForCausalLM)
if hf_tokenizer.pad_token is None: hf_tokenizer.pad_token = '[PAD]'
before_batch_tfm = HF_LMBeforeBatchTransform(hf_arch, hf_config, hf_tokenizer, hf_model, lm_strategy_cls=CausalLMStrategy)
blocks = [HF_TextBlock(before_batch_tfm=before_batch_tfm, input_return_type=HF_CausalLMInput), noop]
path = config["data_path"]
train, valid = "train", "valid"  # subfolder names under `path`
get_items = partial(get_text_files, folders=[train, valid])
dblock = DataBlock(blocks=blocks, get_items=get_items, get_y=None, splitter=config["splitter"])
dls = TextDataLoaders.from_dblock(dblock, path, path=path, seq_len=config["max_seq_len"])

config is a dict containing various params, including a splitter that can derive the train / validation sets from the data folder passed in.
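
For concreteness, here's one plausible shape for that config (hypothetical values added for illustration; the thread only shows which keys get read):

from pathlib import Path
from fastai.data.transforms import FuncSplitter

# Hypothetical config; adjust to taste. Files sit directly under
# data/train and data/valid, so we can split on the parent folder's
# name (FuncSplitter sends items returning True to the validation set).
config = {
    "data_path": "data",
    "max_seq_len": 512,
    "splitter": FuncSplitter(lambda o: Path(o).parent.name == "valid"),
}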

This is, however, throwing the following exception:

~/code/tgalery/fastai/fastai/data/core.py in from_dblock(cls, dblock, source, path, bs, val_bs, shuffle, device, **kwargs)
    190     @classmethod
    191     def from_dblock(cls, dblock, source, path='.',  bs=64, val_bs=None, shuffle=True, device=None, **kwargs):
--> 192         return dblock.dataloaders(source, path=path, bs=bs, val_bs=val_bs, shuffle=shuffle, device=device, **kwargs)
    193 
    194     _docs=dict(__getitem__="Retrieve `DataLoader` at `i` (`0` is training, `1` is validation)",

~/code/tgalery/fastai/fastai/data/block.py in dataloaders(self, source, path, verbose, **kwargs)
    113         dsets = self.datasets(source, verbose=verbose)
    114         kwargs = {**self.dls_kwargs, **kwargs, 'verbose': verbose}
--> 115         return dsets.dataloaders(path=path, after_item=self.item_tfms, after_batch=self.batch_tfms, **kwargs)
    116 
    117     _docs = dict(new="Create a new `DataBlock` with other `item_tfms` and `batch_tfms`",

~/code/tgalery/fastai/fastai/data/core.py in dataloaders(self, bs, shuffle_train, shuffle, val_shuffle, n, path, dl_type, dl_kwargs, device, drop_last, val_bs, **kwargs)
    229         val_kwargs={k[4:]:v for k,v in kwargs.items() if k.startswith('val_')}
    230         def_kwargs = {'bs':bs,'shuffle':shuffle,'drop_last':drop_last,'n':n,'device':device}
--> 231         dl = dl_type(self.subset(0), **merge(kwargs,def_kwargs, dl_kwargs[0]))
    232         def_kwargs = {'bs':bs if val_bs is None else val_bs,'shuffle':val_shuffle,'n':None,'drop_last':False}
    233         dls = [dl] + [dl.new(self.subset(i), **merge(kwargs,def_kwargs,val_kwargs,dl_kwargs[i]))

~/code/tgalery/fastai/fastai/text/data.py in __init__(self, dataset, sort_func, res, **kwargs)
    187         self.sort_func = _default_sort if sort_func is None else sort_func
    188         if res is None and self.sort_func == _default_sort: res = _get_lengths(dataset)
--> 189         self.res = [self.sort_func(self.do_item(i)) for i in range_of(self.dataset)] if res is None else res
    190         if len(self.res) > 0: self.idx_max = np.argmax(self.res)
    191 

~/code/tgalery/fastai/fastai/text/data.py in <listcomp>(.0)
    187         self.sort_func = _default_sort if sort_func is None else sort_func
    188         if res is None and self.sort_func == _default_sort: res = _get_lengths(dataset)
--> 189         self.res = [self.sort_func(self.do_item(i)) for i in range_of(self.dataset)] if res is None else res
    190         if len(self.res) > 0: self.idx_max = np.argmax(self.res)
    191 

~/code/tgalery/fastai/fastai/text/data.py in _default_sort(x)
    178 
    179 # Cell
--> 180 def _default_sort(x): return len(x[0])
    181 
    182 @delegates(TfmdDL)

TypeError: object of type 'PosixPath' has no len()
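
For what it's worth, the failure itself is easy to reproduce in isolation (a minimal repro added for illustration): _default_sort takes len(x[0]), and at this point each item is still a pathlib path rather than the file's text:

from pathlib import Path

item = (Path("data/train/file1.txt"),)  # what _default_sort receives: a tuple holding a path
len(item[0])  # TypeError: object of type 'PosixPath' has no len()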

Any pointers on what I might be doing wrong?

ohmeow commented 3 years ago

It looks like you're mixing in some of the fastai bits used for working with an LSTM here. This line:

dls = TextDataLoaders.from_dblock(dblock, path, path=path, seq_len=config["max_seq_len"])

should probably be changed to something like this:


dls = dblock.dataloaders(<your data>)
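
Concretely, with the source path from your config, that would presumably be something like:

dls = dblock.dataloaders(config["data_path"])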

Give it a go and lmk.


tgalery commented 3 years ago

Yeah, I tried your suggestion and it resulted in essentially the same error. I did some extra digging and found that there's a dl_type kwarg passed to the DataBlock API by the TextBlock API (which in turn is called by the TextDataLoaders object); if you don't pass anything, dl_type defaults to SortedDL, which could be responsible for the error above. So I changed a few things; the rewritten function is this:

def text_loader_from_blocks(blocks, config, train="train", valid="valid"):
    path = config["data_path"]
    get_items = partial(get_text_files, folders=[train, valid])
    dblock = DataBlock(blocks=blocks, get_items=get_items, splitter=config["splitter"], dl_type=LMDataLoader)
    return dblock.dataloaders(path, seq_len=config["max_seq_len"], verbose=True)

Now the verbose output looks like this:

Collecting items from /media/HD/data/pt_wiki/wiki/pt-2
Found 786 items
2 datasets of sizes 730,56
Setting up Pipeline: 
Setting up Pipeline:

So it looks like it's able to do the split between training and validation and identify the right number of files, but the fact that the Pipelines are empty is a bit funny. Usually they would look like this (in the LSTM case):

Setting up Pipeline: Tokenizer -> Numericalize
Setting up after_item: Pipeline: ToTensor
Setting up before_batch: Pipeline: 
Setting up after_batch: Pipeline:

In any event, I get the following stack trace:

TypeError                                 Traceback (most recent call last)
<ipython-input-9-72d381481c19> in <module>
----> 1 lm_loader = text_loader_from_blocks(blocks, SAMPLE_CFG)

~/train.py in text_loader_from_blocks(blocks, config, train, valid)
     78     get_items = partial(get_text_files, folders=[train, valid])
     79     dblock = DataBlock(blocks=blocks, get_items=get_items, splitter=config["splitter"], dl_type=LMDataLoader)
---> 80     return dblock.dataloaders(path, seq_len=config["max_seq_len"], verbose=True)
     81 
     82 

~/code/tgalery/fastai/fastai/data/block.py in dataloaders(self, source, path, verbose, **kwargs)
    113         dsets = self.datasets(source, verbose=verbose)
    114         kwargs = {**self.dls_kwargs, **kwargs, 'verbose': verbose}
--> 115         return dsets.dataloaders(path=path, after_item=self.item_tfms, after_batch=self.batch_tfms, **kwargs)
    116 
    117     _docs = dict(new="Create a new `DataBlock` with other `item_tfms` and `batch_tfms`",

~/code/tgalery/fastai/fastai/data/core.py in dataloaders(self, bs, shuffle_train, shuffle, val_shuffle, n, path, dl_type, dl_kwargs, device, drop_last, val_bs, **kwargs)
    229         val_kwargs={k[4:]:v for k,v in kwargs.items() if k.startswith('val_')}
    230         def_kwargs = {'bs':bs,'shuffle':shuffle,'drop_last':drop_last,'n':n,'device':device}
--> 231         dl = dl_type(self.subset(0), **merge(kwargs,def_kwargs, dl_kwargs[0]))
    232         def_kwargs = {'bs':bs if val_bs is None else val_bs,'shuffle':val_shuffle,'n':None,'drop_last':False}
    233         dls = [dl] + [dl.new(self.subset(i), **merge(kwargs,def_kwargs,val_kwargs,dl_kwargs[i]))

~/code/tgalery/fastai/fastai/text/data.py in __init__(self, dataset, lens, cache, bs, seq_len, num_workers, **kwargs)
     75         self.seq_len = seq_len
     76         if lens is None: lens = _get_lengths(dataset)
---> 77         if lens is None: lens = [len(o) for o in self.items]
     78         self.lens = ReindexCollection(lens, idxs=self.items.idxs)
     79         # The "-1" is to allow for final label, we throw away the end that's less than bs

~/code/tgalery/fastai/fastai/text/data.py in <listcomp>(.0)
     75         self.seq_len = seq_len
     76         if lens is None: lens = _get_lengths(dataset)
---> 77         if lens is None: lens = [len(o) for o in self.items]
     78         self.lens = ReindexCollection(lens, idxs=self.items.idxs)
     79         # The "-1" is to allow for final label, we throw away the end that's less than bs

TypeError: object of type 'PosixPath' has no len()

The funny thing is that _get_lengths(dataset) above on line 76 should have worked, as it is defined like this:

def _get_lengths(ds):
    tok = _get_tokenizer(ds)
    if tok is None: return
    return tok.get_lengths(ds.items)

So in a way, it seems that the tokenizer passed in the blocks is not being set properly (which we can see from the pipeline statements above). Any pointers?
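
For what it's worth, one workaround I'm considering (a sketch only, untested against this exact setup) is to read each file's contents up front via get_x, so the DataBlock items are strings rather than paths and len() has something to measure:

# Sketch: map each PosixPath to the file's raw text. Same blocks/config as
# above; the DataLoader type is left at its default here.
def text_loader_from_blocks(blocks, config, train="train", valid="valid"):
    path = config["data_path"]
    get_items = partial(get_text_files, folders=[train, valid])
    dblock = DataBlock(
        blocks=blocks,
        get_items=get_items,
        get_x=lambda p: p.read_text(),  # PosixPath -> file contents
        splitter=config["splitter"],
    )
    return dblock.dataloaders(path, verbose=True)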

ohmeow commented 2 years ago

If you want to post a gist I can get at, I can see if I can help you get things working.

-wg

tgalery commented 2 years ago

Thanks for this. I've created this gist https://gist.github.com/tgalery/fa0de7b0c69ab48534b26a9151676fc1 and uploaded a sample of texts extracted from the Portuguese wiki here: https://drive.google.com/file/d/1IsgRFoFL4VGmQnb-oaQrTRzysRDMUYPV/view?usp=sharing (there are train / valid / test folders with some files).

tgalery commented 2 years ago

Any updates?

ohmeow commented 2 years ago

I haven't forgotten about you ... just busy with work :)

Will keep you posted on this


ohmeow commented 2 years ago


I have an example using your dataset. Do you mind if I post it as an example of how to do this in the blurr docs? (Just want to check that there is nothing private/sensitive in the data.) I won't include the dataset itself, but show_batch and show_results will show a few examples.

Lmk.

Thanks - wg

ohmeow commented 2 years ago

I just updated the GH issue. Got a working example ... was wondering if you're ok with me including it as an example in the blurr docs (it just shows a couple of examples from your dataset).

Lmk.

Thanks - wg


tgalery commented 2 years ago

Yeah, I'm super ok with it. Once you have the link, let me know so I can try it myself. Sorry for the delay; I was moving countries, so I'm a bit thin on time.

ohmeow commented 2 years ago

Cool thanks ...

https://ohmeow.github.io/blurr/examples-causal-lm-gpt2/
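
In short (a paraphrased sketch, not the verbatim docs code): the example reads each file's raw text up front, so blurr's batch transform tokenizes strings rather than paths, along these lines:

import pandas as pd
from pathlib import Path

# Hypothetical helper mirroring the idea in the linked example: gather the
# raw text of every file, tagging which split each row belongs to.
def wiki_texts_df(data_path):
    rows = []
    for split in ("train", "valid"):
        for f in sorted((Path(data_path) / split).glob("*.txt")):
            rows.append({"text": f.read_text(), "is_valid": split == "valid"})
    return pd.DataFrame(rows)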
