ohmeow / blurr

A library that integrates huggingface transformers with the world of fastai, giving fastai devs everything they need to train, evaluate, and deploy transformer specific models.
Apache License 2.0

Causal Language Modelling from files #47

Closed tgalery closed 2 years ago

tgalery commented 3 years ago

Hi @ohmeow, this is probably a basic question, but I'm having trouble doing causal language modelling from a set of wikitext-100 style files. My data folder is laid out as follows:

       - train/
           - file1.txt
           - file2.txt
           - file3.txt
           - ..... 
       - valid/
           - file1.txt
           - file2.txt
           - file3.txt
           - .....

So far I have been following your gpt2 tutorial for LM:

pretrained_model_name = "gpt2"
hf_arch, hf_config, hf_tokenizer, hf_model = BLURR.get_hf_objects(pretrained_model_name, model_cls=AutoModelForCausalLM)
if (hf_tokenizer.pad_token is None): hf_tokenizer.pad_token = '[PAD]'
before_batch_tfm = HF_LMBeforeBatchTransform(hf_arch, hf_config, hf_tokenizer, hf_model, lm_strategy_cls=CausalLMStrategy)
blocks = [HF_TextBlock(before_batch_tfm=before_batch_tfm, input_return_type=HF_CausalLMInput), noop]
path = config["data_path"]
get_items = partial(get_text_files, folders=[train, valid])
dblock = DataBlock(blocks=blocks, get_items=get_items, get_y=None, splitter=config["splitter"])
dls = TextDataLoaders.from_dblock(dblock, path, path=path, seq_len=config["max_seq_len"])

Config is a dict containing various params and a splitter class that can get the train / validation sets from the data folder passed in.

This, however, throws the following exception:

~/code/tgalery/fastai/fastai/data/core.py in from_dblock(cls, dblock, source, path, bs, val_bs, shuffle, device, **kwargs)
    190     @classmethod
    191     def from_dblock(cls, dblock, source, path='.',  bs=64, val_bs=None, shuffle=True, device=None, **kwargs):
--> 192         return dblock.dataloaders(source, path=path, bs=bs, val_bs=val_bs, shuffle=shuffle, device=device, **kwargs)
    194     _docs=dict(__getitem__="Retrieve `DataLoader` at `i` (`0` is training, `1` is validation)",

~/code/tgalery/fastai/fastai/data/block.py in dataloaders(self, source, path, verbose, **kwargs)
    113         dsets = self.datasets(source, verbose=verbose)
    114         kwargs = {**self.dls_kwargs, **kwargs, 'verbose': verbose}
--> 115         return dsets.dataloaders(path=path, after_item=self.item_tfms, after_batch=self.batch_tfms, **kwargs)
    117     _docs = dict(new="Create a new `DataBlock` with other `item_tfms` and `batch_tfms`",

~/code/tgalery/fastai/fastai/data/core.py in dataloaders(self, bs, shuffle_train, shuffle, val_shuffle, n, path, dl_type, dl_kwargs, device, drop_last, val_bs, **kwargs)
    229         val_kwargs={k[4:]:v for k,v in kwargs.items() if k.startswith('val_')}
    230         def_kwargs = {'bs':bs,'shuffle':shuffle,'drop_last':drop_last,'n':n,'device':device}
--> 231         dl = dl_type(self.subset(0), **merge(kwargs,def_kwargs, dl_kwargs[0]))
    232         def_kwargs = {'bs':bs if val_bs is None else val_bs,'shuffle':val_shuffle,'n':None,'drop_last':False}
    233         dls = [dl] + [dl.new(self.subset(i), **merge(kwargs,def_kwargs,val_kwargs,dl_kwargs[i]))

~/code/tgalery/fastai/fastai/text/data.py in __init__(self, dataset, sort_func, res, **kwargs)
    187         self.sort_func = _default_sort if sort_func is None else sort_func
    188         if res is None and self.sort_func == _default_sort: res = _get_lengths(dataset)
--> 189         self.res = [self.sort_func(self.do_item(i)) for i in range_of(self.dataset)] if res is None else res
    190         if len(self.res) > 0: self.idx_max = np.argmax(self.res)

~/code/tgalery/fastai/fastai/text/data.py in <listcomp>(.0)
    187         self.sort_func = _default_sort if sort_func is None else sort_func
    188         if res is None and self.sort_func == _default_sort: res = _get_lengths(dataset)
--> 189         self.res = [self.sort_func(self.do_item(i)) for i in range_of(self.dataset)] if res is None else res
    190         if len(self.res) > 0: self.idx_max = np.argmax(self.res)

~/code/tgalery/fastai/fastai/text/data.py in _default_sort(x)
    179 # Cell
--> 180 def _default_sort(x): return len(x[0])
    182 @delegates(TfmdDL)

TypeError: object of type 'PosixPath' has no len()

Any pointers on what I might be doing wrong ?
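
For readers following along, the failure at the bottom of the traceback can be reproduced without fastai. The sketch below is only an illustration: `default_sort_sketch` is a hypothetical stand-in that copies the one-line body of `_default_sort` shown above, and the item tuples are made up. It shows that the sort key fails whenever the first element of an item is still a path rather than a sequence of tokens.

```python
from pathlib import Path


def default_sort_sketch(item):
    """Hypothetical stand-in for the sort key in the traceback: length of the first element."""
    return len(item[0])


# An item whose first element is still a file path (nothing has read or tokenized it yet).
path_item = (Path("train/file1.txt"),)

try:
    default_sort_sketch(path_item)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# The same key works once the first element is a sequence, e.g. a list of token ids.
token_item = ([101, 2023, 2003, 102],)
token_length = default_sort_sketch(token_item)

print(error_message)
print(token_length)
```

In other words, the exception suggests that the items reaching the data loader are raw paths, so whatever step is meant to turn a file into text or tokens has not run by that point.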

ohmeow commented 3 years ago

It looks like you're mixing some of the fastai bits used to work with an LSTM in here. This line:

dls = TextDataLoaders.from_dblock(dblock, path, path=path,

should probably be changed to something like this:

dls = dblock.dataloaders(<your data>)

Give it a go and lmk.

tgalery commented 3 years ago

Yeah, I tried your suggestion and it resulted in essentially the same error. I did some extra digging and found that the TextBlock API (which in turn is called by the TextDataLoaders object) passes a dl_type kwarg to the data block API. If you don't pass anything, dl_type defaults to SortedDL, which could be responsible for the error above. So I changed a few things; the rewritten function is this:

def text_loader_from_blocks(blocks, config, train="train", valid="valid"):
    path = config["data_path"]
    get_items = partial(get_text_files, folders=[train, valid])
    dblock = DataBlock(blocks=blocks, get_items=get_items, splitter=config["splitter"], dl_type=LMDataLoader)
    return dblock.dataloaders(path, seq_len=config["max_seq_len"], verbose=True)

Now the verbose output looks like this:

Collecting items from /media/HD/data/pt_wiki/wiki/pt-2
Found 786 items
2 datasets of sizes 730,56
Setting up Pipeline: 
Setting up Pipeline:

So it looks like it's able to split training and validation and to identify the right number of files, but the fact that the Pipelines are empty is a bit funny. Usually they would look like this (in the LSTM case):

Setting up Pipeline: Tokenizer -> Numericalize
Setting up after_item: Pipeline: ToTensor
Setting up before_batch: Pipeline: 
Setting up after_batch: Pipeline:

In any event, I get the following stack trace:

TypeError                                 Traceback (most recent call last)
<ipython-input-9-72d381481c19> in <module>
----> 1 lm_loader = text_loader_from_blocks(blocks, SAMPLE_CFG)

~/train.py in text_loader_from_blocks(blocks, config, train, valid)
     78     get_items = partial(get_text_files, folders=[train, valid])
     79     dblock = DataBlock(blocks=blocks, get_items=get_items, splitter=config["splitter"], dl_type=LMDataLoader)
---> 80     return dblock.dataloaders(path, seq_len=config["max_seq_len"], verbose=True)

~/code/tgalery/fastai/fastai/data/block.py in dataloaders(self, source, path, verbose, **kwargs)
    113         dsets = self.datasets(source, verbose=verbose)
    114         kwargs = {**self.dls_kwargs, **kwargs, 'verbose': verbose}
--> 115         return dsets.dataloaders(path=path, after_item=self.item_tfms, after_batch=self.batch_tfms, **kwargs)
    117     _docs = dict(new="Create a new `DataBlock` with other `item_tfms` and `batch_tfms`",

~/code/tgalery/fastai/fastai/data/core.py in dataloaders(self, bs, shuffle_train, shuffle, val_shuffle, n, path, dl_type, dl_kwargs, device, drop_last, val_bs, **kwargs)
    229         val_kwargs={k[4:]:v for k,v in kwargs.items() if k.startswith('val_')}
    230         def_kwargs = {'bs':bs,'shuffle':shuffle,'drop_last':drop_last,'n':n,'device':device}
--> 231         dl = dl_type(self.subset(0), **merge(kwargs,def_kwargs, dl_kwargs[0]))
    232         def_kwargs = {'bs':bs if val_bs is None else val_bs,'shuffle':val_shuffle,'n':None,'drop_last':False}
    233         dls = [dl] + [dl.new(self.subset(i), **merge(kwargs,def_kwargs,val_kwargs,dl_kwargs[i]))

~/code/tgalery/fastai/fastai/text/data.py in __init__(self, dataset, lens, cache, bs, seq_len, num_workers, **kwargs)
     75         self.seq_len = seq_len
     76         if lens is None: lens = _get_lengths(dataset)
---> 77         if lens is None: lens = [len(o) for o in self.items]
     78         self.lens = ReindexCollection(lens, idxs=self.items.idxs)
     79         # The "-1" is to allow for final label, we throw away the end that's less than bs

~/code/tgalery/fastai/fastai/text/data.py in <listcomp>(.0)
     75         self.seq_len = seq_len
     76         if lens is None: lens = _get_lengths(dataset)
---> 77         if lens is None: lens = [len(o) for o in self.items]
     78         self.lens = ReindexCollection(lens, idxs=self.items.idxs)
     79         # The "-1" is to allow for final label, we throw away the end that's less than bs

TypeError: object of type 'PosixPath' has no len()

The funny thing is that _get_lengths(dataset) on line 76 above should have worked, as it is defined like this:

def _get_lengths(ds):
    tok = _get_tokenizer(ds)
    if tok is None: return
    return tok.get_lengths(ds.items)

So in a way, it seems that the tokenizer passed in the blocks is not being set up properly (which we can see from the empty pipeline statements above). Any pointers?
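
The control flow described here can be sketched in plain Python. Everything in the block is hypothetical (`ToyTokenizer`, `find_tokenizer`, `lengths_or_fallback` and the precomputed lengths are invented for illustration and are not fastai's implementation). It only mirrors the pattern quoted above: if no tokenizer is found in the pipeline, the length lookup returns nothing and the loader falls back to calling `len()` on the raw items.

```python
from pathlib import Path


class ToyTokenizer:
    """Hypothetical tokenizer that knows precomputed token counts per item."""

    def __init__(self, lengths):
        self.lengths = lengths

    def get_lengths(self, items):
        return [self.lengths[o] for o in items]


def find_tokenizer(transforms):
    """Return the first ToyTokenizer in the pipeline, or None if there is none."""
    for tfm in transforms:
        if isinstance(tfm, ToyTokenizer):
            return tfm
    return None


def lengths_or_fallback(items, transforms):
    """Ask the tokenizer for lengths if one exists, else call len() on each raw item."""
    tok = find_tokenizer(transforms)
    if tok is not None:
        return tok.get_lengths(items)
    # Fallback analogous to line 77 of the traceback.
    return [len(o) for o in items]


items = [Path("train/file1.txt"), Path("train/file2.txt")]
tok = ToyTokenizer({items[0]: 3, items[1]: 5})

# Pipeline containing a tokenizer: lengths come from the tokenizer.
with_tok = lengths_or_fallback(items, [tok])

# Empty pipeline (like the empty "Setting up Pipeline:" lines): fallback hits len(Path).
try:
    lengths_or_fallback(items, [])
    fallback_error = None
except TypeError as exc:
    fallback_error = str(exc)

print(with_tok)
print(fallback_error)
```

Under these assumptions, an empty pipeline and the `PosixPath` error are two symptoms of the same thing: no tokenizer is registered where the loader looks for one.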

ohmeow commented 2 years ago

If you want to post a gist I can get at, I can see if I can help you get things working.

tgalery commented 2 years ago

Thanks for this. I've created this gist https://gist.github.com/tgalery/fa0de7b0c69ab48534b26a9151676fc1 and have uploaded a sample of texts extracted from the Portuguese wiki here https://drive.google.com/file/d/1IsgRFoFL4VGmQnb-oaQrTRzysRDMUYPV/view?usp=sharing (there are train / valid / test folders with some files).
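
One possible way to sidestep path-valued items for a folder layout like this is to read the files into strings up front and keep a flag marking the validation rows. The sketch below assumes that approach and is not necessarily what the working example mentioned later in this thread does. `read_split` and the record keys `text` / `is_valid` are hypothetical names, and the tiny dataset is generated on the fly so the code runs on its own.

```python
import tempfile
from pathlib import Path


def read_split(root, train="train", valid="valid"):
    """Read every .txt file under root/train and root/valid into a list of records."""
    records = []
    for folder, is_valid in ((train, False), (valid, True)):
        for txt_path in sorted((Path(root) / folder).glob("*.txt")):
            records.append(
                {"text": txt_path.read_text(encoding="utf-8"), "is_valid": is_valid}
            )
    return records


# Build a tiny throwaway dataset so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for folder, names in (("train", ["file1.txt", "file2.txt"]), ("valid", ["file1.txt"])):
        (root / folder).mkdir()
        for name in names:
            (root / folder / name).write_text(
                f"sample text from {folder}/{name}", encoding="utf-8"
            )
    records = read_split(root)

n_train = sum(not r["is_valid"] for r in records)
n_valid = sum(r["is_valid"] for r in records)
print(n_train, n_valid)
```

Records of this shape hold strings rather than paths, so anything downstream that calls `len()` or a tokenizer on the text field receives a sequence.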

tgalery commented 2 years ago

any updates ?

ohmeow commented 2 years ago

I haven't forgotten about you ... just busy with work :)

Will keep you posted on this

ohmeow commented 2 years ago

any updates ?

I have an example using your example dataset. Do you mind if I post it as an example of how to do this in the Blurr docs? (just want to check that there is nothing private/sensitive with the data). I won't include the datasets but the show_batch and show_results will show some examples.


Thanks - wg

ohmeow commented 2 years ago

I just updated the gh issue. Got a working example ... was wondering if you're ok with me including it as an example in the blurr docs (of your dataset, it just shows a couple of examples.)


Thanks - wg

tgalery commented 2 years ago

Yeah, I'm super ok with it. Once you have the link let me know so I can try it myself. Sorry for the delay, I was moving countries so I'm a bit thin on time.

ohmeow commented 2 years ago

Cool thanks ...

