herobd / dessurt

Official implementation for Dessurt
MIT License

trainer.py file #4

Closed ArsalanAli915 closed 1 year ago

ArsalanAli915 commented 1 year ago

/content/dessurt/train.py in main(rank, config, resume, world_size)
    136     print("Begin training")
    137     #warnings.filterwarnings("error")
--> 138     trainer.train()
    139
    140

/content/dessurt/base/base_trainer.py in train(self)
    352
    353         if result is None:
--> 354             result = self._train_iteration(self.iteration)
    355         #if self.retry_count>1:
    356         #    print('Failed all {} times!'.format(self.retry_count))

/content/dessurt/trainer/qa_trainer.py in _train_iteration(self, iteration)
    130         batch_idx = (iteration-1) % len(self.data_loader)
    131         try:
--> 132             thisInstance = self.data_loader_iter.next()
    133         except StopIteration:
    134             self.data_loader_iter = iter(self.data_loader)

AttributeError: '_MultiProcessingDataLoaderIter' object has no attribute 'next'

I am facing the above error when fine-tuning the Dessurt model.

herobd commented 1 year ago

Hmm, this seems very odd, as an object returned by iter() should have next(). Maybe try running it without any dataloader multiprocessing to see if that changes things: in the config, set data_loader->num_workers=0.
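If you're driving train.py from a notebook the way the Colab demo does, a minimal sketch of that change (the config filename is just the example used later in this thread; the only setting that matters is data_loader -> num_workers):

    import json

    # Load the training config and disable dataloader multiprocessing.
    with open('configs/cf_dessurt_mnist.json') as f:  # example config name
        config = json.load(f)
    config['data_loader']['num_workers'] = 0  # no worker processes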

ArsalanAli915 commented 1 year ago

I will try this and see if it works. Thanks for helping.

ArsalanAli915 commented 1 year ago

I have tried keeping the number of workers at 0. Then it gives a similar error: the single-process dataloader iterator object also has no attribute 'next'. Is it necessary to fine-tune on GPUs? Maybe something has been upgraded. Please check.

herobd commented 1 year ago

What are you fine-tuning? Is it a custom data object? Could you share the config JSON?

ArsalanAli915 commented 1 year ago

config file

ArsalanAli915 commented 1 year ago

I found this on Stack Overflow. Instead of:

    dataiter = iter(dataloader)
    data = dataiter.next()

you need to use the following, and it works perfectly:

    dataiter = iter(dataloader)
    data = next(dataiter)
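In the repo, the same change would presumably land on the line shown in the traceback above; a minimal sketch (the re-fetch after StopIteration is my assumption, not necessarily the exact patch):

    # trainer/qa_trainer.py, _train_iteration: use the next() builtin, which works
    # with both older and newer PyTorch DataLoader iterators.
    try:
        thisInstance = next(self.data_loader_iter)
    except StopIteration:
        self.data_loader_iter = iter(self.data_loader)
        thisInstance = next(self.data_loader_iter)  # assumed retry; adapt to the actual code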

herobd commented 1 year ago

Huh, that must be a PyTorch update. I'm glad you found the solution.

ArsalanAli915 commented 1 year ago

I have prepared my own dataset, and I am using the MyDataset class to fine-tune on it, as mentioned in example_data.

ArsalanAli915 commented 1 year ago

Yeah, please let me know when you update it so that I can move forward with training. I would also like to ask some questions: does this model work well on handwritten medical prescriptions? I have read the paper, where it mentions that the results are not satisfactory. I am doing research, so please let me know so that I can move ahead. Thanks.

herobd commented 1 year ago

I've fixed the .next() issue.

You'd really have to test it on your data to know. It was primarily trained with IAM handwriting data, which is pretty clean, so it may struggle if the handwriting is cramped. But if you have enough data it may learn the new handwriting style.

ArsalanAli915 commented 1 year ago

I will try it and let you know.

ArsalanAli915 commented 1 year ago

Thanks for the update. Now it works fine. However, I have found a problem when evaluating the model. I have attached an image containing a screenshot of the error. Please help me with this.

herobd commented 1 year ago

I don't see a screenshot.

ArsalanAli915 commented 1 year ago

This is the error I get when evaluating the weights that I loaded:

    iteration 0 unspecified dataset: MultipleDataset getting data ready

ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-...> in <module>
      1 from qa_eval import main as evaluate
      2 #Normally you'd use python qa_eval.py -c saved/dessurt_mnist/model_best.pth -g 0
----> 3 evaluate(checkpoint,None, 0 if gpu else None, run=True)

4 frames
/content/dessurt/qa_eval.py in main(resume, data_set_name, gpu, config, addToConfig, test, verbose, run, smaller_set, eval_full, ner_do_before)
    417
    418     print('getting data ready')
--> 419     data_loader, valid_data_loader = getDataLoader(data_config,'train' if not test else 'test')
    420
    421     if test:

/content/dessurt/data_loader/data_loaders.py in getDataLoader(config, split, rank, world_size)
     34
     35     if data_set_name=='MultipleDataset':
---> 36         from data_sets import multiple_dataset
     37         config['data_loader']['super_computer']=config['super_computer']
     38         config['validation']['super_computer']=config['super_computer']

/content/dessurt/data_sets/multiple_dataset.py in <module>
     14
     15 from .qa import collate
---> 16 from .synth_form_dataset import SynthFormDataset
     17 from .synth_para_qa import SynthParaQA
     18 from .funsd_qa import FUNSDQA

/content/dessurt/data_sets/synth_form_dataset.py in <module>
      9
     10 from .form_qa import FormQA, collate, Entity, Line, Table
---> 11 from .gen_daemon import GenDaemon
     12
     13

/content/dessurt/data_sets/gen_daemon.py in <module>
      2 from utils.util import ensure_dir
      3 import threading
----> 4 from synthetic_text_gen import SyntheticWord
      5 from data_sets import getWikiArticle, getWikiDataset
      6 import os, random, re, time

ModuleNotFoundError: No module named 'synthetic_text_gen'


NOTE: If your import is failing due to a missing package, you can manually install dependencies using either !pip or !apt.

herobd commented 1 year ago

The import error is fixed by installing https://github.com/herobd/synthetic_text_gen. However, you shouldn't be evaluating with MultipleDataset (unless you're doing something unusual). It should be MyDataset, right?
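For the Colab setup, one possible way to get the import working (this assumes synthetic_text_gen is pip-installable straight from GitHub; if it is not packaged, cloning it and putting it on PYTHONPATH should also work):

    # Hypothetical notebook cell -- adjust to however you manage dependencies:
    # !pip install git+https://github.com/herobd/synthetic_text_gen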

ArsalanAli915 commented 1 year ago

Yes, that's correct. Thanks for commenting. I solved that problem, though another problem I am facing is that my images are 2334 x 1646. When I use this size I get an error, whereas it was working fine before when I used 96 x 96, which is the size you mentioned in the MNIST dataset demo. Please let me know how I can solve this. This is the error I received.

RuntimeError                              Traceback (most recent call last)
<ipython-input-...> in <module>
      7
      8 #Normally you'd use python train.py -c configs/cf_dessurt_mnist.json -r saved/dessurt_mnist/checkpoint-latest.pth
----> 9 train(None,config,checkpoint)

3 frames
/content/dessurt/train.py in main(rank, config, resume, world_size)
     83         config['supercomputer'] = '{}{}'.format(config['name'],rank)
     84
---> 85     model = eval(config['arch'])(config['model'])
     86
     87     if config.get('PRINT_MODEL',False):

/content/dessurt/model/dessurt.py in __init__(self, config)
    175             use_auto_here = use_auto[len(self.swin_layers)]
    176             self.swin_layers.append( nn.ModuleList( [
--> 177                 SwinTransformerBlock(dim=d_im,
    178                     input_resolution=cur_resolution,
    179                     num_heads=swin_nhead[level],

/content/dessurt/model/swin_transformer.py in __init__(self, dim, input_resolution, num_heads, window_size, shift_size, mlp_ratio, qkv_bias, qk_scale, drop, attn_drop, drop_path, act_layer, norm_layer, sees_docq)
    257                 cnt += 1
    258
--> 259         mask_windows = window_partition(img_mask, self.window_size)  # nW, window_size, window_size, 1
    260         mask_windows = mask_windows.view(-1, self.window_size * self.window_size)
    261         attn_mask = mask_windows.unsqueeze(1) - mask_windows.unsqueeze(2)

/content/dessurt/model/swin_transformer.py in window_partition(x, window_size)
     42     """
     43     B, H, W, C = x.shape
---> 44     x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
     45     windows = x.permute(0, 1, 3, 2, 4, 5).contiguous().view(-1, window_size, window_size, C)
     46     return windows

RuntimeError: shape '[1, 17, 12, 24, 12, 1]' is invalid for input of size 59450

herobd commented 1 year ago

Sorry, this note is buried in the README: the Swin implementation requires (image size / 8) % window_size == 0 (the 8 comes from the 4x downsample from the CNN and the 2x downsample from the Swin downsampling).

You should just need to adjust your image height and width so that both are multiples of 8 and, when divided by 8, are multiples of your window_size (12?).
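In other words, each side of the image has to be a multiple of 8 * window_size. A small sketch of picking the nearest valid size (assuming a window_size of 12, so the multiple is 96; the helper name is just for illustration):

    # Round a dimension down to the nearest value satisfying (size / 8) % window_size == 0,
    # i.e. the nearest multiple of 8 * window_size.
    def nearest_valid_size(size, window_size=12):
        multiple = 8 * window_size  # 96 when window_size is 12
        return (size // multiple) * multiple

    print(nearest_valid_size(2334), nearest_valid_size(1646))  # -> 2304 1632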

herobd commented 1 year ago

I've added that bit somewhere more obvious in the README so hopefully it doesn't trip up someone else.

ArsalanAli915 commented 1 year ago

Thanks, I'll take a look.

ArsalanAli915 commented 1 year ago

I have tried to adjust the size. Now the size is [1640, 2320]; both the height and width are multiples of 8. Yet I get the same error.

herobd commented 1 year ago

What is the swin window size in the config?

ArsalanAli915 commented 1 year ago

I have resolved the size error. The following error occurs now that I have increased the input image size. I have also tried reducing the batch size to 1, but I still get this error on Colab.


OutOfMemoryError                          Traceback (most recent call last)
<ipython-input-...> in <module>
      7
      8 #Normally you'd use python train.py -c configs/cf_dessurt_mnist.json -r saved/dessurt_mnist/checkpoint-latest.pth
----> 9 train(None,config,checkpoint)

12 frames
/content/dessurt/train.py in main(rank, config, resume, world_size)
    136     print("Begin training")
    137     #warnings.filterwarnings("error")
--> 138     trainer.train()
    139
    140

/content/dessurt/base/base_trainer.py in train(self)
    352
    353         if result is None:
--> 354             result = self._train_iteration(self.iteration)
    355         #if self.retry_count>1:
    356         #    print('Failed all {} times!'.format(self.retry_count))

/content/dessurt/trainer/qa_trainer.py in _train_iteration(self, iteration)
    153                 losses, run_log, out = self.run(thisInstance)
    154             else:
--> 155                 losses, run_log, out = self.run(thisInstance)
    156         #t#self.opt_history['full run'].append(timeit.default_timer()-tic)#t#
    157

/content/dessurt/trainer/qa_trainer.py in run(self, instance, get, forward_only, valid, run)
    349             pred_a, target_a, string_a, pred_mask, pred_logits = self.model(image,questions,answers,get_logits=True)
    350         else:
--> 351             pred_a, target_a, string_a, pred_mask = self.model(image,questions,answers)
    352
    353

/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1192         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1193                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194             return forward_call(*input, **kwargs)
   1195         # Do not call functions when jit is used
   1196         full_backward_hooks, non_full_backward_hooks = [], []

/content/dessurt/model/dessurt.py in forward(self, image, questions, answers, RUN, get_tokens, distill, get_logits)
    419
    420         #textual token update
--> 421         qa_tokens = autoregr_layer(
    422             qa_tokens,
    423             torch.cat((proj_im_tokens,qa_tokens),dim=1),

/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1192         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1193                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194             return forward_call(*input, **kwargs)
   1195         # Do not call functions when jit is used
   1196         full_backward_hooks, non_full_backward_hooks = [], []

/content/dessurt/model/dessurt.py in forward(self, query_tokens, all_tokens, full_mask, all_padding_mask)
    770         """
    771
--> 772         response = self.self_attn(query_tokens, all_tokens, all_tokens,
    773             mask=full_mask,
    774             key_padding_mask=all_padding_mask,

/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1192         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1193                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194             return forward_call(*input, **kwargs)
   1195         # Do not call functions when jit is used
   1196         full_backward_hooks, non_full_backward_hooks = [], []

/content/dessurt/model/attention.py in forward(self, query, key, value, mask, key_padding_mask)
    120         # 1) Do all the linear projections in batch from d_model => h x d_k
    121         query, key, value = \
--> 122             [l(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
    123              for l, x in zip(self.linears, (query, key, value))]
    124

/content/dessurt/model/attention.py in <listcomp>(.0)
    120         # 1) Do all the linear projections in batch from d_model => h x d_k
    121         query, key, value = \
--> 122             [l(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
    123              for l, x in zip(self.linears, (query, key, value))]
    124

/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1192         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1193                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194             return forward_call(*input, **kwargs)
   1195         # Do not call functions when jit is used
   1196         full_backward_hooks, non_full_backward_hooks = [], []

/usr/local/lib/python3.8/dist-packages/torch/nn/modules/linear.py in forward(self, input)
    112
    113     def forward(self, input: Tensor) -> Tensor:
--> 114         return F.linear(input, self.weight, self.bias)
    115
    116     def extra_repr(self) -> str:

OutOfMemoryError: CUDA out of memory. Tried to allocate 184.00 MiB (GPU 0; 14.76 GiB total capacity; 13.17 GiB already allocated; 45.75 MiB free; 13.47 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

herobd commented 1 year ago

Sorry, that's just a hardware limitation. You should be able to use a smaller image size though, as the model was trained with 1152 x 768 images.
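As the error text itself suggests, fragmentation can sometimes be mitigated with max_split_size_mb; a sketch of setting it before any CUDA work happens (note this only helps fragmentation, it cannot work around a genuine capacity limit, so reducing the image size is still the practical fix):

    import os

    # Must be set before the first CUDA allocation (e.g. at the very top of the notebook);
    # the 128 MiB value is just an example.
    os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:128'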

ArsalanAli915 commented 1 year ago

Is there any solution if the image size is greater than 1152 x 768?

herobd commented 1 year ago

You either need to reduce the image size or get a bigger GPU.

ArsalanAli915 commented 1 year ago

Could you please tell me what metric you were using for the handwriting dataset?

Also, how is Dessurt better than Donut, since Donut is an end-to-end model?

herobd commented 1 year ago

I was using word error and character error. Dessurt and Donut are both end-to-end models. Dessurt was better than the initial Donut release on arXiv, but the published Donut is better than Dessurt.
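For reference, word error rate (WER) and character error rate (CER) are the usual Levenshtein-based metrics; a self-contained sketch (not the repo's evaluation code):

    # Edit distance between two sequences (lists of words, or strings of characters).
    def edit_distance(ref, hyp):
        d = list(range(len(hyp) + 1))
        for i, r in enumerate(ref, 1):
            prev, d[0] = d[0], i
            for j, h in enumerate(hyp, 1):
                prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
        return d[len(hyp)]

    def cer(reference, prediction):
        return edit_distance(reference, prediction) / max(1, len(reference))

    def wer(reference, prediction):
        return edit_distance(reference.split(), prediction.split()) / max(1, len(reference.split()))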