uakarsh / latr

Implementation of LaTr: Layout-aware transformer for scene-text VQA, a novel multimodal architecture for Scene Text Visual Question Answering (STVQA)
https://uakarsh.github.io/latr/
MIT License

Questions about pretraining and fine tuning #7

Open kobrafarshidi opened 1 year ago

kobrafarshidi commented 1 year ago

Hi. Thanks for the great code. First of all, I am sorry if my questions are very simple and basic. While going through your code I ran into an error and had some questions. I'd appreciate it if you could help me with them.

1) Is LaTr TextVQA Training with WandB 💥 the final notebook for fine-tuning? If yes, should I run LaTr_PreTraining before starting it? If it is not needed, where is the pre-training used? (As far as I can tell, the pre-training is only commented on and never run, and its result does not appear in the fine-tuning notebook.) If it is needed, do I have to run LaTr_PreTraining first? Perhaps there is no pre-training step for the fine-tuning, or maybe I am wrong; if I am wrong, where is the pre-training used in the fine-tuning script?

2) Which lines of code train the final model (the fine-tuning step that takes a long time)?

3) How can I test my model and run predictions? Could you share that code?

4) I ran into an error while executing part of the LaTr TextVQA Training with WandB 💥 notebook, and I am pasting it below. Do you have any suggestions? It is worth mentioning that I am running the code in Colab, because I cannot use Kaggle in my country, and I replaced the wandb setup with the code below. Could that be the reason I cannot see checkpoints and trainer.fit does not work?

%pip install -q wandb
import wandb
wandb.login()

The error occurs at the following line, and I think it is why I cannot see the training progress or checkpoints. I'd appreciate it very much if you could give me some guidance.

if __name__ == "__main__":
    main()
Downloading: "https://www.kaggleusercontent.com/kf/99663112/eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..GfuZWkqwWi9nROCTnAS3OQ.YowTb3CNlES2WS_F6BvOSrGs3uLWc2kSBkhElYUcndML0Feuiizdu8trA2e4aj_kdluv1nYlVpS3_86VaJfgSBtyJShQoB0CyxCqdvdMiKl4eQQdWUv2XrTBecEJPXupdFaElzr57CcRjpz35rueyDjf3GVJLznkpSdoyWwSxoxCACbUpS73PKWi97WHfPmEWQgXTDxT_Uno_Pau6fayKyzJ-vWrETzOA2Z6f1-i7umK48D7JBQacS2g_40dW8wIH34QsztCZhHOake7qZnXU_19qaFeDQCNldZ4HcGAmKMtqYI_NK_By370IZ6OHe5Q-mh1f_9SaZoXCzzgaNx4Wsw1THZgzSjZgP2dTLP6a4ZkjHFWiZdkl0azvmoCmSVVYbRdQ9_iI9sFvhUpDWj1bOlr-Zrq9gRi8ksaH9rIzrzk63x_fKPGphZKpxB_l_6iewdGt4yb3GB8kWyGrxBnsGvV5Ei7gTaqv9OAkSKTACMEKB-rj-T8HKtk3ktnEqGMCpHTpkB8RYE6EqYRPbnSYMShjZb12GSn5uYntLtcG7MUbQX-OMt0vzh9fag_zpCyO89K56jxZ6Q9kWdADG0C2T0nR8uC8vWUUBptWNc2tt6pcupcUO19kt7ddNHMbxajHym5AijizrfJbkqnujEodlHWc8C77PawpX2xUPvIlbSvhbdsRRyYfOFGLmZsDdKa.c9dgiKXE5w_-qo4J3He6Qw/models/epoch=0-step=34602.ckpt" to /root/.cache/torch/hub/checkpoints/epoch=0-step=34602.ckpt
Could not load checkpoint
uakarsh commented 1 year ago

Hi @kobrafarshidi, all the answers would be related to the notebook LaTr TextVQA Training with WandB.

  1. Actually, I focused more on the fine-tuning part / training from scratch on a given task without pre-training, since pre-training would be resource-intensive and I was not sure about reproducing the pre-training results from the dataset, but I have added reference code in LaTr_PreTraining.
  2. The entire LaTr TextVQA Training with WandB notebook is the fine-tuning code (if I understood your question correctly). And yes, it took around 6-8 hours on a GPU (I don't remember exactly); however, I have written the code so that you can switch to a TPU with a single line change. I guess Colab would really be a headache, since it doesn't allow background execution unless you have the Pro version.
  3. You can load the LightningModule's weights by using load_from_checkpoint, passing the URL you mentioned in point 4, and then use the trainer.predict function to get predictions. More on this can be found here (a small sketch is also included after this list). I guess it may be a bit complex right now, but you can surely ask me/the community for help with this.
  4. The reason you were not able to download the weights is that the URL of the weights keeps changing frequently, so the links expire soon. Currently it is this link. Hope this helps.
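For point 3, a minimal sketch of what that could look like (the LightningModule class name, checkpoint path, and dataloader below are placeholders, not code from the repo):

    import pytorch_lightning as pl

    # load the trained LightningModule from a downloaded checkpoint
    # ("LaTrForVQA" stands for whatever LightningModule the notebook defines)
    model = LaTrForVQA.load_from_checkpoint("path/to/checkpoint.ckpt")
    model.eval()

    # run prediction with the Trainer; val_loader is a DataLoader over the eval split
    trainer = pl.Trainer(gpus=1)
    predictions = trainer.predict(model, dataloaders=val_loader)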

Regards, Akarsh

kobrafarshidi commented 1 year ago

Hi Mr Akarsh, thank you so much for taking the trouble to answer all my questions. I know you are busy, and I appreciate that you took the time to respond quickly. Yes, all of them relate to the notebook LaTr TextVQA Training with WandB.

1) Your code is great and that is reasonable. You added the reference code LaTr_PreTraining, but for me it is vital to know how to connect pre-training and fine-tuning. I have no experience with connecting them and using the data from pre-training in the fine-tuning script, because in Colab you can only run one .ipynb at a time and they are two separate files, yet they are related. Then again, it seems Mr Furkan added a dataset of Amazon OCR ...

2) I totally understand

3, 4) Thank you so much. I will definitely follow all these instructions. Many thanks for checking.

With gratitude, Farshidi

uakarsh commented 1 year ago

Hi there,

  1. For the connection between pre-training and fine-tuning, you can visit this link. It simply goes like this: if you have initial pre-trained weights, you pass the location where the weights are saved as an argument; otherwise it is None. So, when initializing the Latr_for_finetuning object, it will load the weights if you have them saved, otherwise the model is trained from scratch. I guess this would be handy.

And, if there is an additional dataset, I have tried to write the create_feature function (here) in such a way that, if OCRs are available, you can pass them as an argument and the rest is handled by the function.
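A rough sketch of the first point above (not verbatim repo code; the argument name "address" is a placeholder, so check the actual signature of Latr_for_finetuning in the notebook):

    # Sketch only: pass the location of pre-trained weights when building the
    # fine-tuning model; if the argument is None, training starts from scratch.
    pretrained_weights = "path/to/pretrained_latr.pth"   # or None

    finetuning_model = Latr_for_finetuning(
        config,                       # model config used in the notebook
        address=pretrained_weights,   # weights are loaded only if this is not None
    )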

Regards,

kobrafarshidi commented 1 year ago

Hi again Mr Akarsh,

Thank you so much. I understand all your guidance and will follow it.

With gratitude

kobrafarshidi commented 1 year ago

Hi Mr @uakarsh, a couple of weeks ago I asked you how to use pre-training in the training_with_wandb file. I studied your guidance and followed it step by step, but when I try to apply the pre-training in a new block of the training_with_wandb file I run into some ambiguities and questions.

1) In the following lines of the training_with_wandb file you match ocr_json_df['image_id'] with json_df['image_id']. image_id exists in the training_with_wandb file but not in the Latr_pretrain file. How can I create it so that OCRs and images can be matched by id?

    curr_img = self.json_df.iloc[idx]['image_id']
    ocr_token = self.ocr_json_df[self.ocr_json_df['image_id'] == curr_img]['ocr_info'].values.tolist()[0]

2) In the training_with_wandb file, extracting the bounding boxes uses some mathematical functions in the following lines, for example rotation. Is it important to use them in the pre-training block?

    for entry in ocr_token:
        xmin, ymin, w, h, angle = (entry['bounding_box']['top_left_x'],
                                   entry['bounding_box']['top_left_y'],
                                   entry['bounding_box']['width'],
                                   entry['bounding_box']['height'],
                                   entry['bounding_box']['rotation'])
        xmin, ymin, w, h = resize_align_bbox([xmin, ymin, w, h], 1, 1, width, height)
        x_centre = xmin + (w / 2)
        y_centre = ymin + (h / 2)
        xmin, ymin = rotate([x_centre, y_centre], [xmin, ymin], angle)
        xmax = xmin + w
        ymax = ymin + h
        curr_bbox = [xmin, ymin, xmax, ymax]
        boxes.append(curr_bbox)
        words.append(entry['word'])

3) In the pre-training file, finding masked_boxes, masked_tokenized_words, and tokenized_words is not the last step; the next step builds LaTr_for_pretraining and then pre_training_model, and in the end we only have extracted_feat_from_t5 (that is, we no longer have img, boxes, tokenized_words, idx as in the wandb file). So how can I get them back after pre_training_model to use them in the next step, fine-tuning?

4) In the article the authors mention that in the pre-training step features are extracted from a PDF dataset by pre-training with T5, and then those features (boxes and word tokens) are used in fine-tuning. Did I understand that correctly? (The figure in the article suggests that fine-tuning does not reuse those features and instead processes all the images again.) It is very important for me to know this, and I am very grateful for your help and the time you spend on my questions.

uakarsh commented 1 year ago

Hi there,

  1. Actually, in the pre-training part I took a sample dataset, and hence the dataset definition differs between the pre-training and fine-tuning code. The simplest way to use the fine-tuning code for pre-training is to take the entire TextVQA dataset code from fine-tuning, put it in pre-training, and then just modify the __getitem__ code to introduce the line below (see the sketch after this list):

_, masked_boxes, masked_tokenized_words = apply_mask_on_token_bbox(boxes, tokenized_words), placed right after the line tokenized_words = torch.as_tensor(tokenized_words, dtype=torch.int32)

  2. For the rotation part: in the OCR info of the TextVQA file, there is a key corresponding to the bounding box, named spline and rotation. So I introduced the rotation and spline of the bounding box, but actually they were of no use, since rotation and spline deal with 3D coordinate space. Hence, unless the rotation and spline are stated to be in 2D space, it is okay to remove those parts.
  3. If you want to use the features in the fine-tuning stage, maybe you can refer to the code for training LaTr for question answering (the only difference is that the __getitem__ definition changes). If you want to extract the masked indices, you can refer to the function apply_mask_on_token_bbox and return the indices of the masked values; simply collect the variable temp into a list and return that list. Hope this helps.
  4. Yes, as far as I understood your statement, the authors do not use images for pre-training the model but do include them during fine-tuning.
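To make point 1 concrete, here is a rough sketch of how the fine-tuning dataset's __getitem__ could be extended for pre-training (the class name "TextVQADataset" and the order of the returned values are assumptions, not verbatim code from the notebook):

    import torch

    class PreTrainTextVQADataset(TextVQADataset):   # assumed fine-tuning dataset class
        def __getitem__(self, idx):
            # reuse the fine-tuning feature extraction; assumed return order
            img, boxes, tokenized_words = super().__getitem__(idx)[:3]
            tokenized_words = torch.as_tensor(tokenized_words, dtype=torch.int32)
            # masking step suggested above; per point 3, apply_mask_on_token_bbox can
            # also be modified to return the list of masked indices (the variable temp)
            _, masked_boxes, masked_tokenized_words = apply_mask_on_token_bbox(
                boxes, tokenized_words
            )
            return img, masked_boxes, masked_tokenized_words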

Regards,

kobrafarshidi commented 1 year ago

Hi Mr Akarsh, thank you so much for answering my questions. Some of them I now understand, but unfortunately some ambiguities have not been resolved for me yet. In the first question you suggest taking the entire TextVQA dataset code from fine-tuning and putting it in pre-training, but according to the article we do not use the TextVQA dataset for pre-training; we should use the IDL dataset for pre-training and then connect pre-training and fine-tuning. When I try to use IDL, I do not know how to find matching ids. My question is how to do that.

For the third question, do you mean that boxes, tokenized_words, and idx are equivalent to masked_boxes, masked_tokenized_words, and temp?

Gratefully,

uakarsh commented 1 year ago

Hi,

For the first question, I don't think you need to find anything; you can download the IDL dataset, then use the pre-training code as a reference for extracting the OCR for the whole dataset, and then mask it. Actually, if you are focusing on the pre-training part, I think matching the fine-tuning and pre-training code would only create confusion.

For the third question, what I was trying to say is that once you have extracted the features (i.e. from the create_features method), you can then use apply_mask_on_token_bbox on the extracted features for masking and performing pre-training.

kobrafarshidi commented 1 year ago

Hi, I am so thankful that you give your valuable time to my questions. You are right that matching the fine-tuning and pre-training code could create confusion, but it is the main point of the LaTr research: if we want to use this research, verify it, or even improve it, we should use both of them together. I have already mentioned that this is vital for me to know. If your IDL dataset is simple, what do you think about using idl_Amazon_ocr (the same link that Mr Furkan gave in another issue)? I checked it for use in pre-training, but it is still confusing to me how to match the image id and the OCR id for running the line ocr_json_df[self.ocr_json_df['image_id'] == json_df.iloc[idx]['image_id']]['ocr_info'], and I don't know what ocr_info is in this dataset. (I attach a picture of this JSON for you; of course, I may be completely wrong.)

uakarsh commented 1 year ago

Hi,

I am not able to open the link. But I think the essence is to write a function that can read the bounding boxes and words for a given PDF and then pass them to the create_features function for extracting the bounding boxes. Maybe this helps.

Regards, Akarsh

kobrafarshidi commented 1 year ago

Hi Mr @uakarsh, thank you so much for taking the time to respond to me. Actually my problem is the step of matching the image_id of the images with the image_id of the OCR tokens. I would like to know your opinion about it. Best regards

uakarsh commented 1 year ago

I think in that case you may have to find a way to create a CSV file in which there is an image entry and the corresponding OCR path for that image id. But I guess this is not the case with your dataset. So, is it possible for you to read the file (the one you mentioned in your previous reply to this thread) and access the TextDetections part for the corresponding id? If so, I think the problem is solved.
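As a rough illustration of the CSV idea (the directory names and the one-OCR-file-per-image layout are hypothetical, not the actual structure of the Amazon OCR dataset):

    # Hypothetical sketch: build a CSV mapping each image_id to the path of its OCR file.
    import os
    import pandas as pd

    image_dir = "train_images"       # placeholder
    ocr_dir = "amazon_ocr_json"      # placeholder

    rows = []
    for fname in os.listdir(image_dir):
        image_id = os.path.splitext(fname)[0]
        ocr_path = os.path.join(ocr_dir, image_id + ".json")
        if os.path.exists(ocr_path):
            rows.append({"image_id": image_id, "ocr_path": ocr_path})

    pd.DataFrame(rows).to_csv("image_to_ocr.csv", index=False)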

kobrafarshidi commented 1 year ago

In the TextDetections part I only see an Id, if that is the id you mean. I think it is the index of a detected text within one image; for example, one image has 4 id numbers and another image has 10, so this id is not per image (I attach an example). We want one image id for each image. On the other hand, we do not have the PDFs, we only have the JSON file. What do you think about it?

uakarsh commented 1 year ago

I am really not sure how to proceed unless I get a few samples. But what I can understand is that you need to do something so that the image_id and its corresponding OCR can be extracted from the dataset.

kobrafarshidi commented 1 year ago

Hi Mr @uakarsh, I want to run my project in a Jupyter notebook with a GPU because Colab crashed, so I changed my system. But when I run on a GTX 1080 GPU with 8 GB of memory, I get an out-of-memory error. Someone advised me to change the versions of PyTorch, CUDA, and Python. May I ask which versions of PyTorch, Python, and CUDA are best for running it?

uakarsh commented 1 year ago

Hi,

I guess versions won't play a role here, but the following things could help.

While constructing the pytorch lightning trainer object, you can do the following:

  1. Auto tune the batch size, so that lightning can automatically find the best batch size for you
  2. Mixed precision
  3. Accumulate gradients

and many other tricks. All of these can be found on the pytorch lightning Trainer page. This is the reason why I use pytorch lightning.
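For example, a sketch of those options on a pytorch-lightning 1.x Trainer (argument names can differ slightly between versions, and `model` stands for the LightningModule from the notebook):

    import pytorch_lightning as pl

    trainer = pl.Trainer(
        gpus=1,
        precision=16,                    # mixed precision
        accumulate_grad_batches=4,       # simulate a larger effective batch size
        auto_scale_batch_size="power",   # let the tuner find a batch size that fits
    )
    trainer.tune(model)   # runs the batch-size finder (needs a batch_size attribute on the model/datamodule)
    trainer.fit(model)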

Hope it helps

kobrafarshidi commented 1 year ago

Hi, that's right. I will test them. Thank you so much for your guidance. Best regards

kobrafarshidi commented 1 year ago

Hi Mr @uakarsh, as you suggested, I applied those settings (auto batch size, precision, ...), but in the end I still got an out-of-memory error. I think the model parameters are very large and my GPU has only 8 GB, so I decided to run the program on a system with 2 GPUs. Now, if I want to run this program on 2 GPUs, what changes should I make to the LaTr TextVQA Training with WandB 💥 code? Based on what I have read, it should be enough to set gpus=2 in the pl.Trainer parameters, but with that setting I get an error. I would be grateful if you could show me a solution. Many thanks.

uakarsh commented 1 year ago

What was the error when you ran on 2 GPUs?

kobrafarshidi commented 1 year ago

Hi, the error is "To use CUDA with multiprocessing, you must use the 'spawn' start method". When I hit this error I set accelerator=ddp_spawn, but in the source code this parameter is invalid, so I changed it to strategy=ddp_spawn, and then I got an error that this strategy is incompatible with this source ...

uakarsh commented 1 year ago

I think, maybe this link would be helpful. Link: https://pytorch-lightning.readthedocs.io/en/1.4.0/advanced/multi_gpu.html

Maybe the ddp_spawn strategy is not applicable for the resources that you are using
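For reference, a minimal sketch of the usual two-GPU setup (the spelling of the argument depends on the installed pytorch-lightning version, so treat this as an assumption to verify locally):

    import pytorch_lightning as pl

    # pytorch-lightning <= 1.4:  pl.Trainer(gpus=2, accelerator="ddp")
    # pytorch-lightning >= 1.5:  pl.Trainer(gpus=2, strategy="ddp")
    trainer = pl.Trainer(gpus=2, strategy="ddp")
    trainer.fit(model)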

kobrafarshidi commented 1 year ago

Hi, thank you so much for your response. I had read this guide; the author pays attention to three points: 1) Init tensors using type_as and register_buffer, 2) Make models pickleable, 3) Select GPU devices, Distributed Data Parallel.

1, 2) I really don't know how to do these. If I'm not mistaken, the tensors in the source code are img, questions, answers, and tokenizers. I ran the following and got an error; I think I did it wrong:

    a = torch.Tensor(3, 384, 500)
    img = a.type_as(img)
    b = torch.Tensor(256, 6)
    boxes = b.type_as(boxes)
    c = torch.Tensor(256)
    tokenized_words = c.type_as(tokenized_words)
    d = torch.Tensor(512)
    question = d.type_as(question)
    e = torch.Tensor(512)
    answer = e.type_as(answer)
    f = torch.Tensor(0)
    idx = f.type_as(torch.as_tensor(idx))

3) For this I tried different kinds of settings; most of them were incompatible, but with one of them the run hung in trainer.fit and only showed local_rank: 0 ..., and I don't know the reason. I attach a screenshot of it.
