morganmcg1 opened 1 year ago
Yes, there were more …
I've trained this model for a few days on a Colab Standard GPU. I only had limited time and GPU resources for this experiment, and I applied for a HF community grant but have not heard back.
At the time the latest checkpoint was pushed, the model was nowhere close to convergence. I hope that if people are interested in it, someone else will chip in with further training and bug fixes.
It can be argued that a dual visual-language encoder is a more popular architecture for this task (open-vocabulary object detection). My rationale was that if we can get the model to learn bounding boxes, then it can be extended to downstream tasks such as action classification plus bounding box, and even input values, in one inference pass instead of using a separate model for each task. Further, it could potentially learn the reciprocal task: producing an unambiguous component referring expression given a bounding box and a screenshot. All just theories for further research. :)
Love it. I'm spending a little time on it in my spare time; right now I'm just verifying the data processing steps and pulling a bunch of it out of the PyTorch dataset class to simplify things a little. I'll share some work when I have it. For the unk tokens above, I'm adding them to the tokenizer to see if it makes a difference. I'm guessing it won't, but it nags me that there might be some coordinates that the model will ignore :D
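For anyone following along, a rough sketch of what "adding them to the tokenizer" could look like. The helper names, the 3-decimal precision, and the `model.decoder` attribute are assumptions for illustration, not the actual project code:

```python
# Sketch: make every normalized coordinate string an explicit token so none
# of them fall back to <unk>. Precision and helper names are assumptions.

def coordinate_tokens(precision: int = 3):
    """All normalized coordinate strings from 0.000 to 1.000 at `precision` decimals."""
    steps = 10 ** precision
    return [f"{i / steps:.{precision}f}" for i in range(steps + 1)]

def extend_tokenizer(tokenizer, model=None, precision: int = 3):
    """Add missing coordinate tokens, then resize the decoder embeddings,
    which Hugging Face models require after tokenizer.add_tokens()."""
    added = tokenizer.add_tokens(coordinate_tokens(precision))
    if model is not None and added > 0:
        model.decoder.resize_token_embeddings(len(tokenizer))
    return added
```

Whether the extra ~1000 embeddings help or hurt a model this small is an open question, but at least no coordinate can silently collapse to `<unk>`.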
Sounds great. Happy to collaborate. Feel free to submit a PR if you like. I should be able to review within a day or two.
BTW, …
In fact, this reminds me that the current training assumes only positive samples: every dataset sample has a bounding box. In reality, a refexp may be nonsensical, in which case the model should respond with …
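To make the negative-sample idea concrete, here is one rough way to mix them in: pair a screenshot with a referring expression drawn from a different screenshot and train the model to emit a special "no object" answer. The `NO_OBJECT` string, the ratio, and the sample dict keys are all made up for illustration:

```python
import random

NO_OBJECT = "<no_object>"  # hypothetical target for refexps with no match

def make_negatives(samples, ratio=0.25, seed=0):
    """Build extra negatives: (image of A, refexp from B, no-object target).

    Caveat: a refexp borrowed from another screenshot could coincidentally
    still match, so real training data would need a sanity filter on top.
    """
    rng = random.Random(seed)
    negatives = []
    for sample in samples:
        if rng.random() >= ratio:
            continue
        # pick a refexp from a *different* screenshot
        candidates = [s for s in samples if s["image_id"] != sample["image_id"]]
        other = rng.choice(candidates)
        negatives.append({
            "image_id": sample["image_id"],
            "refexp": other["refexp"],
            "target": NO_OBJECT,  # instead of a bounding box
        })
    return negatives
```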
Yet another related issue I just remembered: some samples in the synthetic RICO SCA dataset (starting at around index 16,000) use referring expressions that point to text labels occurring multiple times in the screenshot. We can look at ways of removing them so they don't confuse the model, or potentially allow the model to predict multiple bounding boxes.
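A first-pass filter for those ambiguous samples could look like the sketch below. The field names (`screen_labels`, `target_text`) are guesses at a processed-dataset layout, not the actual RICO SCA schema:

```python
from collections import Counter

def drop_ambiguous(samples):
    """Keep only samples whose target text appears at most once among the
    screenshot's text labels, so a text-based refexp has a unique referent."""
    kept = []
    for sample in samples:
        counts = Counter(label.lower() for label in sample["screen_labels"])
        if counts[sample["target_text"].lower()] <= 1:
            kept.append(sample)
    return kept
```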
Oh yeah, negatives are a good idea.
I opened a PR on transformers to fix training with batch sizes > 1 which got merged:
I also opened one on the HF Hub to correct the normalization in image_processor, as the original authors used ImageNet normalization from what I can see:
I'm going to do all the image processing without the image_processor, as it was also messing up the bounding box positions when images were resized.
- Fix for non-contiguous label tensors in VisonEncoderDecoder huggingface/transformers#21582
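For context on what the PR title refers to: `Tensor.view()` requires a contiguous tensor and raises a `RuntimeError` otherwise, while `reshape()` copies when needed. A minimal standalone reproduction of that distinction (not the actual transformers code, just the underlying PyTorch behavior):

```python
import torch

# A transpose produces a non-contiguous tensor, the same situation that can
# arise for label tensors when batch size > 1.
labels = torch.arange(6).reshape(2, 3).t()
assert not labels.is_contiguous()

try:
    labels.view(-1)        # view() on a non-contiguous tensor raises
    crashed = False
except RuntimeError:
    crashed = True

flat = labels.reshape(-1)  # reshape() handles it by copying if necessary
```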
So cool. I ran into it but was not sure how to fix it correctly, and used grad_accumulation_batches as a workaround.
> I'm going to do all the image processing without the image_processor as it was also messing up the bounding box positions when images were resized
So glad you caught this. I was suspicious the resizing left out certain portions of the screenshots in some cases.
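Keeping the boxes in lockstep with a manual resize is straightforward; a sketch, assuming pixel-space `(xmin, ymin, xmax, ymax)` boxes (the actual project may store them normalized instead):

```python
def resize_box(box, orig_size, new_size):
    """Scale a pixel-space (xmin, ymin, xmax, ymax) box from an image of
    orig_size (w, h) to one of new_size (w, h)."""
    (ow, oh), (nw, nh) = orig_size, new_size
    sx, sy = nw / ow, nh / oh
    xmin, ymin, xmax, ymax = box
    return (xmin * sx, ymin * sy, xmax * sx, ymax * sy)
```

Note that padding-based resizes (letterboxing) would additionally need an offset, which is exactly the kind of detail an opaque image_processor can get wrong silently.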
Hey! Enjoying playing around with your notebooks, thanks for sharing!
Have you noticed that the tokenizer seems to struggle with some bounding box values and will return an `<unk>` token?
will return:
Note the "unk" token
Checking down to 3 decimal places, these input strings all return an `<unk>` token
Getting the exact problematic parts of the string:
It struggles with: 0.1, 0.121, 0.131, 0.161, 0.171, 0.181, 0.191
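A systematic probe like the sketch below can enumerate every affected coordinate rather than spot-checking. It assumes only the standard Hugging Face tokenizer surface (`encode()` with `add_special_tokens` and `unk_token_id`); the helper itself is illustrative:

```python
def find_unk_coordinates(tokenizer, precision: int = 3):
    """Return every normalized coordinate string at `precision` decimals
    whose encoding contains the tokenizer's unknown-token id."""
    steps = 10 ** precision
    bad = []
    for i in range(steps + 1):
        text = f"{i / steps:.{precision}f}"
        ids = tokenizer.encode(text, add_special_tokens=False)
        if tokenizer.unk_token_id in ids:
            bad.append(text)
    return bad
```

Running this once over the tokenizer would tell us exactly which coordinate values the model can never emit, which is the list worth feeding into `add_tokens`.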