luogen1996 / LLaVA-HR

LLaVA-HR: High-Resolution Large Language-Vision Assistant
Apache License 2.0

requests.exceptions.ConnectTimeout: (MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /timm/convnext_xxlarge.clip_laion2b_soup_ft_in1k/resolve/main/pytorch_model.bin (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f48e9d64670>, 'Connection to huggingface.co timed out. (connect timeout=10)'))"), '(Request ID: 52084a1b-5e0c-4d52-bcfb-41bf34c52df6)') #8

Closed gapjialin closed 5 months ago

gapjialin commented 5 months ago

Hello! I have downloaded convnext_xxlarge.clip_laion2b_soup locally and changed the load path in train_eval_llava_hr_x.sh to --vision_tower_slow /home/LLaVA-HR/convnext_xxlarge.clip_laion2b_soup \. But it still reports the error: requests.exceptions.ConnectTimeout: (MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /timm/convnext_xxlarge.clip_laion2b_soup_ft_in1k/resolve/main/pytorch_model.bin (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f48e9d64670>, 'Connection to huggingface.co timed out. (connect timeout=10)'))"), '(Request ID: 52084a1b-5e0c-4d52-bcfb-41bf34c52df6)'). Where is the problem?

luogen1996 commented 5 months ago

This means there is a problem with the network connection. You may need to connect to a VPN or try again.

gapjialin commented 5 months ago

This means there is a problem with the network connection. You may need to connect to a VPN or try again.

I want to try to load convnext_xxlarge.clip_laion2b_soup locally, I have downloaded it locally, can you give me some guidance?

luogen1996 commented 5 months ago

This is because your network cannot connect to the Hugging Face server. As far as I know, even when loading weights locally, timm may still try to reach huggingface.co. I recommend using a VPN such as pigcha.
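For readers hitting the same timeout, here is a minimal sketch of two common workarounds, assuming the checkpoint has already been downloaded. The local path and environment variables are illustrative, not part of LLaVA-HR's own scripts:

```python
# Sketch: load the timm ConvNeXt tower without contacting huggingface.co.
# Assumes pytorch_model.bin was downloaded beforehand; the path is an example.
import os

# Option 1: tell huggingface_hub/transformers to use only the local cache.
os.environ["HF_HUB_OFFLINE"] = "1"
os.environ["TRANSFORMERS_OFFLINE"] = "1"

import timm

# Option 2: point timm directly at the local weight file, so no network
# request is made at all.
model = timm.create_model(
    "convnext_xxlarge.clip_laion2b_soup_ft_in1k",
    pretrained=True,
    pretrained_cfg_overlay=dict(
        file="/home/LLaVA-HR/convnext_xxlarge.clip_laion2b_soup/pytorch_model.bin"
    ),
)
```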

gapjialin commented 5 months ago

Thanks for the answer! I would also like to ask a question: when fine-tuning on my own data, should I fine-tune on my data alone, or on my data plus the original instruction-tuning images (COCO train2017, GQA images, OCR-VQA, TextVQA train_val_images, VisualGenome part1/part2)? Or should I take your already fine-tuned model and then fine-tune it on my data? Which do you recommend?

luogen1996 commented 5 months ago

I recommend mixing all the data together and training your model on the combined set. In my experiments on VQAv2, this was the best approach. Besides, more training epochs may be beneficial.
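For anyone unsure what "mixing" means in practice, here is a minimal sketch that concatenates LLaVA-style JSON annotation files (lists of conversation records) into one training file; the file names are hypothetical:

```python
# Sketch: merge several LLaVA-style annotation files into a single JSON
# file used for one mixed fine-tuning run.
import json

# Hypothetical paths: the official instruction data plus a custom dataset.
sources = ["llava_v1_5_mix665k.json", "my_detection_80k.json"]

mixed = []
for path in sources:
    with open(path) as f:
        mixed.extend(json.load(f))  # each file is a JSON list of samples

with open("mixed_train.json", "w") as f:
    json.dump(mixed, f)
```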

gapjialin commented 5 months ago

Thank you for your patient answer.

gapjialin commented 5 months ago

I recommend mixing all the data together and training your model on the combined set. In my experiments on VQAv2, this was the best approach. Besides, more training epochs may be beneficial.

Hello, I currently have a self-built dataset of 80K object-detection samples, used to detect the positions of objects in images. The image size is 1920x1080. When I fine-tune with all the data mixed together, the model has difficulty detecting the positions of objects in the image. What could the problem be? The machine I am using is 8xA40, and train_eval_llava_hr_x.sh is used for fine-tuning. An example from my dataset is as follows:
human: Verify if there is a presence of people in the image.
gpt: There are 1 people.
human: Pinpoint and describe the exact spots where each person can be found in this picture.
gpt: person 1's bounding box coordinate of the region is [0.32, 0.65, 0.34, 0.75].

luogen1996 commented 5 months ago

Make sure the box coordinates are correctly processed. Note that our images are padded, so the box coordinates should be adjusted accordingly.

gapjialin commented 5 months ago

Make sure the box coordinates are correctly processed. Note that our images are padded, so the box coordinates should be adjusted accordingly.

By reading your paper, I noticed that it mentions high-resolution instruction tuning, which is exactly what I want to learn about. If the high-resolution instruction tuning uses 1024x1024 and my dataset's image size is 1920x1080, and, for example, the normalized target boxes for a certain image in my dataset are [0.32, 0.65, 0.34, 0.75], how should I handle them? Looking forward to your answer, thank you!

gapjialin commented 5 months ago

Make sure the box coordinates are correctly processed. Note that our images are padded, so the box coordinates should be adjusted accordingly.

My idea is to convert the target-box coordinates from the original 1920x1080 image to the 1024x1024 resolution: scale the image proportionally, center-align it, keep the target box at the corresponding position, and obtain a new normalized target box, then perform the high-resolution instruction tuning. Is this understanding correct?

luogen1996 commented 5 months ago

Please refer to our preprocessing function and visualize the box coordinates for debugging.

```python
from PIL import Image

def expand2square(pil_img, background_color):
    # Pad the shorter side so the image becomes square, keeping the
    # original content centered.
    width, height = pil_img.size
    if width == height:
        return pil_img
    elif width > height:
        result = Image.new(pil_img.mode, (width, width), background_color)
        result.paste(pil_img, (0, (width - height) // 2))
        return result
    else:
        result = Image.new(pil_img.mode, (height, height), background_color)
        result.paste(pil_img, ((height - width) // 2, 0))
        return result
```
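To connect this to the earlier question: below is a minimal sketch (my own illustration, not code from the repo) of how a normalized [x1, y1, x2, y2] box for a 1920x1080 image can be remapped after expand2square padding. Only the y coordinates change here, because the width is already the longer side; and since resizing the padded square to 1024x1024 is a uniform scale, the normalized coordinates stay valid afterwards.

```python
def pad_box_to_square(box, width, height):
    """Remap a normalized [x1, y1, x2, y2] box from the original image
    to the square image produced by expand2square."""
    x1, y1, x2, y2 = box
    side = max(width, height)
    # Pixel offsets introduced by the centered padding.
    pad_x = (side - width) // 2
    pad_y = (side - height) // 2
    return [
        (x1 * width + pad_x) / side,
        (y1 * height + pad_y) / side,
        (x2 * width + pad_x) / side,
        (y2 * height + pad_y) / side,
    ]

# Example: a 1920x1080 image is padded to 1920x1920 (420 px above and below),
# so x stays the same while y is compressed and shifted toward the center.
print(pad_box_to_square([0.32, 0.65, 0.34, 0.75], 1920, 1080))
# -> [0.32, 0.584375, 0.34, 0.640625]
```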

gapjialin commented 5 months ago

Please refer to our preprocessing function and visualize the box coordinates for debugging.

Thank you! Let me try!

gapjialin commented 5 months ago

Please refer to our preprocessing function and visualize the box coordinates for debugging.

Thank you! I noticed that during fine-tuning, the dataset contains coordinate descriptions for the images. Do these coordinates correspond to the image before or after the resolution is reduced? I really want to know the answer to this question.

gapjialin commented 5 months ago

Please refer to our preprocessing function and visualize the box coordinates for debugging.

I tried resizing my dataset images to 1024x1024 without changing the aspect ratio, and adjusted the coordinates in the dataset to match the resized images. After completing these steps I fine-tuned again, but the results were not very good: the model could not produce the correct coordinate answers. Where did the problem arise?