pytorch / torchtune

PyTorch native finetuning library
https://pytorch.org/torchtune/main/
BSD 3-Clause "New" or "Revised" License

2D RoPE + CLIP updates #1973

Closed RdoubleA closed 1 week ago

RdoubleA commented 2 weeks ago

Context

Two-dimensional rotary positional embeddings (2D RoPE) have been added to vision transformers to improve performance. This was explored in papers such as Eva-02-CLIP, which found that 2D RoPE improved performance and gave more stable training than 1D RoPE for images. Another novel architecture, FiT (Flexible Vision Transformer for Diffusion Model), similarly employs 2D RoPE for image resolution generalization. Pixtral, a multimodal LLM, also uses a similar 2D RoPE mechanism, as seen in its Hugging Face implementation. A full survey of 2D RoPE can be found in this paper.

Here, we add VisionRotaryPositionalEmbeddings as a general component. Its forward is identical to that of RotaryPositionalEmbeddings; the major difference is in build_rope_cache, which must take into account the x and y positions of the patches in the image grid, as defined by patch_size and tile_size. It simply applies 1D RoPE with half the usual dim to the x and y positions independently and concatenates the results.
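
To make the cache construction concrete, here is a minimal pure-Python sketch of the idea. It is not the code in this PR. The function names, the zero-based row-major patch ordering, the base of 10000, and the omission of any CLS token handling are all assumptions made for illustration.

```python
import math


def rope_angles_1d(position, dim, base=10000.0):
    """Standard 1D RoPE angles: position * base^(-2i/dim) for i in [0, dim/2)."""
    return [position * base ** (-(2 * i) / dim) for i in range(dim // 2)]


def build_2d_rope_cache(tile_size, patch_size, dim, base=10000.0):
    """Hypothetical helper: one list of (cos, sin) pairs per patch.

    Assumes a square tile, zero-based patch coordinates in row-major order,
    and no CLS token. Each patch gets dim // 2 rotation angles: the first
    half encodes its x position, the second half its y position.
    """
    if dim % 4 != 0:
        raise ValueError("dim must be divisible by 4 to split across x and y")
    grid = tile_size // patch_size  # patches per side of the tile
    cache = []
    for idx in range(grid * grid):
        x, y = idx % grid, idx // grid
        # 1D RoPE with half the usual dim on each axis, then concatenate.
        angles = rope_angles_1d(x, dim // 2, base) + rope_angles_1d(y, dim // 2, base)
        cache.append([(math.cos(a), math.sin(a)) for a in angles])
    return cache


def apply_rope(vec, cos_sin):
    """Rotate consecutive pairs (vec[2i], vec[2i+1]) by the i-th cached angle."""
    out = []
    for i, (c, s) in enumerate(cos_sin):
        a, b = vec[2 * i], vec[2 * i + 1]
        out.extend([a * c - b * s, a * s + b * c])
    return out


# A 4x4 tile with 2x2 patches gives a 2x2 grid of patches.
cache = build_2d_rope_cache(tile_size=4, patch_size=2, dim=8)
rotated = apply_rope([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], cache[3])
```

Because the rotation step is the same pairwise rotation as in 1D RoPE, only the cache differs between the two variants, which matches the description above of the forward being shared.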

This is exposed as use_rope in the clip_vision_encoder builder; setting it enables 2D RoPE. I also include various fixes for CLIP and additional configurability.

Changelog

What are the changes made in this PR?

Test plan

pytorch-bot[bot] commented 2 weeks ago

:link: Helpful Links

:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1973

Note: Links to docs will display an error until the docs builds have been completed.

:white_check_mark: No Failures

As of commit 77ecb82a3ae5e5ab730e2f0898db4fa1b9985c18 with merge base bca5899480f54ebb85fea16231707ec36ee606ad: :green_heart: Looks good so far! There are no failures yet. :green_heart:

This comment was automatically generated by Dr. CI and updates every 15 minutes.