When loading the FashionCLIP model from HF using only the image encoder, like this:
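(Roughly along these lines — a minimal sketch rather than the exact snippet, assuming the `patrickjohncyh/fashion-clip` checkpoint and the standard `transformers` `CLIPVisionModel` / `CLIPImageProcessor` classes.)

```python
import torch
from PIL import Image
from transformers import CLIPVisionModel, CLIPImageProcessor

model_id = "patrickjohncyh/fashion-clip"  # assumed FashionCLIP checkpoint on the Hub

vision_encoder = CLIPVisionModel.from_pretrained(model_id)
processor = CLIPImageProcessor.from_pretrained(model_id)

image = Image.open("some_garment.jpg")  # any RGB image
inputs = processor(images=image, return_tensors="pt")  # resized/cropped to 224x224

with torch.no_grad():
    outputs = vision_encoder(**inputs)

embeddings = outputs.last_hidden_state  # one vector per token (patches + CLS)
print(embeddings.shape)  # -> torch.Size([1, 50, 768])
```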
This outputs an array of shape (1, 50, 768).

Two concerns here:

1. Does this mean that the FashionCLIP image encoder divides the image into 50 patches? Why isn't the original architecture of CLIP kept for FashionCLIP?
2. Also, are these the patch embeddings, or are they the result of projection layers that reduce the dimensionality to 768? I ask because the original CLIP model outputs 1024-dimensional embeddings right after the Vision Transformer encoder.
Thanks