finitearth opened 1 year ago
I have an engineering exam bank of about 1000 questions with simple illustrations. I have the questions already in JSONL format but some of them rely on the image to answer correctly.
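For what it's worth, a minimal sketch of how one of those image-dependent records could be laid out in the JSONL (the field names and file path here are illustrative assumptions on my part, not a schema the framework prescribes):

```python
import json

# Hypothetical record for an exam question that relies on an illustration.
# "question", "choices", "answer", and "image" are assumed field names.
record = {
    "question": "Which truss member carries the largest load in the figure?",
    "choices": ["AB", "BC", "CD"],
    "answer": "BC",
    "image": "figures/truss_042.png",  # path (or base64 payload) once vision lands
}

line = json.dumps(record)            # one JSON object per JSONL line
parsed = json.loads(line)
needs_vision = "image" in parsed     # flag questions that depend on the image
print(needs_vision)
```

Records without an `image` key could then be evaluated today, with the flagged ones held back until vision support arrives.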
Currently our API doesn't support vision, but if it does we'll definitely add support for that to this framework!
Are there plans to evaluate the vision modality of GPT-4? I'm interested in how GPT-4 would perform on classification tasks under zero- and few-shot learning, and how it compares to vision-only models. If the few-shot learning capabilities of LLMs translate to other modalities, this would be a real game changer.
Question out of curiosity: how was the vision modality incorporated? Maybe similar approaches could be taken for other modalities, such as audio or video? Would be an interesting open-source project for sure :)