nvaccess / nvda

NVDA, the free and open source Screen Reader for Microsoft Windows

OFFER: SMART IMAGE RECOGNITION #16281

Open haruncetinkaya306 opened 6 months ago

haruncetinkaya306 commented 6 months ago

Is your feature request related to an issue? Please define.

A smart image recognition feature was introduced in the March update of JAWS. It uses ChatGPT and Google Gemini to describe photos in detail and eliminates all accessibility issues related to photos. ChatGPT in particular does this job very well.

Describe your desired solution

It would be nice to bring such a feature to NVDA. I know that those who follow screen readers closely have noticed this feature. I think there will be such a demand for NVDA in the future when all users are aware of this feature.

Describe the alternatives you considered

None.

Additional context

XLTechie commented 6 months ago

For the moment, there are multiple add-ons attempting to do this. Have you tried them?

I know it is not the most satisfying solution.

Also, does JAWS charge extra for this? ChatGPT tends to have a cost.

XLTechie commented 6 months ago

Actually there are only three in the store.

AI Content Describer; Open AI; XposeImage Captioner.

Some will be incompatible depending on your NVDA version, and you will have to use the override compatibility option to try them, or use a version from the beta or dev channels.

haruncetinkaya306 commented 6 months ago

@XLTechie Using this feature in JAWS is free. I don't understand: which of these can describe photographs?

XLTechie commented 6 months ago

I have not used any of those, but probably AI Content Describer is the most likely to work. This is a guess.

haruncetinkaya306 commented 6 months ago

It's nice to have this as a plugin, but I would request that this feature be added to NVDA's core. @gerald-hartig

Emil-18 commented 6 months ago

@haruncetinkaya306 Maybe they slightly increased the price of JAWS to accommodate this new functionality?

seanbudd commented 6 months ago

@haruncetinkaya306 - hi, please do not encourage or discuss illegal software usage here, thanks

Adriani90 commented 6 months ago

This is something that should be, and already is, successfully handled in an add-on, for the following reasons:

  1. Maintenance efforts are high (e.g. updating the available AI models, API interfaces etc.). For example, making requests to Haiku works very differently from making requests to ChatGPT. Expert knowledge is therefore crucial, and maintenance currently depends on one person who is very experienced in this field; that knowledge is not yet widespread in the community. The add-on AI Content Describer is already very good at describing things, and the author is very committed to releasing regular updates; the last release was on 14 March this year: https://github.com/cartertemm/AI-content-describer/ The add-on also already has a very active community on the audio games forum: https://forum.audiogames.net/topic/50203/ai-content-describer-for-nvda/
  2. This will definitely generate costs for users, which contradicts NVDA's core philosophy of using Windows at no more cost than sighted people. For example, approx. 733 thousand words correspond to approx. 1 million tokens, which cost roughly 10 USD with GPT-4. Depending on the prompts users create for their requests, costs could end up higher.
  3. Microsoft's Copilot could serve as a freeware alternative, but the results seem quite low quality.
  4. There is still a lot of discussion going on about data privacy and security concerns related to AI models, and this could negatively affect NVDA usage in corporate environments if the feature were introduced in the core.

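
As a rough illustration of the cost point in item 2, the arithmetic can be sketched in Python. The two figures (about 733 thousand words per 1 million tokens, and about 10 USD per 1 million tokens) are taken from the comment above and are assumptions, not current pricing. The function names and the usage numbers are hypothetical, and the cost of sending the image itself is not modelled.

```python
# Figures quoted in the comment above; assumptions, not current pricing.
WORDS_PER_MILLION_TOKENS = 733_000
USD_PER_MILLION_TOKENS = 10.0


def words_to_tokens(words: int) -> float:
    """Estimate the token count for a given number of words."""
    return words * 1_000_000 / WORDS_PER_MILLION_TOKENS


def estimate_cost_usd(words: int) -> float:
    """Estimate the cost in USD of processing `words` words of text."""
    return words_to_tokens(words) * USD_PER_MILLION_TOKENS / 1_000_000


if __name__ == "__main__":
    # Hypothetical usage: 20 descriptions a day, about 150 words each, for 30 days.
    monthly_words = 20 * 150 * 30
    print(f"{monthly_words} words: about {estimate_cost_usd(monthly_words):.2f} USD")
```

Under these assumptions the cost scales linearly with the amount of text, so longer prompts or more verbose descriptions raise the bill proportionally.
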
haruncetinkaya306 commented 6 months ago

@Adriani90 If it is true that this feature is paid, of course it should not be added to NVDA's core, because that goes against the reason NVDA was created. However, what I don't understand is how a screen reader that has this feature can offer it for free, because I do not encounter any request for payment when using it. JAWS already had a smart image recognition feature in 2019, but it provided limited information. Now the feature sends images to artificial intelligence services and gets detailed descriptions from them free of charge.

haruncetinkaya306 commented 6 months ago

Hello, I just checked the feature again. When I set JAWS to run in its 40-minute mode and try the feature, it tells me to activate the license. This may mean that there is a fee.

seanbudd commented 6 months ago

Just to be clear, if we were to integrate this into NVDA it would have to be a free component. This is not impossible. The most likely options at this stage would be to install open source AI models as an optional component of NVDA, or to use an inbuilt Windows service, like the one used for OCR, if one is created for Copilot. Both of these would avoid making internet requests to analyse the image. Alternatively, we would have to find a free internet-based service, but unless there was strong user trust in such a service handling the data, this would probably be best left as an add-on.

Adriani90 commented 6 months ago

I think it is good to leave this open, since the first alternative you propose sounds attractive even though it is still far off. Using either an integrated Windows service, when available, or an open source AI model built into the NVDA installation seems the best solution; other internet-based requests will not be accepted, especially in corporate environments.

Adriani90 commented 6 months ago

Be My Eyes, with its Be My AI feature, is currently working on Windows integration, so picture description with Be My AI will also be available on desktops and laptops. Having this feature built into the screen reader will therefore not really be necessary in the near future. Moreover, since Be My AI focuses on this area, improvements to picture descriptions can be proposed to the Be My Eyes developers, which I guess is more efficient.

gerald-hartig commented 6 months ago

NV Access had similar questions asked of us at CSUN around when we would be bringing AI image recognition into the core. There's a lot to say on the topic, but very briefly our position at the moment is that we're very excited by the potential benefits that this technology can bring and the add-on system is a marvellous way of exploring these benefits and experimenting on how best to integrate them into a screen reader.

We feel that bringing this functionality into the core requires the technology to first tick the following boxes:

  1. Private and low cost
  2. Fast and responsive
  3. Global language support
  4. Accurate and useful

In my personal opinion, it feels like the technology is improving every day and that none of these prerequisites will prove to be intractable.

Adriani90 commented 4 months ago

Finally, Be My AI is now available in the Windows store, and it works amazingly well, including for describing the screen, not only images. It can also summarize things etc.: https://www.bemyeyes.com/blog/be-my-eyes-desktop-app-on-windows

To be honest, if NVDA brings any AI feature into the core, it should not focus on image description but rather on navigation assistance on the screen, e.g. navigating and performing actions in inaccessible software. Image description with Be My AI on Windows is now a really nice feature, and it is free. NVDA developers could instead invest time in making AI really help in inaccessible places, e.g. telling NVDA to draw something on a concept board, whereupon NVDA would draw it for you with the mouse. This would bring much more value than a pure image description feature.