mdn / yari

The platform code behind MDN Web Docs
Mozilla Public License 2.0

The AI help button is very good but it links to a feature that should not exist #9230

Open nyeogmi opened 1 year ago

nyeogmi commented 1 year ago

Summary

I made a previous issue pointing out that the AI Help feature lies to people and should not exist because of potential harm to novices.

This was renamed by @caugner to "AI Help is linked on all pages." AI Help being linked on all pages is the intended behavior of the feature, and @caugner therefore pointed out that the button looks good and works even better, which I agree with -- it is a fantastic button and when I look at all the buttons on MDN, the AI Help button clearly stands out to me as the radiant star of the show.

The issue was therefore closed without being substantively addressed (because the button is so good, which I agree with).

I think there are several reasons the feature shouldn't exist which have been observed across multiple threads on platforms Mozilla does not control. Actually, the response has been universally negative, except on GitHub where the ability to have a universally negative response was quietly disabled Monday morning.

Here is a quick summary of some of those reasons.

One: the AI model is frequently wrong. Mozilla claims it intends to fix this, but Mozilla doesn't employ any GPT-3.5 developers, and OpenAI has been promising to fix it for months. It's unlikely this will actually happen.

Two: contrary to @caugner's opinion, it's very often wrong about core web topics, including trivial information where there is no obvious excuse. Here are some examples:

Even examples posted by people who support the existence of the AI contain significant errors:

(I say examples, but note: this is the only usage example provided by a person who supported the existence of the feature, and it contained an error.)

This is identical to one of the categories of problem seen on Stack Exchange when Stack Exchange introduced its generative AI assistant based on the same model, and it led to Stack Exchange removing the assistant because it was generating bizarre garbage.

Three: it's not clear that any documentation contributors were involved in developing the feature. Actually, it's still unclear who outside of @fiji-flo and @caugner was involved at all. Some contributors, including @sideshowbarker, have now objected, and the process has produced a default outcome: AI Explain was voluntarily rolled back, and AI Help remains in the product.

It is probably OK for those contributors to review each other's code, but they're also managing the response to the backlash. After a bunch of people have already signaled "hey, I have an active interest in this feature" by engaging with a relevant issue, excluding those people amounts to a ruling of "actually, you do not have an active interest!" -- and it's not clear on what basis that ruling was reached.

Four: the existence of this feature suggests that product decisions are being made by people who don't understand the technology or who don't think I understand it.


Overall, the change tells the story that MDN doesn't know who their average user is, but assumes that the average user is (1) highly dissimilar to the GitHub users who were involved in the backlash and (2) easy to sell to.

The fact is that in one day, measured in upvotes, you attracted backlash comparable to what the entire Stack Overflow strike attracted in a month. It would be a mistake to think only a small group of people are concerned; that attitude would be wishful thinking.

It seems like the fork in the road for MDN is:

If option 1 isn't sustainable, then between option 2 and option 3, option 3 is obviously better for humanity in the long run, and I would encourage MDN to make plans for its own destruction.

In the worst possible world, the attitude is correct and the users are easy to sell to. Well, in that case, you've created another product company and in doing so you've metaphorically elected to serve both God and money -- and as is evidenced by the recent implosions of every siloed social media company, that is always a great idea.


Again, the AI Help button is absolutely gorgeous and functions as intended. This issue is not about the AI Help button and therefore should not be closed as a button-related wontfix, or renamed by @caugner into a description of the behavior of the button.

URL

https://github.com/mdn/yari/issues/9208
https://github.com/mdn/yari/issues/9214

Reproduction steps

Pivot to a more aggressive funding model, then engage in a mix of panic and corporate groupthink.

Expected behavior

I think the button is amazing and you are doing a great job.

Actual behavior

The AI help feature should not exist.

Device

Desktop

Browser

Chrome

Browser version

Stable

Operating system

Windows

Screenshot

(screenshot attached)

Anything else?

No response


caugner commented 1 year ago

Full disclosure, I deliberately tricked the LLM by asking how to use MutationObserver for this purpose. But IMO that's a question a confused beginner is likely to ask, and the documentation should correct them rather than hallucinate a world in which they are correct.

@faintbeep Thanks for being honest, and glad to hear you had to trick AI Help to get a seemingly incorrect answer. Could you please report the answer using the "Report a problem with this answer on GitHub" link to create a (public) GitHub issue for it? That issue will then contain both the question(s) you asked and the answer you received, which makes it easier to reproduce and follow up. (So far we have received only 5 issue reports - all valid - since we added the link.)

It's important to mention that had you asked if you can detect size changes using MutationObserver instead (e.g. "Can I detect size changes with MutationObserver?"), AI Help would have told you that you cannot and pointed you to ResizeObserver. And my question "How can I detect size changes with MutationObserver?" was just rejected by AI Help. So I'm curious how you phrased that question.

It seems you insisted specifically on a solution with MutationObserver, and AI Help gave you what seems to me like a possibly valid solution to a subset of size changes (namely through style attribute changes, which may effectively change the size of an element), without mentioning this limitation though. Luckily there are the two links that allow the beginner (who, kudos, already heard about MutationObserver) to double-check, deepen their knowledge about MutationObserver and discover ResizeObserver through the "See also" section. Even if you don't find this helpful, maybe we can agree that there is some helpfulness in this?
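
For readers following along, here is a minimal sketch of the difference being discussed. The exact code AI Help produced isn't reproduced in this thread, so the MutationObserver variant below is only an illustration of the style-attribute workaround described above, and the element name is hypothetical.

```js
// ResizeObserver reports actual size changes, whatever caused them.
const panel = document.querySelector("#panel"); // hypothetical element

const resizeObserver = new ResizeObserver((entries) => {
  for (const entry of entries) {
    console.log("resized:", entry.contentRect.width, entry.contentRect.height);
  }
});
resizeObserver.observe(panel);

// The style-attribute workaround only fires when the `style` attribute
// changes -- one cause of size changes among many. Class changes, content
// changes, and viewport resizes are all missed.
const mutationObserver = new MutationObserver(() => {
  console.log("style attribute changed (the size may or may not have changed)");
});
mutationObserver.observe(panel, { attributes: true, attributeFilter: ["style"] });
```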

But seriously, if you actually report this as an issue, we can look into what improvements can avoid this kind of scenario. For example, we could update the MutationObserver page to better explain its differences from ResizeObserver, or an overview page for all the observers with their respective use cases could help (maybe it already exists; then we could look into why it wasn't deemed relevant enough, and ensure it's passed as context). And last but not least, it's an option to update our system instructions to prevent GPT-3.5 from suggesting solutions using unsuitable features, even if the user specifically asked for them.

PS: Just to make this clear once and for all, we are aware of the limitations of LLMs, and we know that the LLM doesn't understand the question or these instructions, and only uses statistics to come up with the next words. However, the crux is that it works surprisingly well, which is the reason why LLMs can provide value for users, why AI Help's answers are mostly helpful, and why we experiment with an LLM as part of this beta feature. The success of this experiment is yet to be evaluated, and all feedback is going to be taken into consideration.

acdha commented 1 year ago

However, the crux is that it works surprisingly well, which is the reason why LLMs can provide value for users

You’ve asserted this but not supported the claim. Even if we ignore the inaccuracies, the positive examples provided have mostly been disorganized and turgid, so I think the better way to convince people would be by having real human testimonials: survey learners in the target audience and see how helpful they found it for solving real problems.

meejah commented 1 year ago

You’ve asserted this but not supported the claim. Even if we ignore the inaccuracies, the positive examples provided have mostly been disorganized and turgid, so I think the better way to convince people would be by having real human testimonials: survey learners in the target audience and see how helpful they found it for solving real problems.

Definitely a better approach than asking LLMs to evaluate each other!

Perhaps this could be improved further: divide the target audience into two and give them all the same (short) task. One group gets to use only MDN for help and the other gets to use MDN + "AI Help". Have professionals evaluate the quality of the results from both groups.

acdha commented 1 year ago

Perhaps this could be improved further: divide the target audience into two and give them all the same (short) task. One group gets to use only MDN for help and the other gets to use MDN + "AI Help". Have professionals evaluate the quality of the results from both groups.

The sad part is that old Mozilla could have had volunteers to do this if they were training an open LLM and approached this as a research project without a predetermined outcome. As a former donor and contributor, “help OpenAI pro bono” is just not as compelling a pitch.

Xkeeper0 commented 1 year ago

I decided to test the "ask it for something impossible and it will answer as if it was possible" thing above by asking a question I've had myself many times over the years: How do I use CSS selectors to select an element only if it contains a specific child element?

The AI response not only gets the asked question backwards, but then it answers the rewritten question (which misses the entire point).

(screenshot of the AI Help response)

A Google query, "css select element if it has a specific child", gives as its first result this Stack Overflow answer; reading the results quickly instills the idea that there likely isn't anything usable just yet but might be in the future, and points to what it will likely be, the ":has" selector... which, amusingly, is not supported by Firefox right now.
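
For anyone unfamiliar with it, a minimal sketch of what the :has() answer looks like in practice; the element names are hypothetical, and it assumes a browser that supports :has() (which, per the above, Firefox did not at the time).

```js
// :has() matches an element based on what it contains -- the same selector
// works in a stylesheet, e.g. `li:has(> ul) { font-weight: bold; }`.
const listItemsWithSublists = document.querySelectorAll("li:has(> ul)");
console.log(`${listItemsWithSublists.length} list items contain a direct <ul> child`);
```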

For curiosity's sake I decided to reformat the question and try again; by this point I know it won't give me an accurate, correct answer, but once again it manages to get basic details wrong:

(screenshot of the second AI Help response)

The only "trick" involved in this was asking it a question I already knew the answer to.

workingjubilee commented 1 year ago

How can anyone validate the information provided by an AI assistant if the sites they were supposed to validate that information against are the ones providing that "AI assistance"? How do they know who to trust? This problem most severely negatively affects those who do not have an abundance of spare time, energy, and knowledge to validate the output of AI tools, which are the people who most need assistance from things like MDN.

MDN's AI Help now sets the baseline of trustworthiness for help from MDN, because it is so much less trustworthy than the rest of the site, and if it is trusted as a vector of information, there is no reason to believe such information has not been incorporated elsewhere on the site in less obvious ways. No one is auditing the edit history of every single article here, and the obvious next step is "the AI starts making edits". Now that you've made it clear you are happy to incorporate this tool into the text displayed for individual articles via "AI Explain", it's not enough to roll things back to "AI Help". The entire thing has to go; otherwise I have no reason to assume you're not just going to reimplement AI Explain later, when things quiet down, as everyone tends to. Thus, in order for MDN to be useful, I will have to start auditing the edit history of every article, which is harder for people to do now that it's a git history (git has notoriously poor UX).

Defensive maneuvers against misinformation should not cost more than the misinformation costs to generate. Otherwise the misinformation wins. Checking whether "AI Editing" was enabled while I was away, every time I reference or cite MDN, is not cost-efficient. So the only defensive maneuver that makes sense is to assume you've abandoned your responsibility to provide reliable and accurate information, as that is the easiest explanation for why a tool that does not provide reliable and accurate information was incorporated into a website that does. "It generates value" is not enough if it raises the cost of using the resources on MDN overall.

nyeogmi commented 1 year ago

(Periodic reminder: this thread has literally no multiplier effect and the devs aren't listening to you. If you want anything to happen, post about it on a platform that has a multiplier effect.)

ToxicFrog commented 12 months ago

As announced in the Community Call invitation, we're sharing our answers for anyone who couldn't attend. We'll be adding them in the individual GitHub Discussions threads.

Is there a timeline for this? When can we expect answers and/or the transcript to be posted?

obfusk commented 12 months ago

I keep seeing the proponents of this conflate seeming to be helpful with actually being helpful, and assume that there is no meaningful difference between inaccurate information provided by well-meaning people (e.g. on Stack Overflow) and the kind of inaccurate information that an LLM can produce.

See my comment here.

DavidJCobb commented 11 months ago

It seems you insisted specifically on a solution with MutationObserver, and AI Help gave you what seems to me like a possibly valid solution to a subset of size changes (namely through style attribute changes, which may effectively change the size of an element), without mentioning this limitation though.

There is no world in which "tell me when inline styles change" or even "tell me when size-related attributes change" could ever be an adequate answer to "tell me when the size of a typical element changes." The latter is asking about an effect; the former focuses only on one cause among so, so very many. (And it's overbroad in its wrongness, too: it doesn't even double-check offsetWidth and so on for actual changes; every style change is assumed to be a size change!) Calling this "possibly valid" is a breathtakingly flimsy rationalization.
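
To make the point concrete, even the narrow style-attribute workaround would at minimum need a check like the following before declaring a size change. This is an illustrative sketch only, with a hypothetical element; it is not the code AI Help produced.

```js
const target = document.querySelector("#target"); // hypothetical element
let lastWidth = target.offsetWidth;
let lastHeight = target.offsetHeight;

const observer = new MutationObserver(() => {
  // Compare actual rendered dimensions instead of assuming that every
  // style change is a size change.
  if (target.offsetWidth !== lastWidth || target.offsetHeight !== lastHeight) {
    lastWidth = target.offsetWidth;
    lastHeight = target.offsetHeight;
    console.log("size actually changed:", lastWidth, lastHeight);
  }
});
observer.observe(target, { attributes: true, attributeFilter: ["style"] });
```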

And last but not least, it's an option to update our system instructions to prevent GPT-3.5 from suggesting solutions using unsuitable feature, even if the user specifically asked for it.

Explain how.

PS: Just to make this clear once and for all, we are aware of the limitations of LLMs, and we know that the LLM doesn't understand the question or these instructions, and only uses statistics to come up with the next words.

You say this, but it directly contradicts your last remark. You can "update your system instructions" to overcome the fundamental nature of LLMs? You're acknowledging the limitations of LLMs but refusing to actually consider them, and this is evident in everything you've been saying: it's evident in you projecting confidence that with the right prompt, the right prayer to the toaster oracle, you can get it to reliably correct mistakes; it's evident in you assuming that someone definitely has to be acting in bad faith and insisting that your genius machine provide a wrong answer, for the machine to do so.

(The LLM provided a correct answer when you asked it, so clearly, it "knows" the answer, right? If it gave someone else a wrong answer, it must be because shenanigans are afoot. It can't be that innocent enough variations in wording or phrasing -- variations you simply haven't thought of and tested -- might trip up a program that reacts entirely and blindly to wording with no mental model of what words actually mean.)

And let's not forget the context of you failing to actually demonstrate the awareness you say you have: multiple GitHub issues with hundreds upon hundreds of comments' worth of explanations of LLMs' limitations, presented and explained in just about every way possible, in some cases with examples pulled from MDN itself.

At best, assuming good faith as hard as I can, you've shown an appalling level of myopia that should immediately disqualify someone from making or in any way being involved in any noteworthy decisions about how one of the web's most critical developer documentation sites should be run; but it's becoming increasingly difficult to believe that this is the thoughtlessness it looks like.

Xkeeper0 commented 11 months ago

I feel it's worth pointing out what one of the community call answers had to say: https://github.com/orgs/mdn/discussions/414#discussioncomment-6541058

It's MDN's fault for completely failing to listen to the community here and to consider them when developing new features for MDN, and that's why so many people felt the need to express their concerns.

An extremely vocal small set of our community is not the entire MDN community. We thank you for your feedback, and concern, and we’re taking substantial portions of it on board.

We're just "an extremely vocal small" minority, apparently, because anyone who simply hasn't responded clearly finds AI integration to be a flawless addition.

obfusk commented 11 months ago

And this feature was built for a subset of our community not particularly represented on the issues discussing this feature, and whom many people commenting on the feature entirely forgot about: learners and those not yet capable of finding the correct information on MDN.

I'm pretty sure that, far from forgetting about them, we've actually expressed a lot of concern that adding more incorrect information to MDN will not help those "not yet capable of finding the correct information"; quoting myself:

This has me worried. We've raised multiple concerns about the inaccuracy of the LLM output. Saying "you can ignore it" just shifts the responsibility for determining whether the output is inaccurate and should be ignored or fact-checked to the users, which is especially problematic given that:

Those most likely to want a simple summary of technical documentation are those least likely to determine the truth and accuracy of an LLM's output supposedly explaining the content they are not knowledgeable about

alahmnat commented 11 months ago

we are aware of the limitations of LLMs, and we know that the LLM doesn't understand the question or these instructions, and only uses statistics to come up with the next words.

See, you say that, but then your very next words are

However, the crux is that it works surprisingly well

No, it doesn’t. It appears to work surprisingly well, but you can never be certain whether you’ve gotten the one true book containing your life’s story or one of the ones that’s just 60,000 q’s in a row from the infinite library of every combination of words ever made, and that is fundamentally the problem.

As for “incorrect answers can be helpful,” I’d like to go on record as saying that I find incorrect answers given to me by a tool that is supposed to give me correct information to be nothing but infuriating. I don’t even like getting wrong information from Stack Overflow answers because now I’m having to waste more of my time trying to figure out why it’s not working as expected. I’m sure we’re all more than familiar with adapting Stack Overflow answers that sort of answer the same question we’re trying to ask, but that, too, is a fundamentally different process than “ask the magic answer box my exact question and get an exact answer that should work”.

Finally, I think if you really wanted to impress upon your users the limitations of these tools, you wouldn’t call them “AI” anything. You’d call them “LLM Help” and “LLM Explain”. “AI” has so many sci-fi implications about sentience and reasoning and understanding embedded in it that expecting people to see “AI” in the name of a tool and think “box that makes convincing-sounding sentences” is, frankly, laughable. Despite disclaimers plastered every which way, people are still using ChatGPT to do things like write translations and write legal briefs full of hallucinated court case citations. People will not use these tools the way you expect them to, doubly so if you keep insisting on calling them something they very blatantly are not: artificial intelligence.

resuna commented 11 months ago

Finally, I think if you really wanted to impress upon your users the limitations of these tools, you wouldn’t call them “AI” anything. You’d call them “LLM Help” and “LLM Explain”. “AI” has so many sci-fi implications about sentience and reasoning and understanding embedded in it that expecting people to see “AI” in the name of a tool and think “box that makes convincing-sounding sentences” is, frankly, laughable.

In a fair world the people who introduced these programs by referring to them as AI would have burst into black flames for the sheer hubris of it all. They are parody generators. Nothing more.

ghalfacree commented 11 months ago

I am aware that management has long moved on and am not expecting a response, here, but I wanted to raise this nevertheless just in case someone who can effect change sees it by chance.

The paper Who Answers It Better? An In-Depth Analysis of ChatGPT and Stack Overflow Answers to Software Engineering Questions, Kabir et al, 2023 (preprint) delivers exactly what its title suggests. It finds that ChatGPT answers for software engineering questions are wrong 52 per cent of the time - to within a margin of error the same as tossing a coin.

But it goes deeper than that. Because ChatGPT and other LLMs write very, very convincingly, their answers are often preferred over human equivalents (from Stack Overflow, in the case of Kabir et al) - 39.34 per cent of the time, in this case. Of the preferred answers, over 77 per cent were wrong.

So, given MDN is using the same technology, I believe it would not be unreasonable to assume the same holds true: of those users clicking the button to report an answer as "helpful," as many as 77 per cent may have done so on an answer which is wrong. But, because they're unfamiliar with the subject matter and ChatGPT's output is designed to sound helpful, they have no idea they're being led up the garden path.

noahbroyles commented 10 months ago

In my professional opinion, LLMs have no place being included on MDN, where developers come looking for trustworthy technical information.

As someone who has used ChatGPT for technical questions numerous times, I know from experience that although it can be quite useful sometimes, it very frequently spews out misinformation and leads you down a rabbit hole of plausible-looking garbage. Often it can take more time trying to get ChatGPT to arrive at a working solution than it would to just use a trustworthy source of documentation (like MDN is supposed to be).

This is very confusing and frustrating, especially for newer developers. The things that LLMs can actually answer accurately (most of the time) are simple, well-known things that a quick Google search would have sufficed for. There is a reason why ChatGPT is banned on Stack Overflow:

Overall, because the average rate of getting correct answers from ChatGPT and other generative AI technologies is too low, the posting of answers created by ChatGPT and other generative AI technologies is substantially harmful to the site and to users who are asking questions and looking for correct answers.

I also find it very concerning that newer developers turn to ChatGPT and AI in general as a source of guidance. It is too easy for developers to use it as a crutch. This is dangerous because unlike a calculator being used in mathematics, LLMs/ChatGPT do not always present factually accurate outputs. While using a calculator will always provide an accurate answer for the problem entered, LLMs have no such guarantee. Using GPT is not just detrimental to developers because it reduces their ability to do their own work, but also because it introduces a higher probability of error and often can waste a lot of time.

TL;DR: LLMs are not a good source of factual information, and as such MDN shouldn't expect to be considered a reliable source while they have one included on their website.

kyanha commented 9 months ago

I know that no action is going to be taken on this.

But I would be remiss if I didn't provide this link (not written by me): https://www.zdnet.com/article/third-party-ai-tools-are-responsible-for-55-of-ai-failures-in-business/

megmorsie commented 8 months ago

This is dangerous because unlike a calculator being used in mathematics, LLMs/ChatGPT do not always present factually accurate outputs. While using a calculator will always provide an accurate answer for the problem entered, LLMs have no such guarantee.

Yes! I just made this exact comparison to someone recently. So often the applications people are pushing LLMs for already have solutions (keyword searches, math calculations, boilerplates/templates, etc). And those solutions aren't using an insane amount of processing to get results, sapping communities of potable water, requiring a precarious data training labor pool, etc. The externalities of "AI" and LLMs are massive and it's so frustrating that people hand-wave these important factors away on top of the technology itself being demonstrably worse than things we already have.