mdn / yari

The platform code behind MDN Web Docs

The AI help button is very good but it links to a feature that should not exist #9230

Open nyeogmi opened 1 year ago

nyeogmi commented 1 year ago

Summary

I made a previous issue pointing out that the AI Help feature lies to people and should not exist because of potential harm to novices.

This was renamed by @caugner to "AI Help is linked on all pages." AI Help being linked on all pages is the intended behavior of the feature, and @caugner therefore pointed out that the button looks good and works even better, which I agree with -- it is a fantastic button and when I look at all the buttons on MDN, the AI Help button clearly stands out to me as the radiant star of the show.

The issue was therefore closed without being substantively addressed (because the button is so good, which I agree with).

I think there are several reasons the feature shouldn't exist, which have been raised across multiple threads on platforms Mozilla does not control. In fact, the response has been universally negative, except on GitHub, where the ability to have a universally negative response was quietly disabled Monday morning.

Here is a quick summary of some of those reasons.

One, the AI model is frequently wrong. Mozilla claims it intends to fix this, but Mozilla doesn't contain any GPT-3.5 developers and OpenAI has been promising to fix it for months. It's unlikely this will actually happen.

Two: contrary to @caugner 's opinion, it's very often wrong about core web topics, including trivial information where there is no obvious excuse. Here are some examples:

Even examples posted by people who support the existence of the AI contain significant errors:

(I say examples, but note: this is the only usage example provided by a person who supported the existence of the feature, and it contained an error.)

This is identical to one of the categories of problem seen on StackExchange when StackExchange introduced its generative AI assistant based on the same model, and it led to Stack removing the assistant because it was generating bizarre garbage.

Three: it's not clear that any documentation contributors were involved in developing the feature. Actually, it's still unclear who outside of @fiji-flo and @caugner was involved in the feature. Some contributors, including @sideshowbarker, have now objected, and the process has produced a default outcome, which is that AI Explain was voluntarily rolled back and AI Help remains in the product.

It is probably OK for those contributors to review each other's code, but they're also managing the response to the backlash. After a bunch of people have already signaled "hey, I have an active interest in this feature" by engaging with a relevant issue, excluding those people amounts to a ruling of "actually, you do not have an active interest!" -- and it's not clear on what basis that ruling was reached.

Four: the existence of this feature suggests that product decisions are being made by people who don't understand the technology or who don't think I understand it.


Overall, the change tells the story that MDN doesn't know who its average user is, but assumes that the average user is (1) highly dissimilar to the GitHub users who were involved in the backlash and (2) easy to sell to.

The fact is that in one day, measured in upvotes, you attracted backlash comparable to what the entire StackOverflow strike attracted in a month. It would be a mistake to think only a small group of people are concerned. That assumption would be wishful thinking.

It seems like the fork in the road for MDN is:

If option 1 isn't sustainable, then between option 2 and option 3, option 3 is obviously better for humanity in the long-run and I would encourage MDN to make plans for its own destruction.

In the worst possible world, the attitude is correct and the users are easy to sell to. Well, in that case, you've created another product company and in doing so you've metaphorically elected to serve both God and money -- and as is evidenced by the recent implosions of every siloed social media company, that is always a great idea.


Again, the AI Help button is absolutely gorgeous and functions as intended. This issue is not about the AI Help button and therefore should not be closed as a button-related wontfix, or renamed by @caugner into a description of the behavior of the button.

URL

https://github.com/mdn/yari/issues/9208 https://github.com/mdn/yari/issues/9214

Reproduction steps

Pivot to a more aggressive funding model, then engage in a mix of panic and corporate groupthink.

Expected behavior

I think the button is amazing and you are doing a great job.

Actual behavior

The AI help feature should not exist.

Device

Desktop

Browser

Chrome

Browser version

Stable

Operating system

Windows

Screenshot

image

Anything else?

No response


Ultrabenosaurus commented 12 months ago

so many eyeballs are on this thread that it's very difficult for any of us to respond specifically to concerns raised, because as you just said, they'll be considered Mozilla's public "official" response, rather than merely the thoughts of some of the engineers who worked on the feature

I'm quite concerned that, yet again, we've had an official response

I mean, come on folks.

I'm quite concerned that, yet again, we've had an official response which makes absolutely no reference to any sort of "correctness" for the output

@Ultrabenosaurus We are well aware of your questions about helpfulness vs correctness, and in all fairness, you should wait to hear our answers in the community call.

I mean, come on, Leo replied to this thread as a developer of these tools when a community call had already been arranged to discuss our concerns - that Leo was responding officially was a valid assumption. I was also in the process of typing my response before Leo's second comment pointing out they weren't commenting in an official capacity, and did not see it before posting mine.

yoe commented 12 months ago

Hi all, thanks for the concerns raised, we hope to answer them all in the community call on Wednesday, but one suggestion which has come up a few times with regard to AI Help is really interesting because we've had a very similar discussion internally, and I'd like to answer/expand upon it a little bit immediately. It's been asked a few times in a few ways, I'll answer the most recent instance:

I think there is a way to turn this into something useful: have the LLM return only links to MDN pages, and present those to the user. Links can't be wrong if you check that the resource exists first.

Yes, indeed, and in a way we already do this.

We don't ask the LLM to only use MDN docs in its reply, and then go and check if the references exist, because there's quite a few problems with that approach. First and foremost, as has been mentioned a few times in this thread, an LLM can just completely make up a source, and if we check that and that's happened, that really leaves us with nowhere to go with the response. I guess we'd have to return an error. These models also have a training cutoff point, so we would never be able to respond to queries about new pieces of documentation.

So, in other words, what you're saying is, the tool you're using is not the right tool for the job (if it was, you would be able to do the things you're saying you can't do with the tool you're using), but you want to use it anyway because "AI is cool" or something.

LLMs are amazing bits of technology, but other than making people go 'whoa, look at shiny cool things', generating plausible-sounding swaths of garbage text, and passing a made-up test for "AI" (known as the "Turing test"), they're not really very useful at all.

Instead, what we do is generate embeddings using an embedding model for each section of each piece of documentation on MDN in its raw Markdown form, store those, and then also generate an embedding for a user's question. Think of an embedding as placing a piece of text on a number line (but this number line has multiple dimensions): we can then find pieces of text which are similar, because they're close to each other on this number line. And that's what we do, we find the sections of MDN's documentation which are closest to the query.
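As a rough sketch of that retrieval step (hypothetical code, not MDN's actual implementation -- embedText() and docSections stand in for whatever embedding model and stored vectors are really used):

// Hypothetical illustration of embedding-based retrieval; not MDN's actual code.
// embedText() stands in for a call to an embedding model, and docSections for the
// pre-computed { id, vector } pairs stored for every section of MDN content.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

async function findRelevantSections(question, docSections, topN = 5) {
  // Place the question on the same multi-dimensional "number line" as the docs...
  const questionVector = await embedText(question);
  // ...then keep the sections whose vectors lie closest to it.
  return docSections
    .map((section) => ({ section, score: cosineSimilarity(questionVector, section.vector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topN);
}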

Yay, so stop there and provide links to the sections you found?

We then go and feed all that back into an LLM to effectively summarise those sections of content in the context of the question posed. But you're free to fully ignore that summary, and just skip straight to the links at the bottom of the response ("MDN content that I've consulted that you might want to check") - those are the pages found through the similarity search.

... that's stupid. You had the perfect response, but you ruined it by throwing it through a text mangler.

caugner commented 12 months ago

even a simple "We don't know" or "haven't got the answer to this yet"

@Zarthus I can assure you: We do know, and we do have the answers, and we'll be sharing the answers to the submitted questions in the Community Call.

To be frank, I don't think there is any way we could have posted comprehensive answers to every aspect of every message here and there, some of which are clearly misleading and not written in good faith, beginning with this issue's description (incorrectly reproducing my words; omitting that the examples included in the description are all from AI Explain, which was already disabled at the time of posting the issue, and not a single one from AI Help) and including your message equating Leo's message with "the public response from Mozilla". I could go on, but won't, because it's not my job and I don't have the time.

LeoMcA commented 12 months ago

... that's stupid.

@yoe perhaps communicating in such a manner was acceptable in the open source world of old, perhaps it's still acceptable in the Debian community, but in Mozilla we adhere to a set of Community Participation Guidelines and I'd like to draw your attention to some points:

Value each other’s ideas, styles and viewpoints. We may not always agree, but disagreement is no excuse for poor manners.

Be respectful in all interactions and communications, especially when debating the merits of different options. Be aware of your impact and how intense interactions may be affecting people.

It is not okay to insult, demean or belittle others.

Zarthus commented 12 months ago

some of which are clearly misleading and not written in good faith, beginning with this issue's description (incorrectly reproducing my words; omitting that the examples included in the description are all from AI Explain, which was already disabled at the time of posting the issue, and not a single one from AI Help) and including your message equating Leo's message with "the public response from Mozilla"

I don't think this issue's OP or myself have made attempts or are acting in bad faith, but if that is how you feel, that's how you feel I guess.

I've thought about if I wanted to write more - but it seems you already think of me as a bad actor rather than an avid enthusiastic user of MDN, so I'm not sure if anything I'll be saying would change your mind at this point.

yoe commented 12 months ago

... that's stupid.

@yoe perhaps communicating in such a manner was acceptable in the open source world of old, perhaps it's still acceptable in the Debian community,

It's not, and as the person who drove the process that produced the Debian code of conduct I really should've known better, but I guess in this instance I did before I thought.

Apologies. This was indeed uncalled for. I will try to do better.

caugner commented 12 months ago

I don't think this issue's OP or myself have made attempts or are acting in bad faith, but if that is how you feel, that's how you feel I guess.

I've thought about if I wanted to write more - but it seems you already think of me as a bad actor rather than an avid enthusiastic user of MDN, so I'm not sure if anything I'll be saying would change your mind at this point.

@Zarthus When I choose words like "not in good faith", I would kindly ask you not to inaccurately reproduce these as "bad faith" nor imply I think of you as a "bad actor". Ironically, I'm not sure GPT-3.5 would produce this inaccuracy (and yes, it's not super inaccurate, but still inaccurate).


@Zarthus As for your "the public response from Mozilla" comment: Good faith could have meant acknowledging Leo's comment as is, without promoting it to "the public response from Mozilla", and possibly responding to its content, rather than jumping straight to a sarcastic remark about how only 3.7% of it referred to the concerns. It looks like your comment did not, in fact, touch on the content at all (apart from the statistics bit).


As for this issue's OP, let me provide evidence for my claims of it being "misleading" and "not [written] in good faith":

First of all, the OP specifically mentions AI Help in the title and the first paragraph of the "Summary":

I made a previous issue pointing out that the AI Help feature lies to people and should not exist because of potential harm to novices.

Then, in the second paragraph, the OP goes on inaccurately reproducing my words:

and @caugner therefore pointed out that the button looks good and works even better,

The text may be harmless, but it's also 100% made up. I didn't even write anything remotely related in this comment. And that's just not okay. What other parts might be made up?

Further on, beginning with the sixth paragraph, the OP states this:

One, the AI model is frequently wrong. Mozilla claims it intends to fix this, but Mozilla doesn't contain any GPT-3.5 developers and OpenAI has been promising to fix it for months. It's unlikely this will actually happen.

Two: contrary to @caugner 's opinion, it's very often wrong about core web topics, including trivial information where there is no obvious excuse. Here are some examples:

Even examples posted by people who support the existence of the AI contain significant errors:

(I say examples, but note: this is the only usage example provided by a person who supported the existence of the feature, and it contained an error.)

Again, I had not actually made any statement whatsoever about how often "the AI model" is wrong, but my main point is that the OP conceals that those examples are for AI Explain (which was already disabled), and not AI Help. And this is very misleading, because anybody reading it might conclude that we keep AI Help running despite these examples (none of which are really shocking, to be honest).

Good faith also applies to myself, so let's give the OP author the benefit of the doubt. Did they just not know? Turns out both my comment in eevee's issue and my comment in the OP author's original issue that the OP author is referring to explicitly mention that AI Explain and AI Help work differently, and the OP author also acknowledged further below that these are different features, and that AI Explain was rolled back:

AI Explain was voluntarily rolled back and AI Help remains in the product

So the bottom line is that the OP is misleading and not written in good faith.


To finish in positive terms, I'm hoping to see y'all in the Community Call tomorrow 👋 , and to see more constructive comments written in good faith here going forward. 👍

snoozysoft commented 12 months ago

I have to chip in that, honestly, for as much effort as you're making in pointing out the confusion over AI Help and AI Explain, it mostly just says to me what a terribly thought-out idea it all was, given that you launched two separate AI-related things with incredibly similar names that do incredibly similar things. It's not especially surprising to me that people seem to be conflating them.

And honestly as much as I would like to hear you out and as much as I can understand how you presently might be frustrated, I can't help but feel your big response there is somewhat in bad faith in itself by finding any reason possible to obliquely dismiss the issue at hand. The confusion between Help and Explain is happening, sure, but the ultimate point of the issue is we don't want ANYTHING related to this AI on MDN at all regardless of what function it is performing.

Xkeeper0 commented 12 months ago

I feel like arguing that the OP might be raising this issue in bad faith is extremely disingenuous considering that you renamed their initial issue to completely mask the issue they raised:

caugner changed the #9208 title from "MDN can still automatically lie to people seeking technical information" to "AI Help is linked on all pages" last week

Zarthus commented 12 months ago

@caugner I appreciate your response, and I absolutely agree that I was wrong to equate a comment from a member of the repository with a public response.

I would add that confusion and assumptions happen because of a lack of information and transparency. I work with the information that I was given, and that includes that they are a member of the repository and that they've made AI Help commits.

Does that make me wrong to assume they are a representative of the feature? Most likely. Was it unreasonable? I see multiple people thinking the same so I personally do not think so.

Does that make me wrong to read "misleading and not in good faith" as implying bad faith rather than neutral faith? Most likely. Was it unreasonable? I personally do not think so. You could have written it in a way that implies it needs a bit more good faith rather than a neutral tone, and it wouldn't have come across that way.

Is there confusion between what AI help and AI explain are? Yes, let's face it; a week ago I couldn't tell you the difference.

It felt like a personal insult, to someone who had been passionate about the MDN for over 10 years.

I'll try to do better to shift that perspective more positive, all I ask is that the AI Help team does the same.

HerminaC commented 12 months ago

@Zarthus and all, on Mozilla MDN's team side, we'll also try to do better in the future. I am sorry that this release generated confusion and distress for you all. It was definitely not our intention. We might not have directly addressed every question raised on this thread, but I can assure you that we read them all carefully, and had several internal postmortem conversations. We are listening and we intend to listen even more moving forward. In case you haven't seen it yet, we published a postmortem a couple of hours ago. I know it won't immediately address all your questions and concerns, but I hope the postmortem, together with the community call that will follow today and a renewed commitment to involving our community in product work, will contribute to moving things forward from here.

resuna commented 12 months ago

Can you explain how this system was even considered in the first place? This class of software is not designed for the purpose of generating factual and accurate information. It is designed for generating output that looks like it is a plausible continuation of the prompt. That's it. If they do more than simply look like a credible parody it is only by chance. Why was this considered desirable?

nicuveo commented 12 months ago

In case you haven't seen it yet, we published a postmortem a couple of hours ago.

Thank you! But i'm sorry to say that this confirms everything we've been fearing.

We saw potential for uses such as summarizing documentation pages, explaining existing code samples, generating examples for pages lacking them, and generating unique code examples that combined different technologies according to user queries.

This is exactly what we've been warning you about: an LLM will not be good at any of those tasks, because it cannot be reliably accurate. Sure, it could be used to generate examples for pages lacking them, but the generated examples would need a human review before being shown to users, to make sure there aren't any inaccuracies. That's because an LLM's output should never be shown directly to users without first being reviewed by a knowledgeable human: it cannot be used for live content generation. Consequently, most of those examples you cite as a good potential use of "AI" are wrong, plain and simple.

This is why the distinction between "AI Help" and "AI Explain" is irrelevant: in both cases, you're trusting an LLM to generate content that the users will see, without human oversight. This is wrong. It will be inaccurate. You cannot make it accurate.

We also saw that other developer-focused companies were investing significantly in and building products on top of the technology. With MDN's documentation being publicly available under an open Creative Commons license, and considered best-in-class, it's reasonable to assume most models have been trained on our content, and products started to be announced which explicitly allowed the consumption of our documentation. This led us to understand that, irrespective of our personal feelings, our users already access our content through generative AI.

This is extremely sad to read, because it confirms my fear that this entire project is just blindly following the rest of the industry into putting "AI" into things that don't require it. The fact that others are using "AI" doesn't mean that you should. The fact that they're scraping your data (without your permission?) doesn't mean that you should use their product. The fact that "AI" will regurgitate an inaccurate version of your documentation in other places doesn't mean that your users should expect the MDN to also have tools that regurgitate an inaccurate version of your documentation.

Worse, this paragraph says outright that you assume that models have a correct version of your documentation. But you have no way to check that. You have no way of knowing what version of your documentation they have. When you make a change, you have no way of knowing when it will land in those models, or if the old version will be removed. You are blindly trusting nonsense-generating machines, trained on old versions of your data, to provide an accurate reflection of your content as it is today, without human oversight.

In short: with every response i've seen, i am more and more convinced that this project has been misguided from the start. It is built for the wrong reasons, it is built on the blind faith that LLMs will magically be correct.


Further in the article:

AI Explain will not be reintroduced to MDN until we are confident that it delivers reliable information that both readers and contributors can trust.

My point, our point, is that without human oversight of the results, there will never be a point at which an LLM can be trusted with the results. It's an LLM, not an "AI". It will output inaccuracies. While i appreciate pausing AI Explain for now, your insistence that there will be a point at which it might be trusted betrays a fundamental misunderstanding of the capabilities of an LLM.

For AI Help specifically, we added a "Report an issue with this answer on GitHub" link to all answers, making it easy for users to raise issues in a dedicated ai-feedback repository with all necessary context for us to expedite bug fixes and enhancements of the feature.

As pointed out before in this issue, there are two major issues here.

  1. The onus of verification is on your users, the very people coming to this site for information. You are writing a feature that outputs content that needs to be checked against the rest of your site, by your users.
  2. What good is feedback about inaccurate content if you don't control the machine that generates said bad content? Even if you ran your own LLM, you'd have no reliable way of correcting the model based on user-submitted feedback to guarantee that the next generated bit of text doesn't contain the same inaccuracy.

We know that technical accuracy is why our readers come to MDN

Then why are you so insistent on integrating a third-party inaccuracy generator into it?

esotericist commented 12 months ago

The onus of verification is on your users, the very people coming to this site for information. You are writing a feature that outputs content that needs to be checked against the rest of your site, by your users.

i also note again the people who are best equipped to recognize the errors are not the ones who are going to be most harmfully subjected to the errors.

presume for a moment that expert users start identifying problems with things experts ask about and look at and somehow -- through some mystical process -- successfully assist in forward steps to more accuracy. even with that, there's no guarantee said experts will at any point observe the same kind of flawed content the non-experts experience, either due to differences in the kinds of prompts that users with different mindsets produce, or due to the inherent lack of determinism these systems have designed into them.

the normal information correction loop for a knowledge base is irreparably shattered by the division in output that naturally comes from using llms to provide tailored responses.

kyanha commented 12 months ago

presume for a moment that expert users start identifying problems with things experts ask about and look at and somehow -- through some mystical process -- successfully assist in forward steps to more accuracy. even with that, there's no guarantee said experts will at any point observe the same kind of flawed content the non-experts experience, either due to differences in the kinds of prompts that users with different mindsets produce, or due to the inherent lack of determinism these systems have designed into them.

the normal information correction loop for a knowledge base is irreparably shattered by the division in output that naturally comes from using llms to provide tailored responses.

To put it another way, in a user story, with 'you' being Mozilla:

  1. I am a back-end coder, and I don't know most of the information necessary to modern Web development.
  2. I want to write a front-end.
  3. I want to go to a knowledge base and get the information I need.
  4. I'm presented information that I have no realistic way of knowing is false, and I bang my head against the inaccuracy for hours.
  5. You want me to then continue to do uncompensated labor by summarizing and reporting what didn't work, and then search even harder for the 'correct' answer so I can explain what exactly was wrong about it. BECAUSE I AM NOT AN EXPERT TO KNOW OFF THE TOP OF MY HEAD.
  6. By then my ADHD has gotten frustrated with it and says "web development is too hard" and I've now had another negative experience that makes me even more unlikely to stick my head out of the back end in the future.

And all because you have blindly followed "industry practices" that basically abdicate your role in deciding what's best for your interface with your userbase.

How many "bad AI answer" issues are going to have to be filed for you to recognize that even a single one is too many? You're supposed to be making the web easier to work with, not adding new hurdles and sources of misinformation.

Ultrabenosaurus commented 12 months ago

@kyanha and @nicuveo I have referenced your comments in my new Discussion thread for the Community Call later today - let me know if you'd like me to remove it.

obfusk commented 12 months ago

AI Explain will not be reintroduced to MDN until we are confident that it delivers reliable information that both readers and contributors can trust.

Then you should not be using an LLM. It's a chatbot. It cannot deliver reliable information no matter how much you want it to or how helpful it may appear to be. Because it's not a machine for doing that. It's a "'say something that sounds like an answer' machine".

Just as a human might (as an expert or a peer in a learning community) give an incorrect response, it is still ultimately useful as it unblocks, gives our users ideas, and points them to something relevant.

But it doesn't just give "incorrect" responses the way a human might if they simply forget or misremember something. It doesn't lie either. It's a model for creating plausible human-sounding conversations. When it produces a "response citing a non-existent paper with a plausible title, using a real journal name and an author name who's written things related to your question", this is not a bug. It is doing exactly what it's meant to do, which is not "deliver[ing] reliable information", no matter how much you want to use it to do that because it seems like it does a lot of the time.

dhasenan commented 12 months ago

The video call today was big on the idea that incorrect information can still be helpful. Can we get concrete examples of this? Even anecdotes would be better than a bare assertion.

caugner commented 12 months ago

The video call today was big on the idea that incorrect information can still be helpful. Can we get concrete examples of this? Even anecdotes would be better than a bare assertion.

We posted two examples further up in the thread, but I'm happy to take suggestions for some reasonable questions that should receive helpful answers from AI Help.

PS: It's surprising to see how quickly folks jump on giving this comment a 👎, when all I'm doing here is pointing out two examples (that were criticised for supposedly being cherry-picked) and at the same time asking for suggestions specifically to avoid the criticism of cherry-picking questions. 🤷

nicuveo commented 12 months ago

help me there: how do you reconcile those two statements?

i am gonna be very honest: i will never rely on a technical reference whose official position can be summarized with "incorrect information can still be helpful".

caugner commented 12 months ago

@nicuveo I'm not sure if you've been on the Community Call, but we made it very clear in that call that we don't use AI to generate reference documentation, and we never will.

AI Help (beta) provides an interface to consuming information on MDN, and we make it very clear to set the expectations right for users of the feature by putting "AI" in the feature's name and having prominent disclaimers. There is hopefully no way for a user to confuse an answer of AI Help with reference documentation.

AI Help (beta) prioritises MDN content to answer questions, provides sources and usually rejects questions it cannot answer with MDN content, and that's a big difference compared to tools like ChatGPT.

resuna commented 12 months ago

People are downvoting your answer because it seems like a non sequitur; it's really difficult to see how your two examples demonstrate how incorrect information can be helpful.

Do you have a transcript for the video meeting?

obfusk commented 12 months ago

delivers reliable information that both readers and contributors can trust

make it very clear to set the expectations right for users of the feature by putting "AI" in the feature's name and having prominent disclaimers

So... you're admitting you can't actually deliver reliable information and believe that adding a disclaimer that people are "free to fully ignore" the LLM's output (leaving aside they have no way of knowing when to do that) will somehow make things right?

"You don't have to believe it" is just about the worst possible response to people worried about misinformation.

There is hopefully no way for a user to confuse an answer of AI Help with reference documentation.

That's not the issue here though. The issue is that the LLM answers are not reliable, cannot be made reliable because it's a chatbot not a "reliable answer machine", and that you have no way to avoid users confusing a wrong answer for a helpful one. You still have not addressed this AFAIK.

My apologies if I seem irritated at this point, but AFAIK there has been no response to these concerns. I was unable to attend the video meeting and I would love to hear they were addressed there but somehow I doubt it.

nicuveo commented 12 months ago

@caugner no, sadly, i wasn't able to attend.

but your response is fascinating. if you allow me to rephrase, here is what i understand from it: you know the LLM can't be trusted with writing reference documentation, so you restrict its use to answering questions and finding sources; but you still don't trust it to accurately do that, so you include heavy disclaimers to users that they shouldn't rely on those results. and you hope that users won't be confused.

and wow, this raises many questions, as well as confirming some of the worries we already mentioned in this issue:

in short: this basically answers none of the questions that have been asked in this long issue.


on a personal note, i continue to be baffled by the choices being made here. LLMs are simply the wrong tool for the job, and i can't understand how any of this was greenlit. and to reiterate my point: the callous attitude displayed throughout this entire ordeal reflects poorly on the MDN as a whole, and even if none of the actual content is currently generated with an LLM, you have seriously damaged user trust with this project.

caugner commented 12 months ago

As announced in the Community Call invitation, we're sharing our answers for anyone who couldn't attend. We'll be adding them in the individual GitHub Discussions threads.

develleoper commented 12 months ago

So far, the examples presented have some genuine benefits; however, none of these outweigh or undo the harm posed to users unable to independently verify the accuracy of an LLM's fabrications.

Given that any use of an LLM to "summarize content" † inherently compromises the guarantee of its accuracy, hosting any interface to do this within that documentation decreases the amount of the site which users can trust to be accurate from 100% to less than 100%.

I believe this is a question of product direction: does Mozilla intend for MDN's core identity to be a trusted industry reference?

† Produce plausible-sounding responses to queries for that content that are not guaranteed to be accurate. Also, as I type this I see the latest reply; thank you, I very much appreciate the transcript.

obfusk commented 12 months ago

@caugner is there an overview of those relevant discussion threads so we don't have to look for them ourselves?

caugner commented 12 months ago

@caugner is there an overview of those relevant discussion threads so we don't have to look for them ourselves?

Sure: https://github.com/orgs/mdn/discussions/categories/community-calls

PS: Also added the link to my previous comment now.

Xkeeper0 commented 12 months ago

I risk stating the obvious here, but there seems to be a deep, fundamental misunderstanding going on. MDN (and its representatives here) seem to think that if they just do more, they can somehow make this palatable to MDN's users. And on the other side, MDN's users are very clearly saying "we do not want this, at all."

It's like going to a 100% vegan restaurant and finding out that they're now serving beef. No amount of dressing it up, changing the ingredients, or otherwise trying to come up with some scheme to make it palatable is going to work, because it is a fundamental misunderstanding of the problem.

At the end of the day, almost every person who has spoken up against this feature has made it abundantly clear they do not want it, they want no part of it, and they find the mere action of bringing it in abhorrent. Meanwhile, MDN continues to plug its ears and pretend that if they can just tweak it a little more, they can fix the fundamental problems, this time it will work.

MDN has convinced itself that, because everyone else is drinking the AI poison, they should too, and no amount of evidence or outcry seems able to knock them out of this position.

Be-ing commented 12 months ago

https://developer.mozilla.org/en-US/blog/ai-explain-postmortem/

We also saw that other developer-focused companies were investing significantly in and building products on top of the technology. With MDN's documentation being publicly available under an open Creative Commons license, and considered best-in-class, it's reasonable to assume most models have been trained on our content, and products started to be announced which explicitly allowed the consumption of our documentation. This led us to understand that, irrespective of our personal feelings, our users already access our content through generative AI.

This could be an Onion article headline: "AI only built into things by the only people who believe it is inevitable to build AI into things"

With the support of our community, we are uniquely positioned to learn how AI can augment documentation and how developers use it. Our extensive experience, a willingness to learn, and our network of contributors and subject matter experts place us in a great position to refine this functionality through iteration and feedback.

This is insulting. You do not have the support of the community. You are not positioned to learn; with every blog post and comment here, Mozilla repeatedly demonstrates a refusal to learn from the community.

ghost commented 12 months ago

I am sorry to add a basically empty comment here, but simple reactions are not enough--I cannot stress enough how correct many, many respondents here are with their commentary. I will never again rely on a resource that actively takes a stance that "inaccurate information can still be useful" -- it is not and cannot be.

The postmortem shows that the stance of Mozilla is going to be doubling down on this nonsense, which is quite concerning for the future.

Ultrabenosaurus commented 12 months ago

@Xkeeper0 @fournm I watched the community call yesterday and Leo did a great job of explaining MDN's stance on the LLMs presenting inaccurate information. Essentially, what I took from his explanation was that:

People are already willing to use things like old Stack Overflow answers or ask ChatGPT directly themselves while inventing their own prompts, knowing both of these things could be out of date or plain wrong, because technical documentation is simply not accessible or approachable for novice / junior developers (and people who have difficulties reading in general).

In that case, MDN is almost uniquely positioned to provide the least inaccurate LLM output for simple explanations of technical content, due to the vast amount of high-quality documentation and active community of subject experts keeping on top of it.

MDN docs must be correct but may not be helpful due to technical wording; these "AI" tools are designed to be "helpful" (see: written in a more approachable, natural manner) without actually editing the MDN docs to reduce accuracy for simplicity.

Because the tools have "AI" in their names and (now) disclaimers about accuracy, people will treat their output in the same non-authoritative manner as old Stack Overflow answers, and use the output to experiment and learn rather than treat it with the same level of trust as experienced developers have for MDN documentation itself.

I don't agree with that stance, but I can see where MDN are coming from and Leo did a great job of explaining it.

Back on track, that explanation from Leo really nailed down the fundamental misunderstanding and the real crux of the conflict between us and MDN on the issue:

Those stances are simply not compatible with each other.

A possible solution could be using a scaled-back version of AI Help to improve search results for natural-language input, then including full-scale AI Help and something like AI Explain as a separate platform that does not display LLM output directly on MDN pages and within MDN documentation.

Alternatively, the LLM generated output could be reviewed by subject matter experts and, if approved like any other content change submission, be added to the page permanently as a "simple summary" block. That would drastically mitigate the risk of an LLM sometimes generating correct content and sometimes generating incorrect content for the same question / documentation page.

Will MDN be open to that sort of compromise? I don't know, but they did specify at the start and end of the call that they will not be reintroducing AI Explain this year, though they did make it clear they will still be working on it to create a usable, community-acceptable version and there was no mention of removing or notably modifying AI Help.

querkmachine commented 12 months ago

The "old Stack Overflow answers" point is irrelevant because anyone can go to that answer, comment that it's wrong or outdated, and provide a better or more contemporary answer in a way that is visible to both the original question asker and anyone who stumbles upon it in future.

We cannot do that with the LLM tool. We cannot point out when it is wrong or outdated in a way that is visible to others, nor would that seem to be possible, as the responses it gives have no guarantee of being consistent from user to user.

This technology just isn't mature enough to be trustworthy without manual human intervention or validation, it certainly shouldn't be used for documentation that is intended to be authoritative.

snoozysoft commented 12 months ago

imagine a world where the feedback from the person who is within the demographic they're targeting wasn't ignored despite writing a very impassioned speech

anyway, how about I just reiterate my previous idea; if you want to attract novices to the point of having an "explain like I'm 5" button, how about you, I don't know, have actual people write it because then it can be assuredly accurate. it is not that hard!!

joelanman commented 12 months ago

Just posting in agreement of all the very obvious and clear issues everyone has raised here. More fundamentally this feature launch and the subsequent response from Mozilla/MDN has hugely damaged my trust in the org and the choices it makes. This is such a shame when I previously held MDN in high regard as a trustworthy place to get information. It's so hard to build that kind of reputation and trust, and this episode is very much damaging it.

acdha commented 12 months ago

We are listening and we intend to listen even more moving forward. In case you haven't seen it yet, we published a postmortem a couple of hours ago.

This is informative but also illustrates the perils of focusing on the technology rather than the users’ needs. This project was started right when the current AI marketing push was shifting into high gear, and a recurring theme in that write-up is the unquestioned assumption that the technology needs to be used, rather than any examination of whether it's actually useful. One part was especially telling, being completely in line with the marketing rhetoric around trimming or deskilling jobs:

Therefore, one method we tried was to run explanations created by gpt-3.5-turbo-0301 for each of the 25,820 unique code examples on MDN and validate the output against GPT-4. The summary of the responses from this experiment tagged generated explanations as either accurate, somewhat inaccurate, or incorrect. We randomly sampled responses that were not considered high-quality and manually inspected and evaluated them.

The obvious fatal flaw in this approach is that you can’t ask a system with no understanding to evaluate whether another system with no understanding correctly understood something. This is the kind of design you use if you’re trying to generate SEO filler content or minimize your spending on customer service but not if your goal is authoritative documentation. All of the responses needed to be reviewed, especially the ones ranked highly, and not just casually but by domain experts and editors – in this thread we’ve already had multiple examples provided in support of this feature which upon closer examination turned out to contain inaccuracies or, characteristically, simply not be very helpful to anyone in the target audience.
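For concreteness, the evaluation loop that paragraph describes presumably amounts to something like the following (a hypothetical sketch -- generateExplanation() and gradeWithGpt4() are stand-ins, not MDN's actual code); the structural flaw is that anything the grading model labels "accurate" never reaches a human:

// Hypothetical sketch of the "validate GPT-3.5 output with GPT-4" approach described in the postmortem.
async function evaluateExamples(codeExamples) {
  const flaggedForHumanReview = [];
  for (const example of codeExamples) {
    const explanation = await generateExplanation(example);    // generated with gpt-3.5-turbo-0301
    const verdict = await gradeWithGpt4(example, explanation);  // "accurate" | "somewhat inaccurate" | "incorrect"
    if (verdict !== "accurate") {
      flaggedForHumanReview.push({ example, explanation, verdict });
    }
    // Anything GPT-4 calls "accurate" is accepted without human inspection.
  }
  return flaggedForHumanReview; // only these get sampled for manual review
}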

yoe commented 12 months ago

Allow me to take a step back and wonder what the actual problem is that MDN is trying to solve here. I can see multiple possibilities, and only some of these could benefit by using an LLM:

  1. Our documentation is too difficult to understand for novices.

    Then the right answer is not to use a gibberish generator; instead the right answer is to rewrite the documentation so that it becomes easier for novices to understand.

  2. We don't have the budget and/or manpower to do what needs doing, and this does it cheaply

    Ask for donations and/or more contributions. I don't have the time to help, but would be happy to put in some cash if that'd improve the situation.

  3. Our documentation is difficult to search

    Using an LLM to better interpret what the user is wanting to find is not a bad idea, as long as you don't use the gibberish generator of an LLM to try to be "helpful".

  4. Other people are using AI, we should too

    That is not a problem that needs "fixing".

  5. Our code examples are difficult to understand if people look at them in isolation

    So make it easier for people to look up the bits that they don't understand. If I could go from an example that uses CSS selectors immediately to the explanation of the CSS selectors with a click or two (or a mouse hover, or whatever else might be useful -- the point is to make it easily accessible), then that would make things a lot easier to understand.

  6. We don't have the budget and/or manpower to do that

    See point 2.

I think everyone agrees that the goal of trying to improve MDN in and of itself is laudable, and it does make sense to look at modern technologies to see how they can help do that; however, I think in this particular case, you may have been looking for a problem to fit a solution, rather than the other way around.

nicuveo commented 12 months ago

Thanks to @acdha for pointing out this incredible paragraph i had missed in the postmortem:

With regard to automated testing, a test suite is a challenge to build for a feature of this type due to the non-deterministic output of an LLM. Therefore, one method we tried was to run explanations created by gpt-3.5-turbo-0301 for each of the 25,820 unique code examples on MDN and validate the output against GPT-4. The summary of the responses from this experiment tagged generated explanations as either accurate, somewhat inaccurate, or incorrect. We randomly sampled responses that were not considered high-quality and manually inspected and evaluated them.

(I am going to try to remain professional in this message, but i have to disclose that i am currently torn between screaming at my screen or laughing until i pass out.)

I have two major problems with this. The first and most obvious is that an LLM is not a good judge of the accuracy of the output of another LLM. As stated multiple times, LLMs predict text: they do not have a theory of mind, they do not reason about their input. Therefore, classifying the answers in such a way is at best extremely unreliable, and at worst entirely meaningless: this method of automated testing compounds errors between LLMs, which are already unreliable tools to start with. The resulting classification was very unlikely to properly reflect whether the generated response was accurate and helpful or not.

My second problem with this approach is that the review process, as described, only focused on the answers that were labelled as "inaccurate" by the LLM. This means that all answers classified as "accurate" were implicitly accepted as correct: the team behind this effort uncritically trusted the output of GPT-4 on that matter. This discredits the claim that "users will know not to blindly trust the results of a tool labelled as AI", when the team behind those features is guilty of doing exactly that.

This methodology means the results that were obtained were basically meaningless; the claim that the results were "helpful in enough cases" is therefore baseless. Those results should never have been used to justify a release. I would argue that this paragraph, on its own, discredits the entire project.

ghalfacree commented 12 months ago

Not to muddy the water or anything, but there's a further issue with both AI Help and AI Explain which doesn't seem to have been touched on yet: it's computationally, and therefore environmentally, incredibly expensive.

As others have pointed out, a big problem is that every time a user clicks on AI Help it sends a new request to the LLM API and generates a new response. These responses are not, to the best of my knowledge, cached: you click the button, the prompt is sent to the API, the LLM does its autocomplete-on-steroids parlour trick, and the response comes back.

So.. why? The bulk of the documentation on MDN does not change on a minute-by-minute basis. If I ask AI Help for assistance with a given topic today, any correct answer it gives me (by chance, rather than design) will also be correct tomorrow in the overwhelmingly vast majority of cases.

But what's happening here is that a static resource, written by humans, is being processed at great computational expense into something which is then immediately discarded... every single time someone clicks the button.

Luccioni et al. (2022) estimate the carbon footprint of BLOOM, a 176-billion-parameter LLM created by Hugging Face, at 24.7 tonnes of carbon dioxide equivalent on training energy alone. After deployment, the model drew 914 kWh of electricity across 18 days of usage at an average of 558 requests per hour.

If we assume those measurements hold true for MDN, and usage of AI Help doesn't exceed an average of 558 requests per hour, then you're burning 1.5 MWh of electricity every single month just to deliver constantly-shifting unvetted content which "may" be useful. That's on top of the current energy usage for delivering the actually-useful MDN content people come here to see.

That, to me, is ridiculous. 1.5 MWh a month, every month, of direct consumption - not counting the energy used by your API provider in training, nor idle drain. Just in usage directly triggered by MDN.
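For what it's worth, here is the back-of-the-envelope arithmetic behind that figure, assuming the Luccioni et al. deployment numbers transfer directly and a 30-day month:

// Rough check of the ~1.5 MWh/month figure quoted above.
const inferenceKWh = 914;       // electricity drawn by BLOOM over the measurement window
const measuredDays = 18;        // length of that measurement window
const kWhPerDay = inferenceKWh / measuredDays;  // ~50.8 kWh/day at ~558 requests/hour
const kWhPerMonth = kWhPerDay * 30;             // ~1523 kWh, i.e. roughly 1.5 MWh
console.log(`${kWhPerDay.toFixed(1)} kWh/day, ${kWhPerMonth.toFixed(0)} kWh/month`);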

nyeogmi commented 11 months ago

Hey, popping back in and saying that I enjoyed being accused of acting in bad faith for using AI Explain examples to demonstrate problems with AI Help. They're based on the same model and dataset and they're about equally prone to inaccuracy based on all the comments in this thread, but it was nice when caugner accused me of being deliberately dishonest and manipulative after I said "hey, my real life is on fire and I can't respond." Definitely do more of that. Thanks.

(TL;DR: Keep my name out of your mouth, I'm done. Have fun building the future.)

nyeogmi commented 11 months ago

Half-tempted to close this issue just so that people will complain about this somewhere other than the back alley of a GitHub repo no one reads. We can bully whatever random MDN dev draws the short straw and winds up here as much as we want, and we'll end up receiving an insincere apology which we will accept because -- guess what -- being bullied on GitHub sucks ass and shouldn't happen to anyone, including people who have decided "on this particular day I will go down on the ship."

Like, nothing is going to get done if we continue making life hell for people who have literally no power, even if those people are infuriatingly obtuse and unreceptive to criticism. I am not saying you should not be angry but you should stop being angry at the person who is being paid between five and six figures to be a punching bag between you and Steve Teixeira's terrible decisions. Why are we here of all places? This is pointless.

nyeogmi commented 11 months ago

Also:

My second problem with this approach is that the review process, as described, only focused on the answers that were labelled as "inaccurate" by the LLM. This means that all answers classified as "accurate" were implicitly accepted as correct: the team behind this effort uncritically trusted the output of GPT-4 on that matter. This discredits the claim that "users will know not to blindly trust the results of a tool labelled as AI", when the team behind those features is guilty of doing exactly that.

This is laughable and you all would have been fired if you weren't doing power's work for it. As it stands, you'll probably get a raise.

If you don't get one, then your dignity is very, very cheap.

akierig commented 11 months ago

incorrect information can still be helpful

this sentence encapsulates what made me decide to cancel my MDN Plus subscription.

As a librarian, I feel very strongly about information being well organized and accurate. Wedging an LLM into a project that I had, until now, frequently suggested to others makes me very sad. I rely on MDN a lot and I felt good about supporting it and suggesting it to others.

Truly breaks my heart to watch Mozilla crumble in real time.

tanepiper commented 11 months ago

This whole debacle is making W3Schools more useful than MDN. That's embarrassing.

meejah commented 11 months ago

Since we apparently want "AI" in everything, here is why ChatGPT-3.5 thinks it would be a bad idea to use ChatGPT4 to interpret or produce technical documentation [*]:


As a seasoned web developer, I can provide several reasons why relying solely on a language model like ChatGPT4 for technical documentation is not a good idea:

  1. Lack of Contextual Understanding: While language models like ChatGPT4 are impressive in generating human-like text, they lack contextual understanding. They may generate plausible-sounding explanations, but they might not accurately comprehend the intricacies and nuances of technical concepts. This can lead to misleading or incorrect information being provided.

  2. Incomplete or Outdated Information: Language models are trained on vast amounts of data, but they might not have access to the most up-to-date information. Technology and programming languages evolve rapidly, and relying solely on a language model might result in outdated or incomplete documentation. Developers need accurate and current information to ensure their work is efficient and secure.

  3. Limited Problem-Solving Abilities: Technical documentation often involves troubleshooting and problem-solving. While ChatGPT4 can generate responses, it lacks the ability to actively debug or diagnose issues. Developers need guidance that goes beyond generic explanations and includes specific solutions to common problems, which a language model might not be able to provide.

  4. Lack of Domain-Specific Knowledge: Technical documentation requires domain-specific knowledge and expertise. It involves understanding the underlying technologies, frameworks, and best practices. Language models like ChatGPT4 are trained on a wide range of topics, but they may not possess the deep understanding required to provide accurate and reliable technical documentation.

  5. Absence of Real-World Experience: Technical documentation should ideally be written by experienced developers who have hands-on experience in implementing the concepts they document. Language models lack real-world experience and practical insights, which are crucial for providing meaningful and effective documentation.

  6. Inability to Adapt to User Needs: Developers often have specific requirements or constraints that need to be considered while working with technical documentation. Language models like ChatGPT4 generate responses based on general patterns and examples, but they might not be able to adapt to individual user needs or provide tailored solutions.

While language models like ChatGPT4 can be useful for generating text and providing general information, they should not be relied upon as the sole source of technical documentation. Developers should seek well-curated, accurate, and up-to-date documentation created by experts in the field to ensure they have the necessary guidance for their work.

[*] -- warning, this was produced by a LLM and may not be accurate.

Be-ing commented 11 months ago

This is pointless.

I agree. These wall-of-text posts aren't going to get them to change. So, take it out of their control. If I did web development more than on rare occasions, I'd already be using the content and code to setup a fork of MDN on a new domain with all this LLM nonsense removed.

faintbeep commented 11 months ago

Sorry to be another voice chiming in, but it seems the thread is lacking examples of bad responses from AI Help, leading to a dismissal of the issues as being only related to AI Explain. So here's a very bad response I got from AI Help about reacting to element size changes:

To detect changes in the size of an element, you can use the MutationObserver API in JavaScript. Here's an example of how you can use MutationObserver to detect size changes of an element:

// Select the node that will be observed for size changes
const targetNode = document.getElementById("your-element-id");

// Options for the observer (which mutations to observe)
const config = { attributes: true, attributeFilter: ["style"] };

// Callback function to execute when size changes are observed
const callback = (mutationList, observer) => {
  for (const mutation of mutationList) {
    if (mutation.type === "attributes" && mutation.attributeName === "style") {
      console.log("The size of the element has changed.");
      // Perform your desired actions here
    }
  }
};

// Create an observer instance linked to the callback function
const observer = new MutationObserver(callback);

// Start observing the target node for size changes
observer.observe(targetNode, config);

// Later, you can stop observing
observer.disconnect();

In this example, we select the element that you want to observe for size changes using getElementById. Then, we define the options for the observer, specifying that we want to observe changes to the style attribute of the element.

The callback function is executed when size changes are observed. It checks if the mutation type is attributes and if the attribute name is style. If both conditions are met, it logs a message indicating that the size of the element has changed. You can perform your desired actions within this callback function.

Finally, we create an instance of MutationObserver linked to the callback function and start observing the target node for size changes. When you no longer need to observe size changes, you can call disconnect() on the observer to stop observing.

Remember to replace "your-element-id" with the actual ID of the element you want to observe for size changes.

MDN content that I've consulted that you might want to check:

MutationObserver MutationObserver: MutationObserver() constructor

This is obviously wrong in every way, and I'd be surprised to see someone defend it as inaccurate but useful.

Full disclosure, I deliberately tricked the LLM by asking how to use MutationObserver for this purpose. But IMO that's a question a confused beginner is likely to ask, and the documentation should correct them rather than hallucinate a world in which they are correct.
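For comparison, an accurate answer would have pointed the beginner at ResizeObserver, which exists precisely for reacting to element size changes; a minimal sketch:

// ResizeObserver fires when an element's rendered size actually changes,
// which is what the original question was asking for.
const observedNode = document.getElementById("your-element-id");

const resizeObserver = new ResizeObserver((entries) => {
  for (const entry of entries) {
    const { width, height } = entry.contentRect;
    console.log(`The element is now ${width}px by ${height}px`);
  }
});

resizeObserver.observe(observedNode);

// Later, stop observing
resizeObserver.disconnect();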

FinalStaticFox commented 11 months ago

What is even the point of this feature? If I wanted to ask ChatGPT for explanations, I would just... you know... go on OpenAI's website and do it. With all the precautions it comes with, like staying skeptical of all output because of the non-negligible risk of it being confidently wrong.

But so far I never even felt the need to ask a LLM about MDN documentation, because it is well written and sufficient. So, at best this AI help button is useless, and at worst it is harmful, because of the risk that someone might end up misinformed by the output.

Also the idea that "incorrect information can still be helpful" is asinine. This is technical documentation, not Fox News.

DoubleHyphen commented 11 months ago

So let me get this straight… the objections to the original post were that the answers it outputs are correct -- for the most part?

yoe commented 11 months ago

@faintbeep,

Full disclosure, I deliberately tricked the LLM by asking how to use MutationObserver for this purpose. But IMO that's a question a confused beginner is likely to ask, and the documentation should correct them rather than hallucinate a world in which they are correct.

This is actually a generic problem with ChatGPT: if you ask it something that is impossible, it simply cannot tell you that what you ask is impossible; instead, it will hallucinate a world wherein the thing you ask for is in fact possible and then come up with an overly elaborate answer, with full code examples and everything, but it will never work because it's not possible and it does not have the ability to tell you this.

I don't know whether this is a generic problem with LLMs or a specific problem with ChatGPT, but on all the interactions that I've had with it, I've never seen it tell me that a thing is impossible, and believe me, this was not for lack of trying.

In other words, ChatGPT is an XY problem amplifier. You want to do something with an API that wasn't made to do the something, you ask the tool in MDN how to do that, it will hallucinate some gibberish for you that makes it sound like it's possible, and now you're stuck even further in your XY problem.

This is not something MDN should be doing, ever, but it does, both with AI Help and with AI Explain.