mdn / yari

The platform code behind MDN Web Docs

The AI help button is very good but it links to a feature that should not exist #9230

Open nyeogmi opened 12 months ago

nyeogmi commented 12 months ago

Summary

I made a previous issue pointing out that the AI Help feature lies to people and should not exist because of potential harm to novices.

This was renamed by @caugner to "AI Help is linked on all pages." AI Help being linked on all pages is the intended behavior of the feature, and @caugner therefore pointed out that the button looks good and works even better, which I agree with -- it is a fantastic button and when I look at all the buttons on MDN, the AI Help button clearly stands out to me as the radiant star of the show.

The issue was therefore closed without being substantively addressed (because the button is so good, which I agree with).

I think there are several reasons the feature shouldn't exist which have been observed across multiple threads on platforms Mozilla does not control. Actually, the response has been universally negative, except on GitHub where the ability to have a universally negative response was quietly disabled Monday morning.

Here is a quick summary of some of those reasons.

One: the AI model is frequently wrong. Mozilla claims it intends to fix this, but Mozilla doesn't employ any GPT-3.5 developers, and OpenAI has been promising to fix it for months. It's unlikely this will actually happen.

Two: contrary to @caugner's opinion, it's very often wrong about core web topics, including trivial information where there is no obvious excuse. Here are some examples:

Even examples posted by people who support the existence of the AI contain significant errors:

(I say examples, but note: this is the only usage example provided by a person who supported the existence of the feature, and it contained an error.)

This is identical to one of the categories of problem seen on StackExchange when StackExchange introduced its generative AI assistant based on the same model, and it led to Stack removing the assistant because it was generating bizarre garbage.

Three: it's not clear that any documentation contributors were involved in developing the feature. Actually, it's still unclear who outside of @fiji-flo and @caugner was involved in the feature. Some contributors, including @sideshowbarker, have now objected, and the process has produced a default outcome: AI Explain was voluntarily rolled back and AI Help remains in the product.

It is probably OK for those contributors to review each other's code, but they're also managing the response to the backlash. After a bunch of people have already signaled "hey, I have an active interest in this feature" by engaging with a relevant issue, excluding those people reflects that a ruling of "actually, you do not have an active interest!" has been reached, and it's not clear what basis that ruling would have been reached on.

Four: the existence of this feature suggests that product decisions are being made by people who don't understand the technology or who don't think I understand it.


Overall, the change tells the story that MDN doesn't know who their average user is, but assumes that the average user is (1) highly dissimilar to the GitHub users who were involved in the backlash and (2) easy to sell to.

The fact is that in one day, measured in upvotes, you attracted backlash comparable to what the entire StackOverflow strike attracted in a month. It would be a mistake to think only a small group of people is concerned; that attitude would be wishful thinking.

It seems like the fork in the road for MDN is:

If option 1 isn't sustainable, then between option 2 and option 3, option 3 is obviously better for humanity in the long-run and I would encourage MDN to make plans for its own destruction.

In the worst possible world, the attitude is correct and the users are easy to sell to. Well, in that case, you've created another product company and in doing so you've metaphorically elected to serve both God and money -- and as is evidenced by the recent implosions of every siloed social media company, that is always a great idea.


Again, the AI Help button is absolutely gorgeous and functions as intended. This issue is not about the AI Help button and therefore should not be closed as a button-related wontfix, or renamed by @caugner into a description of the behavior of the button.

URL

https://github.com/mdn/yari/issues/9208 https://github.com/mdn/yari/issues/9214

Reproduction steps

Pivot to a more aggressive funding model, then engage in a mix of panic and corporate groupthink.

Expected behavior

I think the button is amazing and you are doing a great job.

Actual behavior

The AI help feature should not exist.

Device

Desktop

Browser

Chrome

Browser version

Stable

Operating system

Windows

Screenshot

image

Anything else?

No response


NoraCodes commented 12 months ago

I just want to agree with this report wholeheartedly. The use of large language models to offset labor is problematic enough, but doing so when those LLMs do not even consistently produce reasonable or correct output is utterly unconscionable. MDN is supposed to be a simple, authoritative source for the web platform; with the introduction of "AI Help", you're throwing that reputation away. I never would have imagined I'd be recommending w3schools over MDN to new programmers, but that's where we are today.

I'm a long-time Firefox user. I've worked with Mozillians in the past, including on the 2nd edition of Programming Rust. I know you're decent people; do the right thing and ditch the AI bullshit.

sideshowbarker commented 12 months ago

To provide some context here about the relationship of OWD to MDN and about my own role in all this:

OWD funds the work of a group of writers, whose names you can find at https://openwebdocs.org/team/#writers — and the funding for OWD itself is organized through an Open Collective, which has a formal Team, the names of whose members you can find under the Team tab at https://opencollective.com/open-web-docs#section-contributors.

While I am among the 150+ individual people who have donated to OWD, I am neither formally one of the OWD writers nor formally one of the OWD Team members.

To be clear on my actual role: I’m one of the core reviewers/maintainers who have push/merge access to the https://github.com/mdn/content/ repo (the content of MDN) doing reviews of incoming PRs and otherwise contributing to the repo. The set of core reviewers/maintainers includes the OWD writers, but it also includes some writers who work for Mozilla, and includes me and some others who are neither formally OWD writers nor writers from Mozilla.

See https://github.com/orgs/mdn/teams?query=@sideshowbarker for the list of relevant GitHub teams I belong to, and https://github.com/mdn/content/pulls?q=reviewed-by:sideshowbarker for the reviews I’ve done (3858 so far) and https://github.com/mdn/content/graphs/contributors to see my own commits (and those of other contributors).

And FWIW here I’ll mention that I actually also have push/merge access to the Yari repo at https://github.com/mdn/yari/, which has the source code for the platform on which MDN runs — including code for things like the “AI Explain” button, but also code for all kinds of good things that aren’t controversial at all.

I am not a core Yari reviewer/maintainer, but I have actually done reviews there (20 so far), as shown in https://github.com/mdn/yari/pulls?q=is:pr+reviewed-by:sideshowbarker — in cases where it has made sense for me to review — and commits (42 so far), as shown in https://github.com/mdn/yari/commits?author=sideshowbarker.

resuna commented 12 months ago

I do not believe there is currently a theoretical framework for making statistical text generation distinguish the truth of a statement, so there is no likelihood of this being fixed by any anticipated development based on the current technology.

nyeogmi commented 12 months ago

@sideshowbarker I think all your changes are fair and match up with places where I was irritated when writing the issue to the point of including content I retrospectively wouldn't defend. I don't want to make you look like you're replying to nothing, though. Would it be alright with you if I edited the original post not to include any of the contentious comments?

Xe commented 12 months ago

I'm not sure if this is the right implementation, but the idea is on the right track. A better way to implement this with large language models, should that be desired, would be to have an AI generate a bunch of options for knowledgeable human technical writers to pick from and then form a composite, ensuring that the answers are technically accurate while still using large language models to assist the creative flow.

I think that the current implementation is wishful thinking at best, and I am sad to see such a critical web resource fall prey to hype cycles that cut out the best part of MDN: the technical writers. Hopefully what I proposed is a viable middle path.

nyeogmi commented 12 months ago

I do not believe there is currently a theoretical framework for making statistical text generation distinguish the truth of a statement, so there is no likelihood of this being fixed by any anticipated development based on the current technology.

As far as I can tell, the framework AI Help is using is described here. Basically, it feeds the posts to a search engine and then uses the search engine to make sure at least one relevant MDN doc can be surfaced to the LLM before it outputs anything.

The idea seems to be "well, AIs are better at summarizing text and doing question/answer tasks about specific passages than they are at answering questions off the cuff," which I think is probably true.
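
If I'm reading that description right, the flow is roughly something like this (a hypothetical sketch of the retrieval-gated idea only; the helper names are made up and this is not Yari's actual code):

// Hypothetical sketch of the retrieval-gated flow described above.
// searchMdnIndex() and askLlm() are invented placeholders, not real Yari APIs.
async function aiHelp(question) {
  // 1. Run the question through a search index of MDN content.
  const docs = await searchMdnIndex(question);

  // 2. If no relevant MDN page surfaces, refuse to answer at all.
  if (docs.length === 0) {
    return "I can't answer that.";
  }

  // 3. Otherwise, ask the LLM to answer using only the retrieved pages as context.
  const context = docs.map((doc) => doc.excerpt).join("\n\n");
  return askLlm(`Using only this MDN content:\n\n${context}\n\nAnswer this question: ${question}`);
}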

(Does this work? I don't know. When I tried the tool, I was trying to trick it, and it mostly just told me "I can't answer that," which I suspect means it was falling over at the search engine step.)

I would say this is actually really close to the model AI Explain used, so I would expect it to produce similar mistakes. From talking to Augner, it sounds like Augner doesn't believe any examples taken from AI Explain are representative of likely weaknesses in AI Help, which is surprising to me, but that appears to be their current position.

Overall, I think an affirmative case for "AI would be good at this task" is still missing. Augner wants an affirmative case that it won't work, I want an affirmative case that it will, so we're basically talking past each other.

fiji-flo commented 12 months ago

AI Help works very differently from AI Explain. We restrict the LLM to base its answers on the context we give it (which is actual, up-to-date MDN content). So you won't see the issues AI Explain was exhibiting.


A basic helpful answer (to the following question) would be:

How can I detect that I'm in offline mode?

To detect if you are in offline mode, you can use the navigator.onLine property. This property returns a boolean value, with true indicating that the browser is online and false indicating that it is offline.

Here is an example of how you can use navigator.onLine to check if you are in offline mode:

if (navigator.onLine) {
  console.log("Online");
} else {
  console.log("Offline");
}

In this example, if navigator.onLine returns true, it means that the browser is online, and the message "Online" will be logged to the console. If navigator.onLine returns false, it means that the browser is offline, and the message "Offline" will be logged to the console.

Please note that if the browser does not support the navigator.onLine property, the above example will always return false or undefined.

If you want to listen for changes in the network status, you can use the online and offline events. Here is an example of how to do that:

window.addEventListener("offline", (e) => {
  console.log("Offline");
});

window.addEventListener("online", (e) => {
  console.log("Online");
});

By adding event listeners for the offline and online events, you can execute specific code when the browser goes offline or comes back online.

MDN content that I've consulted that you might want to check:

Zarthus commented 12 months ago

@fiji-flo it looks like the links you've provided are broken. (maybe the urls were relative?)

My main concern is that AI Help does not have a place in technical documentation. Yes, in theory it could help out a few people, but the target audience it seems to aim for (new developers or someone unfamiliar with the concept it is trying to learn about) coupled with our current understanding and research about LLMs (in a nutshell; they can confidently present inaccurate information) seems to be a hugely concerning mismatch.

You need someone to fact-check the response from an LLM; a four-eyes principle (one writer and at least one reviewer) is often applied to technical docs, and that is missing from the LLM.

Therefore, there is a significantly increased risk that the LLM provides wrong information to someone not knowledgeable enough about the subject to tell whether the AI is confidently providing misinformation or is actually accurate.

How does the team behind AI Explain hope to alleviate this concern, beyond plastering the user with warnings (which might be a hint that this is not a product-market fit)?

caugner commented 12 months ago

Here's another helpful answer for the following question about a brand-new web feature:

How does oklch() relate to oklab()?

The oklch() functional notation is related to the oklab() functional notation as they both express colors in the Oklab color space. However, they differ in the way they represent colors.

While oklab() uses a Cartesian coordinate system with the a and b axes to represent colors, oklch() uses polar coordinates with the C (Chroma) and H (Hue) values. Both notations share the same L axis, which represents the perceived lightness of the color.

In oklch(), the C value represents the chroma or the intensity of the color, while the H value represents the hue or the angle of the color on the color wheel.

Using oklch(), you can express colors in the Oklch color space by specifying the L, C, and H values. This notation allows you to work with polar coordinates, which can be useful for certain color manipulations or when you want to represent colors in a different way than the Cartesian coordinates used in oklab().

Here is an example of using the oklch() notation:

[data-color="blue"] {
  background-color: oklch(50% 70 240);
}

In this example, the oklch() function is used to set the background color of an element to a blue color with a lightness of 50%, a chroma of 70, and a hue of 240 degrees.

Overall, oklch() provides an alternative way to represent colors in the Oklab color space using polar coordinates, while oklab() uses Cartesian coordinates.

MDN content that I've consulted that you might want to check:

Zarthus commented 12 months ago

@caugner I think it's a little bit impolite to respond to @nyeogmi with an almost completely automatically generated answer (not relevant to the subject). Would you like to respond to their message (the main content of the issue) as well please?

caugner commented 12 months ago

@Zarthus Both this and that comment respond to @nyeogmi who requested positive examples of answers produced by AI Help:

Overall, I think an affirmative case for "AI would be good at this task" is still missing. (...), I want an affirmative case that it will, so we're basically talking past each other.

Zarthus commented 12 months ago

@caugner: If that is the essence of what the contributors of AI Explain and AI Help have taken away from this issue, and is their official response to it, I shall pardon myself from this thread.

rileyinman commented 12 months ago

This is honestly quite embarrassing. I've been a vocal proponent of Mozilla, their products, and MDN for quite a long time. Seeing the consistent non-acknowledgment of perfectly valid, calmly laid out reasoning against this feature in its current state is disheartening. If Mozilla is set on its current path and will refuse to bend to criticism on this feature, at least do the service of outright saying so - then we can all stop wasting our time.

obfusk commented 12 months ago

I really really didn't want to be part of this discussion. But if people are worried about this feature producing convincing but inaccurate/wrong/misleading output (which LLMs are known to do), providing examples of correct output will not convince them. That only proves that the LLM is capable of being correct and useful (which I don't think anyone has disputed). Not that it is likely to be correct most of the time. Nor that it will not provide really bad results some of the time. Nor does it address the issue that users may not be able to tell these cases apart.

It's really easy to create an algorithm that produces correct output some of the time, or even most of the time, but that fails spectacularly in some (edge) cases. That may be acceptable if it's clear beforehand when it will fail, so that people can avoid the edge cases, or if it's easy to tell when it has failed. But algorithms are a lot more predictable than LLMs. You can usually at least prove they are correct under certain conditions. LLMs are much harder to predict. And we know that LLMs can "hallucinate" perfectly convincing but non-existent sources for their claims.

Even if the LLM produces accurate, useful, output 99% of the time, can I know whether the output I'm currently getting is in fact accurate without fact-checking it every time?

joepie91 commented 12 months ago

@Zarthus Both this and that comment respond to @nyeogmi who requested positive examples of answers produced by AI Help:

Overall, I think an affirmative case for "AI would be good at this task" is still missing. (...), I want an affirmative case that it will, so we're basically talking past each other.

My understanding is that they were requesting an affirmative case to be made for it being structurally good at this task, rather than providing an individual question that it managed to answer sufficiently accurately (which does not say much about structural fitness for the task).

caugner commented 12 months ago

@sideshowbarker Please stop hiding or deleting comments in this repository. Thank you!

eevee commented 12 months ago

Please note that if the browser does not support the navigator.onLine property, the above example will always return false or undefined.

what? if a property isn't supported, it will always be undefined. it would make no sense for a browser to specifically define navigator.onLine to only be false.

most of that answer is complete fluff, and more importantly it does not really answer the original question — because of exactly the problem that it struggles to raise. if you want to know for sure that you're in offline mode, you would have to check navigator.onLine === false, so you know you're not mistaking lack of support for being offline.
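
for example, something like this (just a sketch) would be the safer check:

// only an explicit `false` means "definitely offline";
// `undefined` would just mean the property isn't supported
if (navigator.onLine === false) {
  console.log("Offline");
}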

The oklch() functional notation is related to the oklab() functional notation as they both express colors in the Oklab color space. However, they differ in the way they represent colors.

...

Using oklch(), you can express colors in the Oklch color space...

so are they the same colorspace or not? this seems like the crux of the question, but the bulk of the response is rambling that rephrases parts of the linked articles (including repeated mention of the cartesian/polar distinction, which i doubt will help someone who isn't already visualizing a colorspace in their head), rather than a direct answer. it's mostly explaining oklch() and barely touching on oklab() at all.

a good direct answer would probably say that you just want oklch() if you're not already familiar with Lab's A/B axes and how they correspond to colors. instead we get "here's what blue looks like", without even explaining why it's blue.
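
(for what it's worth, the actual relationship is just a cartesian-to-polar coordinate change over the same L axis; roughly this, in js terms, as an illustrative sketch:)

// convert oklab(L a b) coordinates to oklch(L C H): L is unchanged,
// chroma is the distance from the a/b origin, hue is the angle around it
function oklabToOklch(L, a, b) {
  const C = Math.hypot(a, b); // chroma
  let H = (Math.atan2(b, a) * 180) / Math.PI; // hue angle in degrees
  if (H < 0) H += 360;
  return [L, C, H];
}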


but an LLM can't give an answer like that, because it doesn't understand context, or what common sticking points might look like. or anything at all. all it can do is babble, and hopefully not babble something that's incorrect.

but you can't ever be confident that it won't be wrong about some percentage of arbitrary questions. and if it is wrong, you can't directly correct it the way you might correct a static article. all you can do is keep feeding it more text and cross your fingers that it starts babbling more correctly, in an infinite game of whack-a-mole.

it might seem like i'm being nitpicky here. and i am — because these examples were specifically cherry-picked to defend the existence of the feature itself. they are the best case scenario. and they are, charitably, mediocre.

ultimately, if you create a chatbot (which you explicitly call "trusted"!) that can't really do much more than restate the contents of existing articles, and you're relying on the reader to sift through its rambling to find the actual information they asked for... then what was the point? they could just as well have sifted through the articles themselves to find the information they wanted, without risking that crucial details will get lost or twisted.

kyanha commented 12 months ago

I would say this is actually really close to the model AI Explain used, so I would expect it to produce similar mistakes. From talking to Augner, it sounds like Augner doesn't believe any examples taken from AI Explain are representative of likely weaknesses in AI Help, which is surprising to me, but that appears to be their current position.

Overall, I think an affirmative case for "AI would be good at this task" is still missing. Augner wants an affirmative case that it won't work, I want an affirmative case that it will, so we're basically talking past each other.

I think the affirmative case for Augner should be "there are many examples of incorrect information being provided already cited."

I'd like to read what precisely the proponents think it's going to help with.

sideshowbarker commented 12 months ago

@sideshowbarker Please stop hiding or deleting comments in this repository. Thank you!

Lest anyone else here be led to believe I hid or deleted any comments nefariously or something: Allow me to be fully transparent about exactly what I did actually hide and delete —

So, for the record here: The only comments I hid or deleted were completely innocuous cleanup of outdated comments related to updates that got made to the issue description. (See the remaining related comment at https://github.com/mdn/yari/issues/9230#issuecomment-1620783541.)

Specifically: I had posted a comment correcting some things that had been in the issue description, and there were some follow-up comments from the OP and another commenter about that — and then the issue description was subsequently updated based on my corrections.

So that update of the issue description rendered all those comments outdated and no-longer-necessary, and they were therefore amicably deleted by agreement with the OP — with the point being that keeping those comments hanging around would have just been noise that distracted from the substance of the discussion here.

aardrian commented 12 months ago

@fiji-flo (and @caugner)

AI Help works very differently from AI Explain. We restrict the LLM to base its answers on the context we give it (which is actual, up-to-date MDN content). So you won't see the issues AI Explain was exhibiting.

I tried to confirm this assertion by pasting some code into AI Help and asked it to explain the code. I used my first CSS example from issue 9208 (I do not have an account, so I don't want to use up my free checks for today)

For the example, I got this final paragraph (after the LLM explained each included property that visually hides the pseudo-content):

These pseudo-elements and their styles are used to visually indicate the start and end of strikethrough text. However, it's important to note that the presence of the s element is not announced by most screen reading technology in its default configuration. To make it accessible, additional techniques may need to be applied.

I italicized the part that seems questionable given the context it just provided (that the styles visually hide the content it claims visually indicates the start and end of an element).

I agree that it seems less overtly wrong, but it is still wrong. In a more subtle way.

caugner commented 12 months ago

@aardrian Can you please use the (new) "Report a problem with this answer on GitHub" link at the bottom of the AI Help answer, so that the team can follow up on the specific problem you're experiencing? Thanks! 🙏

Ultrabenosaurus commented 12 months ago

@aardrian Can you please use the (new) "Report a problem with this answer on GitHub" link at the bottom of the AI Help answer, so that the team can follow up on the specific problem you're experiencing? Thanks! 🙏

@aardrian's comment is valid in this thread.

Encouraging users to report each incident separately seems like "divide and conquer" tactics to obscure the true scale and prevalence of the problem. By chopping it up into smaller, specific blocks they can be "addressed" with cherry-picked responses as attempted earlier in this thread, only with less context due to being isolated single Issues, not contributing to the overall picture.

Like how @nyeogmi's previous issue was renamed to obfuscate the real problem being raised, and then closed without addressing said problem properly and prompting the creation of this Issue. And how #9208 was also renamed to obfuscate and downplay the very concerning issue being discussed.

aardrian commented 12 months ago

@caugner

Can you please use the (new) "Report a problem with this answer on GitHub" link at the bottom of the AI Help answer…

No. First, I am already giving my free labor by engaging on this (versus swearing off MDN), and second, what @Ultrabenosaurus said.

GabrielRavier commented 12 months ago

@sideshowbarker Please stop hiding or deleting comments in this repository. Thank you!

Given that one of the comments that were deleted was mine, I'd like to further emphasize that what @sideshowbarker said in https://github.com/mdn/yari/issues/9230#issuecomment-1622233148 is in fact completely accurate: my comment (along with other deleted ones) related entirely and only to minor cleanup and did not need to be present after that was cleared up. I have no issue at all with the deletion of the comment and fully agree that leaving it there would just have cluttered things up.

nyeogmi commented 12 months ago

My understanding is that they were requesting an affirmative case to be made for it being structurally good at this task, rather than providing an individual question that it managed to answer sufficiently accurately (which does not say much about structural fitness for the task).

I don't think the value of good examples is literally zero. But if advocates of the feature are rejecting isolated examples of bad answers as evidence that the feature is bad, then I am reluctant to accept isolated examples of good answers as evidence that the feature is good.

Specifically: if we accuse one side of cherry picking w/o specific basis, we have to accuse both sides of cherrypicking and throw out all the examples. If we just take everyone's evidence at face value, we conclude that it produces both good and bad answers with roughly equal likelihood, which is more consistent with the case that it's bad.

GabrielRavier commented 12 months ago

My understanding is that they were requesting an affirmative case to be made for it being structurally good at this task, rather than providing an individual question that it managed to answer sufficiently accurately (which does not say much about structural fitness for the task).

I don't think the value of good examples is literally zero. But if advocates of the feature are rejecting isolated examples of bad answers as evidence that the feature is bad, then I am reluctant to accept isolated examples of good answers as evidence that the feature is good.

Specifically: if we accuse one side of cherry picking w/o specific basis, we have to accuse both sides of cherrypicking and throw out all the examples. If we just take everyone's evidence at face value, we conclude that it produces both good and bad answers with roughly equal likelihood, which is more consistent with the case that it's bad.

Well, I also think the fact that the side submitting "good examples" is actually submitting examples that seem superficially good but actually have large problems is also quite relevant.

nyeogmi commented 12 months ago

Well, I also think the fact that the side submitting "good examples" is actually submitting examples that seem superficially good but actually have large problems is also quite relevant.

Yeah, when I look at messages other than the one I was at-ed in, I agree this is pretty problematic! Among other things it decreases my faith in the devs' ability to actually determine if the feature is working.

caugner commented 12 months ago

Hi @eevee, 👋

Thank you for your constructive analysis. I appreciate that we’re finally talking about AI Help specifics.

what? if a property isn't supported, it will always be undefined. it would make no sense for a browser to specifically define navigator.onLine to only be false.

I agree with you, I would also expect an unsupported property to be undefined, rather than false. But looking at the content that AI Help consulted, this is exactly what’s written on the Navigator.onLine page. I hope we can agree that this is an MDN content issue, not an AI Help issue.

most of that answer is complete fluff, and more importantly it does not really answer the original question — because of exactly the problem that it struggles to raise. if you want to know for sure that you're in offline mode, you would have to check navigator.onLine === false, so you know you're not mistaking lack of support for being offline.

I'm not sure what parts of the answer you deem “complete fluff”, but if you're criticizing the code example with if (navigator.onLine), then this is again just taken from the same MDN page, rather than invented by AI Help. But you also seem to dismiss the fact that all browsers have supported the property since at least 2015 (according to BCD), so it's not strictly needed to check navigator.onLine === false.

so are they the same colorspace or not? this seems like the crux of the question, but the bulk of the response is rambling that rephrases parts of the linked articles (including repeated mention of the cartesian/polar distinction, which i doubt will help someone who isn't already visualizing a colorspace in their head), rather than a direct answer. it's mostly explaining oklch() and barely touching on oklab() at all.

You're right, the initial statement about both notations expressing colors in the Oklab color space is not correct, but further below it does mention the Oklch color space in the context of oklch(). I acknowledge that bit of incorrectness (and we might have a fix in the pipeline), but otherwise the response looks good to me. We don’t explain color spaces on MDN, so it is expected that AI Help doesn’t go more into detail. The response is based solely on the oklab() and oklch() pages, and rephrasing relevant parts of the linked articles sounds like a reasonable way to tackle the question.

Not knowing anything about these features before, I certainly found the answer helpful, almost more helpful (even if less comprehensive) than reading the articles separately.

a good direct answer would probably say that you just want oklch() if you're not already familiar with Lab's A/B axes and how they correspond to colors. instead we get "here's what blue looks like", without even explaining why it's blue.

If one of the two pages included that advice, then it could definitely be part of the AI Help answer, but since AI Help's answer is (intentionally) limited to the content of those MDN pages, it won't give the answer you're suggesting.


but an LLM can't give an answer like that, because it doesn't understand context, or what common sticking points might look like. or anything at all. all it can do is babble, and hopefully not babble something that's incorrect.

but you can't ever be confident that it won't be wrong about some percentage of arbitrary questions. and if it is wrong, you can't directly correct it the way you might correct a static article. all you can do is keep feeding it more text and cross your fingers that it starts babbling more correctly, in an infinite game of whack-a-mole.

AI Help doesn't need to understand the content in order to produce helpful answers of good quality based on MDN content. We don't claim that AI Help answers are 100% correct, and they can't be, because MDN content doesn't seem to be 100% correct either, as we have seen above.

it might seem like i'm being nitpicky here. and i am — because these examples were specifically cherry-picked to defend the existence of the feature itself. they are the best case scenario. and they are, charitably, mediocre.

Yes, you don't give us the benefit of the doubt if you assume that these examples were cherry-picked. My example was a question I had never asked AI Help before, about features I first heard about these days when someone pointed out that AI Explain did a bad job explaining the oklch() code example. Even the pre-defined examples you'll find on the AI Help page aren't cherry-picked.

ultimately, if you create a chatbot (which you explicitly call "trusted"!) that can't really do much more than restate the contents of existing articles, and you're relying on the reader to sift through its rambling to find the actual information they asked for... then what was the point? they could just as well have sifted through the articles themselves to find the information they wanted, without risking that crucial details will get lost or twisted.

Let's leave it to our users to decide whether they prefer to ask a question, or read through articles instead, or both.

The quantitative feedback we have received so far suggests that the feature is used and we get much more "This answer is helpful" than "This answer is not helpful" votes.

As for qualitative feedback, we have only got 3 reports for AI Help answers so far: The first one was answered correctly, but the code example was incomplete; the second one was confusing two identically named features; and the third one was not a question that could be answered with MDN content. In all three cases, we already identified a solution for the underlying problem. There is a reason the feature is declared "beta".

Anyways, thanks again for being constructive, @eevee.

And to everyone following this thread: Please try out AI Help for yourself and make sure to report problems with answers using the corresponding link.

cgranade commented 12 months ago

I've avoided weighing in on this because, while I am a software developer, I am not a web developer in particular. Reading this discussion, though, I feel like it's worth chiming in precisely because I'm not a web developer — I'm someone who sometimes needs to look up web stuff when scientific or quantum software development interacts with web development, and I've long valued MDN as a resource that I can use without needing to be an expert in web development specifically.

Adopting large language models, a form of machine learning models optimized only for plausibility and not for accuracy, fundamentally changes that dynamic for me. I do not have the expertise needed to tell immediately if a sample presents an accessibility issue, is just malformed, or is otherwise incorrect or dangerous. I rely on reference and explanatory material written and maintained by experts who check the accuracy of that content, but that's wholly inconsistent with adopting LLMs as a technical reference.

I'll also add that even if the "feature" is locked to some parts of the site or some workflows, the fact that it was adopted with so little discussion and is currently being defended over the concerns of many of the experts I've come to rely on is deeply disturbing. If it's acceptable to give up technical accuracy to hop on the "AI" bandwagon, what else am I missing about MDN reference material that might make it unsafe for me to rely on?

obfusk commented 12 months ago

The quantitative feedback we have received so far suggests that the feature is used and we get much more "This answer is helpful" than "This answer is not helpful" votes.

That tells you that the user considered it helpful. It doesn't tell you whether it was correct. It's easy to mistake a plausible, confident-sounding -- but actually misleading or incorrect -- answer for a helpful one. I see no safeguards to prevent this.

resuna commented 12 months ago

@aardrian Can you please use the (new) "Report a problem with this answer on GitHub" link at the bottom of the AI Help answer, so that the team can follow up on the specific problem you're experiencing? Thanks! 🙏

I think the burden of proof is on Mozilla, since there is no basis for assuming that any LLM or related generative text software can be relied upon to produce true output. There is no mechanism for such a program to even distinguish true from false simulated output. It produces output that "looks like" it belongs in the corpus, that's all.

esotericist commented 12 months ago

i find this specific claim highly concerning:

AI Help doesn't need to understand the content in order to produce helpful answers of good quality based on MDN content. We don't claim that AI Help answers are 100% correct, and they can't be, because MDN content doesn't seem to be 100% correct either, as we have seen above.

the first sentence indicates a fundamental misapprehension about what is going on; the existence of only correct, high quality training data does not in any way directly guarantee results in the form of correct, high quality llm output. you can put in all the right things, and get wrong things out again, because all an llm knows how to do is string words together in the right shape; it knows nothing about what makes those things right which means it is perfectly capable of using them wrongly.

the second sentence is touching on something that i think gets missed amongst the other (remarkably well-written) critiques, specifically:

mdn articles can approach correctness if only because sufficiently knowledgeable people can look at the articles, go "ah, no, this is wrong" and submit a complaint.

the key is this requires sufficiently knowledgeable readers to see and notice it... the people who are going to be harmed by incorrect answers, by definition, will not know to report the problem. so this only works if the sufficiently knowledgeable readers are seeing the same content as the actual target audience of the functionality.

this system is fundamentally incapable of allowing that process to occur.

Zarthus commented 12 months ago

This sounds like a case where data-driven development and decision making is being used completely wrong. "100 helpful flags, 3 unhelpful flags" looks like a really great metric, but perhaps you should re-evaluate your business KPIs for the product if that is your primary metric for judging AI Help's success.

You have a ton of really great feedback from some really great community members who actively use MDN and have contributed to it; some of them also run consultancy services. Perhaps it makes sense for Mozilla to bring some of those people on board to re-evaluate the product and pivot it in a direction that makes sense.

Be-ing commented 12 months ago

The quantitative feedback we have received so far suggests that the feature is used and we get much more "This answer is helpful" than "This answer is not helpful" votes. As for qualitative feedback, we have only got 3 reports for AI Help answers so far

I find it amazing that you're choosing to ignore the 1287 people that already told you they want none of this LLM stuff at all in favor of an obviously biased survey that you designed to produce the kind of data you want to suit your narrative. I'm sure you'll split hairs and claim that's not relevant because you think it was about a separate feature to further ignore the immense amount of almost unanimously negative feedback you've already gotten.

nyeogmi commented 12 months ago

Unless something has changed, the current survey system has two chances to introduce sampling bias:

In both cases, people who don't think AI Help is a good idea are filtered out. This is fairly likely to correlate with the dependent variable you're trying to measure: people who understand how LLMs work are unlikely to try the feature, because they have abstract reasons to believe it is a bad idea.

Even more directly: if people already think a thing sucks, they're unlikely to put effort into explicitly negatively reviewing the thing. They are way more likely to just not use the thing.

For helpful/not helpful, the same filters effectively exist for different reasons. I get that automatically following up on users' responses inherently biases towards collecting more feedback from people who are using the thing more, so it's hard to fix the experimental design -- but "we're not equipped to run this with a good experimental design" doesn't validate data produced by a bad (and possibly even deliberately biased) experimental design.
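
To make the skew concrete with some entirely invented numbers (a toy illustration, not real MDN data):

// Toy illustration with invented numbers -- not real MDN data.
const skeptics = 900, enthusiasts = 100; // hypothetical population split
const tryRate = { skeptics: 0.05, enthusiasts: 0.8 }; // who actually clicks the button
const helpfulRate = { skeptics: 0.3, enthusiasts: 0.9 }; // who votes "helpful" after trying it

const helpfulVotes =
  enthusiasts * tryRate.enthusiasts * helpfulRate.enthusiasts + // 72
  skeptics * tryRate.skeptics * helpfulRate.skeptics; // 13.5
const totalVotes =
  enthusiasts * tryRate.enthusiasts + skeptics * tryRate.skeptics; // 125

console.log(`${Math.round((100 * helpfulVotes) / totalVotes)}% helpful`);
// => "68% helpful", even though 90% of this hypothetical population is skeptical and mostly never votes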

(Note that I suspect caugner is comfortable discounting any feedback from anyone who hasn't tried the feature. They may not see this as a sampling bias problem in that case, because they've defined the population in a different way. In general, I think the attitude "if you haven't tried it, you shouldn't criticize it" is bad, but it's especially bad if you apply it to people who are self-selecting out of a sample. Some of the people who clearly should be counted in a study on this feature are likely unaware that they are going uncounted.)

Seirdy commented 12 months ago

The quantitative feedback we have received so far suggests that the feature is used and we get much more "This answer is helpful" than "This answer is not helpful" votes.

@caugner To me, this is the most worrying part of the LLM features added to MDN. The harms dealt by a misinformation tool are magnified by its perceived helpfulness or trustworthiness.

Multiple passable examples don't eclipse a significant minority of bad examples, if they even are a minority.

eevee commented 12 months ago

I agree with you, I would also expect an unsupported property to be undefined, rather than false. But looking at the content that AI Help consulted, this is exactly what’s written on the Navigator.onLine page. I hope we can agree that this is an MDN content issue, not an AI Help issue.

agreed. which is good, because MDN content can be corrected.

I'm not sure what parts of the answer you deem “complete fluff” but if you’re criticizing the code example with if (navigator.onLine), then this is again just taken from the same MDN page, rather than invented by AI Help.

it's mostly explaining how to evaluate a boolean, which is day 1 javascript syntax. someone who needs to check whether the browser is online is surely beyond the point where they need to know what an if looks like. it makes sense as an obligatory example in a larger article, but not as the main substance of an answer like this. (perhaps this is a consequence of "Always include code snippets if available.")

i'm only just now looking at the original article, and there's a big block of introductory prose about what the property actually indicates, none of which made it into the answer. that's unfortunate. it's the sort of thing i may not have thought to ask about.

But you also seem to dismiss the fact that all browsers have supported the property since at least 2015 (according to BCD), so it's not strictly needed to check navigator.onLine === false.

wait a second. i took the generated answer at face value and assumed that browser support was a concern, because the generated answer mentioned it. but you're ascribing this to me dismissing a detail! you are watching this feature mislead a developer in real time and blaming the developer for it.

if i'd read that in the article, i could've glanced down at the browser support table and assumed the mention was merely obsolete. but i was responding only to the text provided, which left out that important context.

this is an interesting downside to the framing of a chatbot: i know that details may have changed since an article was written, but the responses from an LLM were always written just now, which defeats that instinct.

You're right, the initial statement about both notations expressing colors in the Oklab color space is not correct, but further below it does mention the Oklch color space in the context of oklch(). I acknowledge that bit of incorrectness (and we might have a fix in the pipeline), but otherwise the response looks good to me.

[...]

Not knowing anything about these features before, I certainly found the answer helpful, almost more helpful (even if less comprehensive) than reading the articles separately.

but the question wasn't "what are oklch() and oklab()", it was "How does oklch() relate to oklab()?". the only part of the answer that really touched on their relationship was the part at the beginning claiming they pick from the same colorspace, and that's apparently untrue! but if the bar for success is merely describing a topic mentioned in the question, then i suppose this feature will rarely fail.

If one of the two pages included that advice, then it could definitely be part of the AI Help answer, but since AI Help's answer is (intentionally) limited to the content of those MDN pages, it won't give the answer you're suggesting.

sure, that makes perfect sense. but all the framing around this feature explicitly anthropomorphizes it, asks me to think of it like a digital person answering my questions — it's "artificial intelligence", it's a "trusted companion", it's presented in a Q&A chat format. in which case it's perfectly reasonable to expect answers to questions that i wouldn't expect any one article to cover... like a comparison between two CSS functions.

on the other hand, from several of your responses here, it sounds like what you've tried to build is a lossy search engine — something that largely just looks up article contents, trims and lightly rephrases them to fit the question better, and hopefully doesn't mangle anything in the process.

perhaps the conflict here isn't between you and me, or even between people who like LLMs and people who don't, but between what you made and how you're selling it. if you're calling it "AI Help" to capitalize on the hype around OpenAI products, you can't be surprised when you also attract the criticisms of OpenAI products. (especially when it's launched at the same time as the "explain" feature, which seems to have produced much closer to context-free ChatGPT output.)

The quantitative feedback we have received so far suggests that the feature is used and we get much more "This answer is helpful" than "This answer is not helpful" votes.

As for qualitative feedback, we have only got 3 reports for AI Help answers so far [...]

i'm at a disadvantage here, since i obviously have no insight into any MDN stats. but "helpful" doesn't seem like quite the right question, when the main concern about GPT output is that it tends to be convincing regardless of whether it's true.

after all, i'm not going to leave the tab open for half an hour while i go finish building and testing something so i can come back and vote accurately; i'm going to vote immediately based on whether an answer sounds helpful. you yourself said you thought the OKLab answer was helpful, even though it didn't really answer the question asked, and even knowing it contained an incorrect detail that you didn't notice.


i don't know. if this is intended as more search engine than generated conversation, then it's less bad than i thought — but then i don't know why it's being presented as though the marketing department wants me to think of it as ChatGPT with a different stylesheet. (although it does seem to be that. i thought maybe a fine-tuned model was at least involved, but no, it appears to just glue the question to article contents and send the whole shebang to OpenAI.)

are there any privacy concerns with sending everything through the same billion-dollar company that still laughably bills itself as "open"? i don't know. the blog post doesn't even bring it up.

it's not that calling Oklab and Oklch the same colorspace is a catastrophic problem. it's that it's the sort of mild mangling that you just might get with probabilistic text transformation/generation, and i have no reason to be confident that the mangling will be mild in all cases.

and everything about this feature and the way it's presented is practically designed to defeat the normal intuitions about how to read text. sure, a search engine might show a snippet of an article, but i know that a snippet might leave out context, and the entire point (once upon a time) is for me to read the actual article. this feature tries to emulate the aesthetic of human conversation while leaving out all the expectations of human conversation — that the other party will just say when they don't know something, that they will express hesitance about details they're fuzzy on, that they will mention gotchas or other unexpected details i wouldn't think to ask for, that they're aware of the passage of time.

at best this is another step towards a world where nothing is really reliable, where anything and everything might just be lying to me a little bit about things that make no sense to lie about, where prose is just a little bit rambling and a little bit useless and a little bit wrong, where it's not clear if any of this can be fixed because it's just what the magic linear algebra machine does, and where the primary way to control it is to ask it to roleplay. and everyone is okay with that because it's cool that the computer can talk now.

Eragonfr commented 12 months ago

This seems to have been said again and again in this thread and everywhere on the internet.
An LLM is incapable of always being correct, but always being correct is what's expected from MDN.

To use something everyone here knows: a broken clock is still correct twice a day. An LLM can be correct sometimes, but it will still be wrong the rest of the time.

To continue with the clock analogy: the specifications from the W3C are an atomic clock; they say what time it is, they are the reference. MDN is a computer clock that uses the network to sync with the W3C. And the LLM is a mechanical clock, maybe broken, maybe not, but certainly not in sync with the reference.

MDN is expected to be correct and in sync with the reference; that's the reputation MDN built thanks to those writers, reviewers and volunteers. No LLM has this reputation. And as long as no LLM can be as accurate as the reviewers, writers and volunteers who made the content of MDN, the LLM has no place in MDN.

The LLM has no place here because it cannot be trusted. I don't have the knowledge to check whether what the LLM said is true or whether it's bullshit. Because of that I don't use LLMs, and I don't want other developers, with less knowledge about LLMs, to try to use MDN and then be disappointed because the LLM was wrong.

MDN is trustworthy; LLMs are not.

obfusk commented 12 months ago

wait a second. i took the generated answer at face value and assumed that browser support was a concern, because the generated answer mentioned it. but you're ascribing this to me dismissing a detail! you are watching this feature mislead a developer in real time and blaming the developer for it.

Wow. Just wow. As I said:

It's easy to mistake a plausible, confident-sounding -- but actually misleading or incorrect -- answer for a helpful one. I see no safeguards to prevent this.

We now see exactly that scenario happening right here in this thread to multiple experienced developers (including myself as I also took that bit at face value). If even the proponents of this "feature" can be fooled by the LLM to deem a misleading answer "helpful", you have one hell of a misinformation problem. One that will destroy the credibility of MDN as a trustworthy source of information.

One of the problems with LLMs is that -- unlike people -- they have no concept of "I don't know" or "I'm not sure". Because they fundamentally cannot know or understand anything. They cannot tell the difference between something actually correct and something that just sounds like it. They will happily produce helpful-seeming garbage -- even fabricate sources that don't exist -- whilst sounding 100% confident and convincing. You can't solve that by slapping a "We don't claim that AI Help answers are 100% correct" disclaimer on it.


Edit: that last paragraph seems to have triggered a philosophical discussion regarding what LLMs are fundamentally capable of, which really isn't the point. Nor is it a useful discussion to have here. Unless someone can prove that this LLM being used for this feature will actually say "I don't know" or "I'm not sure" instead of producing plausible, confident-sounding -- but actually misleading or incorrect -- answers, we have a misinformation problem. And so far all the evidence we have points to the latter.

joepie91 commented 12 months ago

@caugner I have so far mostly avoided getting involved in this conversation, mostly following this topic from a distance. But I feel like this part needs to be said out loud: with every comment you post, my trust in Mozilla is being reduced further (and to be clear, it already wasn't in an amazing place considering previous governance missteps, but that's a different topic that is not directly relevant here).

I'm seeing the exact same conversation patterns that I'm used to seeing from abusive tech corporations; tunnel focus on (self-selected) one-bit metrics while ignoring nuanced feedback, interpreting parts of a bigger criticism as if they are a free-standing point (as that makes it easier to argue against them), seemingly strategically not addressing the concerns that are difficult to argue against, blaming the user for expectations created by the company, cherry-picking evidence, and so on, and so forth.

The introduction of this feature was one thing. I think that was a mistake and points at governance problems, but that's still a mistake, and mistakes can be corrected. Mistakes happen. What concerns me far more is the way that you are (avoiding) engaging with the many(!) well-argued points being made here; the way that you are refusing to own up to that mistake and undo it, instead doubling down and, in my opinion, discussing it in bad faith.

If this is the way that mistakes are handled within Mozilla, and this is what people can expect outward communication from Mozilla to look like, then how is anyone supposed to trust Mozilla as a steward of the web?

I'll leave the rest of the discussion around LLMs to other folks, since plenty of people are already making plenty of strong arguments, but I figured I'd call out this particular problem in the hopes that something will be done with it, lest Mozilla go under due to a trust thermocline of their own. It'd be nice to have some organization left that we can trust to safeguard the open web.

falemagn commented 12 months ago

Whilst I agree with the general sentiment that a feature which spews out wrong information, even if only occasionally, cannot be considered an authoritative source of information, I am chiming in only for this bit:

One of the problems with LLMs is that -- unlike people -- they have no concept of "I don't know" or "I'm not sure".

I have been told by LLMs many times "I don't know" and "I am not sure", that is not impossible for an LLM to do.

Because they fundamentally cannot know or understand anything.

That's a far-fetched assumption. There actually are arguments in favour of the view that LLMs do indeed understand what they are saying and the input they are provided with, in a way that is not too dissimilar from that of humans, albeit not in exactly the same way.

They cannot tell the difference between something actually correct and something that just sounds like it.

They can, and they often do.

They will happily produce helpful-seeming garbage -- even fabricate sources that don't exist -- whilst sounding 100% confident and convincing.

Many humans do the same, yet we don't doubt they possess a general understanding of things.

eevee commented 12 months ago

there are also arguments that a ouija board understands what it's asked and the answers it gives.

obfusk commented 12 months ago

I have been told by LLMs many times "I don't know" and "I am not sure"; that is not impossible for an LLM to do.

LLMs can say anything a human plausibly would. That's what they're made for. That doesn't mean they understand what it means, even if they manage to convince a human that they do.

falemagn commented 12 months ago

I have been told by LLMs many times "I don't know" and "I am not sure"; that is not impossible for an LLM to do.

LLMs can say anything a human plausibly would. That's what they're made for. That doesn't mean they understand what it means, even if they manage to convince a human that they do.

Again, assumptions, not facts. You're just now "happily producing helpful-seeming garbage whilst sounding 100% confident".

crackwitz commented 12 months ago

let's put the general capacity of LLMs aside. This isn't a philosophical debate.

This whole discussion is about this implementation here, and how it has been shown to emit misleading/wrong text at a frequency that subject experts (of MDN content) deem unacceptably high.

nyeogmi commented 12 months ago

This whole discussion is about this implementation here, and how it has been shown to emit misleading/wrong text at a frequency that subject experts (of MDN content) deem unacceptably high.

I don't think anyone knows what the exact frequency is, beyond the fact that all the text it has generated so far -- regardless of whether the person presenting it likes AI Help or not -- has been at roughly the accuracy expected from any LLM.

This is probably more helpful than literally no information, since some of the information is not incorrect.

I'm seeing the exact same conversation patterns that I'm used to seeing from abusive tech corporations: a tunnel-vision focus on (self-selected) one-bit metrics while ignoring nuanced feedback, interpreting parts of a bigger criticism as if they were free-standing points (because that makes them easier to argue against), seemingly strategically not addressing the concerns that are difficult to argue against, blaming the user for expectations created by the company, cherry-picking evidence, and so on.

When I first saw AI Explain, I thought "oh god, another half-baked application of LLMs" and brought all my existing expectations of LLMs to the table. I then saw a bunch of output that confirmed "yep, another half-baked application of LLMs." It was then removed voluntarily by the author.

Well, now we've got another LLM feature from the same team, built on the same model (with the same training data) and an extremely similar architecture -- the prompt is the only notable difference -- and they're loudly insisting "your existing knowledge of LLMs isn't useful for evaluating this; we need concrete examples." But they've ruled out all the examples by saying "we can fix those" or "did you report those through the proper channels?" or "well, real users (i.e. not the ones on GitHub) aren't complaining."

In other words, Mozilla's apparent stance is that it would be premature to say it's a duck even though it's looking and quacking increasingly like a duck.

caugner's goalposts appear to have implicitly moved from "I won't rule that the feature is bad unless I have examples of it performing badly" to "the people who say 'the feature is bad' based on their existing knowledge of LLMs? well, those specific people need to post examples of the feature performing badly that I personally accept as valid, or else I will complain that people are judging it without trying it and ignore all criticism from both subsections of the community."

The introduction of this feature was one thing. I think it was a mistake, and one that points at governance problems, but a mistake is still just a mistake, and mistakes can be corrected. Mistakes happen. What concerns me far more is the way that you are (avoiding) engaging with the many(!) well-argued points being made here; the way that you are refusing to own up to that mistake and undo it, instead doubling down and, in my opinion, discussing it in bad faith.

Re "a governance problem": to be less coy about this, it looks like the situation is that basically a team has introduced two extremely unpopular features and the only way for those features to be reverted, procedurally, is for the team to agree those features are a bad idea.

One person was easier to convince than the other: fiji-flo pushed out one and then reverted it voluntarily. We've yet to see caugner revert the other one.

It also seems like the team doesn't see "revert and let's talk about whether this idea is actually good or not" as an acceptable middle ground. I notice the marketing team has already promoted this feature, and I wonder if pulling it now, even just to talk about it, would be seen as an un-endorsement that would hurt the feature's credibility. (And I wonder if that is actually an advantage of pulling it temporarily.)

NoraCodes commented 12 months ago

let's put the general capacity of LLMs aside. This isn't a philosophical debate.

Agreed, but I also don't want to lose sight of the larger issue. Yes, one problem is this implementation -- this particular integration of an LLM into MDN -- which will happily lie to users.

But the solution is not "make the LLM better". It solves no real problems that hiring technical writers wouldn't solve. It is, in essence, a copyright-laundering machine that saves money (by not hiring writers) at the expense of quality. It is an attempt to replace the labor of trained, knowledgeable writers with OpenAI's exploitation machine -- exploitation not just of reams of copyrighted material scraped from across the Web, but also of underpaid workers in the Global South.

That is a practice that I, personally, think is unacceptable and abhorrent.

Be-ing commented 12 months ago

Re "a governance problem": to be less coy about this, it looks like the situation is that basically a team has introduced two extremely unpopular features and the only way for those features to be reverted, procedurally, is for the team to agree those features are a bad idea.

Or someone can take the content and code and host them on another domain to remove Mozilla from the project completely.

joepie91 commented 12 months ago

@Be-ing I imagine that the more important consideration for a fork of MDN would be support for it from the existing contributor base, not so much the existing code and content. Though judging from the community feedback so far, that is feeling increasingly achievable.

falemagn commented 12 months ago

OpenAI's exploitation machine -- exploitation not just of reams of copyrighted material scraped from across the Web, but also of underpaid workers in the Global South.

That is a practice that I, personally, think is unacceptable and abhorrent.

Whilst I share the general concern about respecting labour and human rights, Wikipedia, Glassdoor, and some Kenyans themselves say that the average hourly wage in Kenya is far lower than what OpenAI reportedly paid. What income alternative would you suggest for those Kenyans who worked for OpenAI?

As for the copyright infringement claim: that remains to be seen and settled in courts around the world.