pulumi / pulumi-ai


Pulumi AI is poisoning Google search results with AI answers #79

Closed · petetnt closed this issue 1 month ago

petetnt commented 3 months ago

What happened?

Today I was googling various infrastructure-related searches and noticed a worrying trend of Pulumi AI answers getting indexed and ranking high in Google results, regardless of the quality of the AI answer itself or whether the question involved Pulumi in the first place. This happened with multiple searches and will probably get even worse as time goes on.

Example

For example, searching for AWS Lightsail xray brings up this AI Answer from Pulumi as the top result:

[Screenshot: the Pulumi AI Answer shown as the top Google search result]

Link to the AI Answer: https://www.pulumi.com/ai/answers/bLHAi4DutXJvbJyNngGRvS/optimizing-aws-lightsail-and-x-ray-deployment.

While this might seem like a good thing for someone, spamming highly ranked results that are at best misleading and at worst destructive does not seem like something I would want to associate with Pulumi as a brand. There's already a ton of generated, false content on the internet, and adding even more noise to the search results is not a good idea.

I would highly recommend Disallow:ing robots from crawling https://www.pulumi.com/ai/answers via robots.txt or similar functionality.
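For illustration, a minimal robots.txt sketch of what that could look like (the exact rules are of course up to the Pulumi team; the /ai/answers/ prefix is just taken from the links above):

```
# Sketch only: ask well-behaved crawlers not to crawl the AI Answers pages
User-agent: *
Disallow: /ai/answers/
```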

Additional context

Adding -inurl:pulumi.com/ai to your query will remove Pulumi AI answers from the search results, but it's cumbersome.

Contributing

Vote on this issue by adding a 👍 reaction.

AaronFriel commented 3 months ago

Hey @petetnt, we've heard this feedback from you and others, and we've taken steps to remove more than half (almost two thirds) of AI Answers. We plan to continue to ensure that these AI Answers complement our existing documentation.

We are also taking steps to:

cnunciato commented 3 months ago

we've taken steps to remove more than half of AI Answers

Worth mentioning that this list was submitted to Google this morning, so it could be a bit before the pages are removed from search results. We expect this to happen fairly soon, though.

petetnt commented 3 months ago

Thank you @AaronFriel and @cnunciato for the prompt (sic) and solid response 🫡

petetnt commented 2 months ago

This was trending on my Twitter feed today, so it's pretty safe to assume that the situation is still dire: https://twitter.com/ProgrammerDude/status/1784833971731223033

Camel commented 2 months ago

It honestly makes using Pulumi itself very challenging... it's hard to find valid answers on how to do something because the Pulumi AI-generated ones crowd the results, and if you try them they don't actually work. And for some time (at least as of 2-3 weeks ago), the links to Pulumi's site for these generated results were 404'ing.

I appreciate there's a GTM benefit to this SEO work, but at least for me it cut in the opposite direction. I wanted to use Pulumi, but this was such a pain point that I just stuck with Terraform.

mbomb007 commented 2 months ago

Hey @petetnt, we've heard this feedback from you and others, and we've taken steps to remove more than half (almost two thirds) of AI Answers. We plan to continue to ensure that these AI Answers complement our existing documentation.

It doesn't sound like robots.txt was changed. Removing answers isn't going to fix the issue if LLM-generated answers are still available in search results.

However, for people who don't like Google, the issue will probably help alternatives to Google gain market share.

daaain commented 2 months ago

You ABSOLUTELY MUST add a report button on these pages at the very least, ASAP!

If somebody asks a question about stuff that doesn't exist, the LLM will hallucinate it, it'll rank high in searches (as no one else will have written about the solution that isn't possible), and it'll confuse the hell out of whoever finds it!

I'm pretty knowledgeable about GCP (I actually have the GCP Professional Cloud Architect certification), but I was chasing down the wrong idea that it would be possible to pre-create a Cloud Function with Pulumi and then use gcloud to deploy from source. I can't find those pages any more, but while looking for them I randomly found two wrong answers:

https://www.pulumi.com/ai/answers/7Kzx1a8vhPuAX6yYjEpeG3/deciphering-google-cloud-artifact-registry-and-cloud-functions-v2-integration
https://www.pulumi.com/ai/answers/gpu3nhpaDXc7gDC5hG61PZ/deploying-gke-and-cloud-functions-on-google-cloud

cnunciato commented 2 months ago

@daaain Thank you for pointing this out! I actually thought we were doing this already. I just opened a PR to add the same feedback widget we use elsewhere in Pulumi AI.

cnunciato commented 2 months ago

I just opened a PR to add the same feedback widget we use elsewhere in Pulumi AI.

The PR's been merged and the site's been updated. Thanks again for the report!

cnunciato commented 2 months ago

It doesn't sound like robots.txt was changed.

@mbomb007 We actually did update our robots.txt file to point to a sitemap we built to tell Google about unpublished pages, all of which return HTTP 410 with <meta name="robots" content="noindex"> directives. (This is the sitemap that was submitted to Google on March 22.) This combination (HTTP 410 + noindex) is the strongest signal we know of to tell Google that these pages are gone and aren't coming back. It's unclear why it's taking so long for Google to remove them.
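To make that concrete, here's a rough sketch of the combination (the sitemap filename below is a placeholder, not our actual URL):

```
# robots.txt (sketch): advertise the sitemap listing the unpublished pages
Sitemap: https://www.pulumi.com/sitemap-unpublished.xml
```

and each unpublished answer responds along these lines:

```
HTTP/1.1 410 Gone
Content-Type: text/html

<!doctype html>
<html>
  <head>
    <meta name="robots" content="noindex">
  </head>
  ...
</html>
```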

daaain commented 2 months ago

The PR's been merged and the site's been updated. Thanks again for the report!

Amazing turnaround time, thanks a lot for your hard work on this today!

I'll make sure to flag nonsensical generated code when I find more (I did so on the two pages I linked above, with explanations).

tobytteh commented 2 months ago

This use of AI is monstrously stupid. Let us all pray to our respective gods that someone at Pulumi is intelligent enough to end this.

mbomb007 commented 1 month ago

It doesn't sound like robots.txt was changed.

@mbomb007 We actually did update our robots.txt file to point to a sitemap we built to tell Google about unpublished pages, all of which return HTTP 410 with <meta name="robots" content="noindex"> directives. (This is the sitemap that was submitted to Google on March 22.) This combination (HTTP 410 + noindex) is the strongest signal we know of to tell Google that these pages are gone and aren't coming back. It's unclear why it's taking so long for Google to remove them.

You could maybe speed up reindexing using Google Search Console?

AaronFriel commented 1 month ago

@mbomb007 We did that as well, submitting the "unpublished" sitemap, and the console reported that it scanned (IIRC, it did not say "crawled") those pages. Our last resort has been using a tool that allows us to remove up to 1,000 URLs per day from Google's index, but it is fairly manual.

petetnt commented 1 month ago

Why not add the <meta name="robots" content="noindex"> meta tag to all the pages under the /ai path? Or explicitly add Disallow: /ai/* to the robots.txt at https://www.pulumi.com/robots.txt?

mbomb007 commented 1 month ago

@AaronFriel This might help: https://developers.google.com/search/docs/crawling-indexing/block-indexing#debugging-noindex-issues

AaronFriel commented 1 month ago

@petetnt @mbomb007 Thanks, we've already taken some of those steps, and we've added the meta tag to the pages we want to remove, i.e. those that didn't meet our quality bar and were cluttering search results (roughly 2/3 of the pages we published).

I think the point you're getting at is: why publish AI Answers at all? In short: we've gotten very positive feedback from users when the pages show up appropriately and don't clutter the first page of results in a search engine. We don't want to throw out the good with the bad, and we've marked those pages as noindex. The URL Inspection tool reports what we expect since we set these pages to 404:

[Screenshot: Google URL Inspection tool reporting a page as 404ed]

I'll speak to @tobytteh's comment here, which I think captures the frustration folks have and the underlying question of why Pulumi feels comfortable generating code examples with AI:

This use of AI is monstrously stupid. Let us all pray to our respective gods that someone at Pulumi is intelligent enough to end this.

Code generation is Pulumi's bread and butter; it is a core competency of our engineering org. Every one of our providers has a rich schema describing the SDK (for example, the Docker provider's schema.json). Those schemas are then used to generate the SDKs for each language (source code in github.com/pulumi/pulumi/pkg/codegen). Pulumi AI combines this with retrieval-augmented generation and type checking of generated programs, making it more than ten times more likely to generate valid, working code for many questions than ChatGPT (GPT-4) on its own. Generating code that much better than state-of-the-art language models is itself a feat, but we aren't resting on our laurels; we're continuing to set an even higher bar for ourselves to ensure that every program we publish would work from copy-and-paste to pulumi up.
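To illustrate what the type-checking step can look like, here's a toy sketch using the TypeScript compiler API. This is not our actual pipeline, and the file path is just a placeholder; it assumes the generated answer has been written to disk inside a project where its dependencies (e.g. @pulumi/aws) are installed so imports resolve:

```typescript
import * as ts from "typescript";

// Compile the candidate program without emitting output; we only care
// about the diagnostics the type checker produces.
function typeCheckGeneratedProgram(entryFile: string): readonly ts.Diagnostic[] {
  const program = ts.createProgram([entryFile], {
    strict: true,
    noEmit: true,
    module: ts.ModuleKind.CommonJS,
    target: ts.ScriptTarget.ES2020,
    esModuleInterop: true,
  });
  return ts.getPreEmitDiagnostics(program);
}

// A program that fails the check is rejected (or regenerated) rather than published.
const diagnostics = typeCheckGeneratedProgram("generated-answer/index.ts");
if (diagnostics.length > 0) {
  for (const d of diagnostics) {
    console.error(ts.flattenDiagnosticMessageText(d.messageText, "\n"));
  }
  process.exitCode = 1;
}
```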

That said, we certainly didn't anticipate the consequences of publishing as many pages as we did, and that's why we've taken drastic steps to withdraw a significant number (around 2/3) of the AI Answers. We'll continue to raise the bar on quality and prune pages that do not meet our standards.

petetnt commented 1 month ago

Personally I don't think the (alleged) 2/3rds is nearly a good enough ratio to justify spamming the internet full of absolutely wrong answers. Not to mention that the page mentioned in the first post is still up and indexed, for example, which makes me think that you are willing to risk it for a piece of the much-obsessed-over AI pie.

For example, publishing an index of valid answers while keeping the index itself out of the search results would probably satisfy those looking for AI answers too.

AaronFriel commented 1 month ago

In the interest of transparency, I'm happy to set up a call to chat and back up that 2/3 figure. Email me at my last name at pulumi.com.

That said, there are three issues here:

  1. Opposition to AI generated content, full stop
  2. Low quality code examples/documentation
  3. Cluttering of search results

While I see the pros and cons of 1, the issues we want to solve are 2 and 3. If you have pages where the example is "absolutely wrong", that falls under 2, so please create issues or use the feedback buttons to let us know.

AaronFriel commented 1 month ago

Thanks, everyone, for your feedback. In February, when we saw the impact Pulumi AI Answers had on search result quality, we started work on solutions, and we're now seeing dramatic improvements from the work we've done:

The good news is this has been effective! We’re pleased to see search engines use these signals to place our authoritative, expert-written docs content first.

Pulumi AI is still providing a ton of value to users - we're seeing thousands of questions asked and answered every day, helping devs build faster on any cloud. With quality checks in place and search results cleaned up, we’ve made Pulumi AI a better resource that is more correctly ranked relative to our other docs such as our Registry API Docs. And we will keep iterating on these improvements to documentation, code generation and verification of AI generated content.

We’ll close this issue as resolved, and thanks again for pushing us to make Pulumi better.

petetnt commented 1 month ago

Sadly, for me the original issue persists, with the example in the OP still being one of the many answers that provide me with negative value, so I guess I'll just consider this more a "wontfix" than completed.

joeduffy commented 1 month ago

Bizarrely, this is one of the only queries that still seems to rank so highly. It honestly baffles me why Google ranks this page above everything else. I've tried numerous others, and we've validated that the traffic Google is sending to these pages has died down considerably. We will keep monitoring and iterating.

That said, for what it's worth, the example on this page works! Is there a particular reason it isn't perceived as a reasonable page to have on the Internet? I'm not an AWS Lightsail expert, so apologies if I'm missing something obvious.