Browser perf & SEO > Robotx.txt and JS/CSS crawler access

Marc-pi commented 9 years ago

Browser and crawlers are using robots.txt informations. Latests years, those improved the way they need some information to be accessible :

Bing/Yahoo and Google need a proper access to JS and CSS ressources, to see your pages like users see.

This impacts :

browser display (over the fold display time and sometime some resources display like FA icons)
crawlers speed
SEO rankings

References :

http://googlewebmastercentral.blogspot.fr/2014/10/updating-our-technical-webmaster.html
i guess it is the same for baidu since it is also a modern search engine
http://searchenginewatch.com/sew/how-to/2356447/managing-your-robotstxt-file-effectively
http://www.microsoft.com/web/seo/ => Bing/Yahoo

==> if you use this MS latest tool, you get those Violation warnings

/public/vendor/
/asset/~>yourthemecss&js&images
/asset/~>yourmodulecss&js
/script (captcha)
/static/vendor/jquery
/static/avatar/~>image
/upload~>modules/images directories

Our actual default Robots.txt file

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /asset/
Disallow: /module/
Disallow: /public/
Disallow: /script/
Disallow: /setup/
Disallow: /static/
Disallow: /upload/

User-agent: Mediapartners-Google
Allow: /

User-agent: Adsbot-Google
Allow: /

User-agent: Googlebot-Image
Allow: /
Allow: /upload/

User-agent: Googlebot-Mobile
Allow: /

Images are accessible to google-bot image but not the search engine => i guess we can open /upload to crawlers, no?
what about those bots ? Googlebot-Video, Googlebot-News, Mediapartners (# Mediapartners-Google) => those are missing
we are too restrictive at blocking indexation for Bing/Yahoo/Baidu , and more opened to Google => need to improve

This has a huge impact on SEO rankings and pagespeed rankings & some browser display impact

1/ We have for sure at least to update the default robots Disallow list for CSS/JS and perhaps images and some crawlers. A good robotx.txt file must be also light (you can read also http://www.elegantthemes.com/blog/tips-tricks/how-to-create-and-configure-your-robots-txt-file) i guess it will fix also th FA icons not being displayed by some users @taiwen @voltan => your thoughts about the dir to open ?

2/ But i wonder if we have a more structural changes to make since those CSS/JSS dir resources are located in different directories @taiwen : ...without having regression on running Pi on several servers/instances/sass mode (those dir must be crawlers accessible but also protected by index.html file)

voltan commented 9 years ago

upload dir can be open, But I don't know about Googlebot-Video, Googlebot-News, Mediapartners , if needed we can add support

Marc-pi commented 9 years ago

yep,for me the images dir is not a big issue, the biggest impact is on the access to JS and CSS files

voltan commented 9 years ago

Can you please make example test robotx?

Marc-pi commented 9 years ago

from the above google link (technical guidelines)

` We recently announced that our indexing system has been rendering web pages more like a typical modern browser, with CSS and JavaScript turned on. Today, we're updating one of our technical Webmaster Guidelines in light of this announcement.

For optimal rendering and indexing, our new guideline specifies that you should allow Googlebot access to the JavaScript, CSS, and image files that your pages use. This provides you optimal rendering and indexing for your site. Disallowing crawling of Javascript or CSS files in your site’s robots.txt directly harms how well our algorithms render and index your content and can result in suboptimal rankings. `

those are also detected by the MS tool for Bing/Yahoo , so it is not google only

Marc-pi commented 9 years ago

Can you please make example test robotx?

the question is more the sensitiveness of those directories

/public/vendor/
/asset/~>yourthemecss&js&images
/asset/~>yourmodulecss&js
/script (captcha)
/static/vendor/jquery
/static/avatar/~>image
/upload~>modules/images directories

i'll propose a new version of the robots.txt, let me time to dig into those dir to see what's in

voltan commented 9 years ago

I vote for this version open upload folder and keep other paths as custom test on our websites, on Pi 2.6 we can have new version of robotx

Marc-pi commented 9 years ago

well, i've investigated the above dir, for me we must allow those : /upload /static /public /asset and perhaps /script

caution : we have missing index files in some directories, i'll open an issue on this

voltan commented 9 years ago

We open all paths , I think it not very good idea on general system. for this version keep it without change or just open upload, for others more test needed

Marc-pi commented 9 years ago

We open all paths , I think it not very good idea on general system.

LOL, it's already the case in the default robotx.txt, for several crawlers !!!!

==> i'll go with this, installed already on EDQ

User-agent: *
# JS/CSS
Allow: /static/*.css
Allow: /static/*.js
Allow: /public/*.css
Allow: /public/*.js
Allow: /asset/*.css
Allow: /asset/*.js
Allow: /script/*.css
Allow: /script/*.js
# Disallow
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /asset/
Disallow: /module/
Disallow: /public/
Disallow: /script/
Disallow: /setup/
Disallow: /static/

User-agent: Mediapartners-Google
Allow: /

User-agent: Adsbot-Google
Allow: /

User-agent: Googlebot-Image
Allow: /

User-agent: Googlebot-Mobile
Allow: /

i don't use the $ wildcard because we have several css?212552 and jd?64564564 => i dunno know what are those numbers
Q = why are we authorizing everything to those ? i'm sure we have to simplify here too

User-agent: Mediapartners-Google
Allow: /

User-agent: Adsbot-Google
Allow: /

User-agent: Googlebot-Image
Allow: /
Allow: /upload/

User-agent: Googlebot-Mobile
Allow: /

Marc-pi commented 9 years ago

@voltan in GWT > Fetch as google > enter an url and you'll see the elements that block the crawler (click on status result => in case some elements cannot be seen by the crawler, you have the Partial status)

voltan commented 9 years ago

And https://github.com/pi-engine/pi/issues/1325 if we add index on folders google can not see information's on them

what is your final suggestion ? add

Allow: /static/*.css
Allow: /static/*.js
Allow: /public/*.css
Allow: /public/*.js
Allow: /asset/*.css
Allow: /asset/*.js
Allow: /script/*.css
Allow: /script/*.js

?

voltan commented 9 years ago

I ask some one , its seams , it not very easy, and have to many different rules , for this version I think we can set it like :

User-agent: *
# Disallow
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /asset/
Disallow: /module/
Disallow: /public/
Disallow: /script/
Disallow: /setup/
Disallow: /static/

User-agent: Mediapartners-Google
Allow: /

User-agent: Adsbot-Google
Allow: /

User-agent: Googlebot-Image
Allow: /

User-agent: Googlebot-Mobile
Allow: /

Marc-pi commented 9 years ago

See EDQ. I think I will block for all like before and then open with my actual allow rules only to Bing and Googlebot that are safe

voltan commented 9 years ago

For default installer it good. we can add others as custom edit on websites. I think we can close it

pi-engine / pi

Browser perf & SEO > Robotx.txt and JS/CSS crawler access #1324