Closed Marc-pi closed 9 years ago
The upload dir can be open, but I don't know about Googlebot-Video, Googlebot-News, and Mediapartners; if needed we can add support.
Yep, for me the images dir is not a big issue; the biggest impact is on access to JS and CSS files.
Can you please make an example test robots.txt?
From the above Google link (technical guidelines):

> We recently announced that our indexing system has been rendering web pages more like a typical modern browser, with CSS and JavaScript turned on. Today, we're updating one of our technical Webmaster Guidelines in light of this announcement.
> For optimal rendering and indexing, our new guideline specifies that you should allow Googlebot access to the JavaScript, CSS, and image files that your pages use. This provides you optimal rendering and indexing for your site. Disallowing crawling of Javascript or CSS files in your site's robots.txt directly harms how well our algorithms render and index your content and can result in suboptimal rankings.
Those are also detected by the MS tool for Bing/Yahoo, so it is not Google-only.
The question is more the sensitivity of those directories:

- /public/vendor/
- /asset/ => your theme CSS, JS & images
- /asset/ => your module CSS & JS
- /script (captcha)
- /static/vendor/jquery
- /static/avatar/ => images
- /upload => modules/images directories
I'll propose a new version of the robots.txt; give me time to dig into those dirs to see what's in them.
I vote for this version: open the upload folder and keep the other paths as custom; test on our websites. On Pi 2.6 we can have a new version of robots.txt.
Well, I've investigated the above dirs; for me we must allow these:

- /upload
- /static
- /public
- /asset

and perhaps /script.

Caution: we have missing index files in some directories; I'll open an issue on this.
Opening all paths is not a very good idea on a general system, I think. For this version, keep it without change or just open upload; for the others more testing is needed.
> We open all paths, I think it not a very good idea on a general system.
LOL, it's already the case in the default robots.txt, for several crawlers!

==> I'll go with this, already installed on EDQ:
```
User-agent: *
# JS/CSS
Allow: /static/*.css
Allow: /static/*.js
Allow: /public/*.css
Allow: /public/*.js
Allow: /asset/*.css
Allow: /asset/*.js
Allow: /script/*.css
Allow: /script/*.js
# Disallow
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /asset/
Disallow: /module/
Disallow: /public/
Disallow: /script/
Disallow: /setup/
Disallow: /static/
User-agent: Mediapartners-Google
Allow: /
User-agent: Adsbot-Google
Allow: /
User-agent: Googlebot-Image
Allow: /
Allow: /upload/
User-agent: Googlebot-Mobile
Allow: /
```
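For reference, Google resolves conflicts between Allow and Disallow with a longest-match rule: the most specific matching pattern wins, and ties go to Allow. That is why `Allow: /static/*.css` can override `Disallow: /static/`. Here is a rough Python sketch of that evaluation (not Pi code; the function name and the rule encoding are made up for illustration):

```python
import re

def google_longest_match(rules, path):
    """Return True if `path` is allowed under Google-style precedence:
    the matching rule with the longest pattern wins; ties favor Allow;
    no matching rule at all means the path is allowed."""
    best_len, allowed = -1, True
    for allow, pattern in rules:
        # robots.txt wildcards: '*' matches any run of characters,
        # a trailing '$' anchors the end of the path.
        regex = re.escape(pattern).replace(r"\*", ".*")
        if regex.endswith(r"\$"):
            regex = regex[:-2] + "$"
        if re.match(regex, path):
            if len(pattern) > best_len or (len(pattern) == best_len and allow):
                best_len, allowed = len(pattern), allow
    return allowed

# (True, pattern) = Allow, (False, pattern) = Disallow -- the '*' group above
rules = [
    (True, "/static/*.css"), (True, "/static/*.js"),
    (True, "/public/*.css"), (True, "/public/*.js"),
    (True, "/asset/*.css"),  (True, "/asset/*.js"),
    (True, "/script/*.css"), (True, "/script/*.js"),
    (False, "/cgi-bin/"), (False, "/tmp/"), (False, "/asset/"),
    (False, "/module/"), (False, "/public/"), (False, "/script/"),
    (False, "/setup/"), (False, "/static/"),
]

print(google_longest_match(rules, "/static/theme/style.css"))  # True
print(google_longest_match(rules, "/static/secret.txt"))       # False
print(google_longest_match(rules, "/upload/photo.jpg"))        # True
```

So with this group, CSS/JS files under the disallowed directories stay crawlable while everything else in those directories stays blocked.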
@voltan in GWT > Fetch as Google: enter a URL and you'll see the elements that block the crawler (click on the status result; if some elements cannot be seen by the crawler, you get the Partial status).

And https://github.com/pi-engine/pi/issues/1325: if we add index files to folders, Google cannot see the information in them.
What is your final suggestion? Add the following?

```
Allow: /static/*.css
Allow: /static/*.js
Allow: /public/*.css
Allow: /public/*.js
Allow: /asset/*.css
Allow: /asset/*.js
Allow: /script/*.css
Allow: /script/*.js
```
I asked someone; it seems it is not very easy and has too many different rules. For this version I think we can set it like:
```
User-agent: *
# Disallow
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /asset/
Disallow: /module/
Disallow: /public/
Disallow: /script/
Disallow: /setup/
Disallow: /static/
User-agent: Mediapartners-Google
Allow: /
User-agent: Adsbot-Google
Allow: /
User-agent: Googlebot-Image
Allow: /
User-agent: Googlebot-Mobile
Allow: /
```
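As a sanity check on this simplified version, Python's stdlib `urllib.robotparser` can evaluate it. A sketch under one caveat: the stdlib parser uses first-match order and has no wildcard support, unlike Google's longest-match rule, but that doesn't matter here because these rules are plain path prefixes:

```python
from urllib import robotparser

# The simplified rules above, fed to the stdlib parser line by line.
lines = [
    "User-agent: *",
    "Disallow: /cgi-bin/",
    "Disallow: /tmp/",
    "Disallow: /asset/",
    "Disallow: /module/",
    "Disallow: /public/",
    "Disallow: /script/",
    "Disallow: /setup/",
    "Disallow: /static/",
    "User-agent: Googlebot-Image",
    "Allow: /",
]

rp = robotparser.RobotFileParser()
rp.parse(lines)

# Plain Googlebot falls into the '*' group, so CSS under /static/ is
# blocked -- exactly the rendering problem discussed in this thread.
print(rp.can_fetch("Googlebot", "http://example.com/static/theme/style.css"))  # False
print(rp.can_fetch("Googlebot-Image", "http://example.com/upload/photo.jpg"))  # True
```

This makes the tradeoff concrete: with this version the default installer is safe, but the main Googlebot still cannot fetch the CSS/JS under the disallowed directories.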
See EDQ. I think I will block everything for all crawlers, like before, and then open with my current Allow rules only for Bing and Googlebot, which are safe.

For the default installer it's good; we can add the others as custom edits on websites. I think we can close this.
Browsers and crawlers use the robots.txt information. In recent years, they have changed which resources they need to be accessible:

This impacts:

References:

==> if you use this latest MS tool, you get those Violation warnings.

Our actual default robots.txt file:
Googlebot-Video, Googlebot-News, Mediapartners (Mediapartners-Google) => those are missing.

1/ We certainly have to at least update the default robots.txt Disallow list for CSS/JS, and perhaps for images and some crawlers. A good robots.txt file must also be light (see also http://www.elegantthemes.com/blog/tips-tricks/how-to-create-and-configure-your-robots-txt-file); I guess it will also fix the FA icons not being displayed for some users. @taiwen @voltan => your thoughts about the dirs to open?

2/ But I wonder if we have more structural changes to make, since those CSS/JS resources are located in different directories. @taiwen: ...without having regressions when running Pi on several servers/instances/SaaS mode (those dirs must be crawler-accessible but also protected by an index.html file).
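On the "protected by index.html" point, a small script can report which public directories are missing the guard file. A rough sketch only; the directory list and the assumption that it runs from the Pi web root are mine, not from the Pi codebase:

```python
import os

# Public-facing directories named in this thread (assumption: run from
# the Pi web root; adjust the list to your install).
PUBLIC_DIRS = ["asset", "static", "public", "upload", "script"]

def dirs_missing_index(root="."):
    """Walk each public directory and collect every subdirectory
    that lacks an index.html guard file."""
    missing = []
    for top in PUBLIC_DIRS:
        base = os.path.join(root, top)
        if not os.path.isdir(base):
            continue
        for dirpath, _dirnames, filenames in os.walk(base):
            if "index.html" not in filenames:
                missing.append(dirpath)
    return missing

if __name__ == "__main__":
    for d in dirs_missing_index():
        print("missing index:", d)
```

Running this before shipping a new robots.txt would catch the missing-index-file issue mentioned above in the same pass.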