Open · idnovic opened this issue 1 year ago
My second test set is https://www.bsi.bund.de/DE/Home/home_node.html. It is only around 1% of the size of the Microsoft docs set. I will give feedback after it has finished indexing.
Hi @idnovic - thank you for your feedback. I'm sorry I missed it; apologies for the delay.
Correct me if I'm wrong, but it sounds like what you're looking for are "filters" for what is indexed (or what is ignored) for a specific domain. In your example, I assume it would involve only indexing items with a "de-de" in their path. Am I understanding your requirement correctly?
No problem. I did read that it is alpha software. I did not expect an answer right away.
Well, maybe. Filters do sound useful. For my test with the Microsoft docs, the URL path itself already contained the German language code.
It seems that Bloo found the links to the other translations in the site metadata and tried to index every language.
I think it would be best if you tried it yourself, to see what I mean.
My second test, with the BSI site, was not successful. Bloo was not able to index it; maybe rate limiting or a parsing issue.
Thanks for the feedback; I think I understand the issue you're having - you'd like to index only a specific language on the site. My thinking is more about what the best solution would be: one that solves your issue, but is also as useful as possible in as many other situations, and for as many other people, as possible, without being too general, if you get what I mean :)
In your mind what kind of feature/option would be the ideal solution for your issue?
I think that Bloo should first create a list of all domains it wants to index and present me this list sorted by domain/subdomain/directory.
I think this list needs to be in a tree view.
Let me disable branches.
That way it would work for most websites, and I could disable every branch of another language (a rough sketch of such a tree follows below).
I am just not sure how you can pre-create the URL list without loading every sub-page.
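For illustration, such a tree could be modelled roughly like this in Swift. All names here are hypothetical, not Bloo's actual code:

```swift
import Foundation

// A minimal sketch of the tree-view idea: group discovered URLs by host and
// path component, and let the user toggle whole branches on or off.
final class PathNode {
    let name: String
    var enabled = true          // the user can disable a whole branch
    var children: [String: PathNode] = [:]

    init(name: String) { self.name = name }

    // Insert a URL by walking host + path components, creating nodes as needed.
    func insert(_ url: URL) {
        var node = self
        let parts = [url.host ?? "?"] + url.pathComponents.filter { $0 != "/" }
        for part in parts {
            if node.children[part] == nil {
                node.children[part] = PathNode(name: part)
            }
            node = node.children[part]!
        }
    }

    // A URL is indexable only if every node on its branch is still enabled.
    func allows(_ url: URL) -> Bool {
        var node = self
        let parts = [url.host ?? "?"] + url.pathComponents.filter { $0 != "/" }
        for part in parts {
            guard let child = node.children[part] else { return node.enabled }
            guard child.enabled else { return false }
            node = child
        }
        return true
    }
}
```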
Indeed, there is a chicken and egg issue there - although for some sites reading the sitemap could provide a starting point for something like this, but it wouldn't work in a consistent way across sites. In a way, a regex-based filter would accomplish the same thing as links come in (in fact, Bloo will do this already based on rules specified in the robots.txt file) - so perhaps a way for the user to add "extra" rules for a domain could make sense?
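To illustrate the "extra rules per domain" idea, here is a minimal Swift sketch of user-supplied regex filters applied to each discovered link. The type and the patterns are hypothetical, not Bloo's actual implementation:

```swift
import Foundation

// User-supplied allow/deny regexes, checked as links come in.
struct DomainFilter {
    var allowPatterns: [String]   // e.g. ["^/de-de/"]
    var denyPatterns: [String]    // e.g. ["^/(en-us|fr-fr)/"]

    func shouldIndex(_ url: URL) -> Bool {
        let path = url.path
        let matches = { (pattern: String) -> Bool in
            path.range(of: pattern, options: .regularExpression) != nil
        }
        if denyPatterns.contains(where: matches) { return false }
        if allowPatterns.isEmpty { return true }   // no allow-list: allow all
        return allowPatterns.contains(where: matches)
    }
}

// Usage: keep only the German docs on learn.microsoft.com.
let filter = DomainFilter(allowPatterns: ["^/de-de/"], denyPatterns: [])
let url = URL(string: "https://learn.microsoft.com/en-us/docs/")!
print(filter.shouldIndex(url))   // false
```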
I think a single rule may already improve this situation.
Let’s keep the Microsoft docs situation in mind. We are on the first URL and tell Bloo to index. Bloo finds the second URL via metadata. This seems to be a logical error, because it is traversing upwards (from the first URL to the second) instead of downwards (from the first to the third). If I go to a website and make an effort to find indexable content, then I probably do not want content upwards of my chosen directory.
I think fully custom rules may be too complex, because Bloo does not tell me why it wants to index something. But basic rules I can switch on/off to change the indexing seem like a good idea.
Good switchable rules that come to mind (see the sketch below) are:

- traversal depth
- traversal direction
- minimum number of words to index (per page)
- maximum number of words to index (per page)
- cross-domain indexing
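To make the list concrete, here is a hypothetical Swift sketch of how those toggles could look as crawl settings, including the downward-only traversal check described above. None of these names come from Bloo itself:

```swift
import Foundation

struct CrawlRules {
    var maxDepth: Int = 5             // traversal depth
    var downwardOnly: Bool = true     // traversal direction
    var minWords: Int = 0             // minimum words to index a page
    var maxWords: Int = .max          // maximum words to index a page
    var followExternalLinks = false   // cross-domain indexing

    func allows(_ candidate: URL, from start: URL, depth: Int, wordCount: Int) -> Bool {
        if depth > maxDepth { return false }
        if wordCount < minWords || wordCount > maxWords { return false }
        if !followExternalLinks && candidate.host != start.host { return false }
        // "Downwards" = the candidate path stays under the chosen start directory.
        if downwardOnly && !candidate.path.hasPrefix(start.path) { return false }
        return true
    }
}

// Example: starting at /de-de/docs/, a sibling language tree is "upwards".
let rules = CrawlRules()
let start = URL(string: "https://learn.microsoft.com/de-de/docs/")!
let other = URL(string: "https://learn.microsoft.com/en-us/docs/")!
print(rules.allows(other, from: start, depth: 1, wordCount: 500))   // false
```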
Also, since Bloo runs mostly on Apple platforms, it may be possible to detect the page language and offer a setting to only index pages in certain languages.
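Apple's NaturalLanguage framework can indeed guess the dominant language of extracted text, so a sketch of such a setting might look like this (the allow-list and the sample text are assumptions for illustration):

```swift
import NaturalLanguage

// Guess the dominant language of a page's extracted text.
func dominantLanguage(of text: String) -> NLLanguage? {
    let recognizer = NLLanguageRecognizer()
    recognizer.processString(text)
    return recognizer.dominantLanguage
}

let allowed: Set<NLLanguage> = [.german]
let pageText = "Das Bundesamt für Sicherheit in der Informationstechnik ..."
if let language = dominantLanguage(of: pageText), allowed.contains(language) {
    print("index this page")   // prints: index this page
}
```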
Hi, sorry for the slow reply, and hope you're having a happy holiday season. Those options do indeed sound very useful, so if I understand correctly we'd have:

- traversal depth
- traversal direction
- minimum/maximum number of words to index per page
- cross-domain indexing
Can you tell me more about cross-domain indexing? Do you mean that e.g. https://support.apple.com/de/data.html would also allow https://developer.apple.com/de/other_data.html? If so, how would the option work? Perhaps some wildcard, e.g. https://*.apple.com/de for instance?
Thanks again for these suggestions, I'll definitely be working on the first three, they sound very useful - when you find some time (no hurry) let me know more about cross-domain stuff.
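For reference, one possible semantics for such a wildcard, sketched in Swift. This only illustrates the idea being discussed, not Bloo's implementation:

```swift
import Foundation

// Translate a pattern like "https://*.apple.com/de" into a regex and test
// candidate URLs against it. "*" matches any subdomain segment.
func matches(_ url: URL, pattern: String) -> Bool {
    // Escape regex metacharacters, then turn the escaped "*" into "[^/]*".
    let escaped = NSRegularExpression.escapedPattern(for: pattern)
        .replacingOccurrences(of: "\\*", with: "[^/]*")
    return url.absoluteString.range(of: "^" + escaped,
                                    options: .regularExpression) != nil
}

let pattern = "https://*.apple.com/de"
print(matches(URL(string: "https://support.apple.com/de/data.html")!, pattern))   // true
print(matches(URL(string: "https://developer.apple.com/en/docs.html")!, pattern)) // false
```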
No problem. Holidays are more important than GitHub. By cross-domain indexing I was thinking about articles that provide sources at the end of the article. Think of Wikipedia: I may want to index wikipedia.com/cheesecake, but I may also want to index all the sources given at the end of the page. These sources are probably external.
(Let's rename cross-domain to "follow sources".) With "follow sources" turned on, I would expect Bloo to also index those cited sources.
Another example would be an article on a news website. I am reading about dogs and the article lists external sources in the markup. If "follow sources" is enabled, then Bloo should also index the given sources.
Important: Only follow sources from the article itself. Not from other components / divs of the website.
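A sketch of that scoping, using the third-party SwiftSoup HTML parser and assuming the article body sits in an `<article>` element (both the library choice and the selector are assumptions for illustration):

```swift
import SwiftSoup

// Collect links only from the article body itself, so links elsewhere on the
// page (menus, "related articles" boxes, footers) are ignored.
func sourceLinks(inArticleHTML html: String) throws -> [String] {
    let document = try SwiftSoup.parse(html)
    let links = try document.select("article a[href]")
    return try links.array().map { try $0.attr("href") }
}

let html = """
<nav><a href="/other-language">Deutsch</a></nav>
<article>
  Dogs are great. Sources:
  <a href="https://example.org/dog-study.pdf">Dog study</a>
</article>
"""
print(try sourceLinks(inArticleHTML: html))
// ["https://example.org/dog-study.pdf"] - the nav link is not followed
```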
Just a note to let you know I haven't forgotten about this :) Time is very limited these days but I suspect the eventual solution will be to allow a custom extension to be added to the domain's "robots.txt", so for instance you can add extra rules, such as only allowing certain paths, or URLs that match specific regexes, etc. (Apologies in case you're no longer using Bloo, just ignore me then :) But I still want to implement this in some form as it's a great suggestion, and I want to use it too :))
Ok, the initial hack is in the upcoming build 55. For example, if you want to add access rules to https://developer.apple.com so it only indexes the /de section, you can do this: create a `local-robots.txt` file at `~/Library/Containers/build.bru.bloo.app/Data/Documents/storage.noindex/developer.apple.com/local-robots.txt`, using the same syntax as a regular `robots.txt` file. In this example:

```
User-agent: _bloo_local_domain_agent
Disallow: /
Allow: /de/
```
If you refresh, or pause and un-pause the indexing of that domain, it should no longer access URLs outside this scope. Note that if you already have indexed items outside this scope, they will only be removed if a full refresh of the domain is made.
Note that `_bloo_local_domain_agent` must be the name of the agent; anything else will get ignored.
I'm going to improve this with a bit of GUI when possible, but I wanted to get the capability in initially, so if you really need it you can add that text file and get started. It supports all the syntax that a regular agent definition in a robots.txt file has, so if it can be done in a regular robots.txt file it can be done here too.
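For instance, assuming the parser also honours the common `*` and `$` wildcard extensions (most robots.txt implementations do, but this is an assumption, not confirmed for Bloo), you could hypothetically scope indexing to German HTML pages only:

```
User-agent: _bloo_local_domain_agent
Disallow: /
Allow: /de/
Allow: /de-de/*.html$
```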
I hope this helps and sorry for taking ages to reply!
Greetings,
I am testing Bloo on https://learn.microsoft.com/de-de/docs/ but it wants to index every language of the docs, not just German. I am aware that I chose a large data set; I still wanted to ask about optimisations.
Bloo tries to index the other languages via the sitemap XML files. Would it be possible to exclude other languages? The naming is somewhat standardised (e.g. de_de, en_us, etc.) and the language codes are part of the sitemap XML filenames.
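As an illustration, a Swift sketch of such a filter, assuming the locale code appears in the sitemap filename the way it does on the Microsoft docs site (the URLs below are made up for the example):

```swift
import Foundation

// Keep a sitemap URL only if its filename carries no locale code at all,
// or carries the locale we actually want.
func keepSitemap(_ url: URL, wantedLocale: String) -> Bool {
    // Normalise "de_de" style names to "de-de" before comparing.
    let name = url.lastPathComponent.lowercased()
        .replacingOccurrences(of: "_", with: "-")
    let localePattern = "(?<![a-z])[a-z]{2}-[a-z]{2}(?![a-z])"
    guard name.range(of: localePattern, options: .regularExpression) != nil else {
        return true   // no locale marker in the filename: keep, to be safe
    }
    return name.contains(wantedLocale)
}

let sitemaps = [
    URL(string: "https://learn.microsoft.com/sitemaps/sitemap-de-de-1.xml")!,
    URL(string: "https://learn.microsoft.com/sitemaps/sitemap-en-us-1.xml")!,
]
print(sitemaps.filter { keepSitemap($0, wantedLocale: "de-de") })
// prints only the de-de sitemap URL
```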
I think if Bloo can handle the Microsoft docs, it should be able to handle everything. I am open to other suggestions.