kiwix / kiwix-tools

Command line Kiwix tools: kiwix-serve, kiwix-manage, ...
https://download.kiwix.org/release/kiwix-tools/
GNU General Public License v3.0

Cannot index the kiwix-serve pages using Yacy search engine. #527

Closed: ISJ-439 closed this issue 2 years ago

ISJ-439 commented 2 years ago

Hello,

When I add a new page to crawl, which is a locally hosted page on the same server, I get a 503 error. However, with Lynx, a CLI web browser for Linux, I can load the page fine.

How would I troubleshoot this further?

Yacy Console: https://i.imgur.com/IwpjkuS.png

Kiwix Server Logs:

======================
Requesting : 
full_url  : /w/load.php
method    : GET (0)
version   : HTTP/1.1
request#  : 50
headers   :
 - accept : '*/*'
 - accept-encoding : 'gzip, deflate'
 - accept-language : 'en-US,en;q=0.9'
 - connection : 'keep-alive'
 - dnt : '1'
 - host : 'example.com:1234'
 - referer : 'http://example.com:1234/2016_-_wikipedia_en_all_2016-02/A/Main_Page.html'
 - sec-gpc : '1'
 - user-agent : 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36'
arguments :
 - debug : 'true'
 - lang : 'en'
 - modules : 'jquery,mediawiki'
 - only : 'scripts'
 - skin : 'vector'
 - version : 'vWmIJl0K'
Parsed : 
full_url: /w/load.php
url   : /w/load.php
acceptEncodingDeflate : 1
has_range : 0
is_valid_url : 1
.............
** running handle_content
Response :
httpResponseCode : 404
headers :
 - Content-Type: 'text/html'
 - Access-Control-Allow-Origin: '*'
 - Cache-Control: 'no-cache, no-store, must-revalidate'
 - Content-Encoding: 'deflate'
 - Vary: 'Accept-Encoding'
Request time : 0.000758s
----------------------
======================
Requesting : 
full_url  : /2016_-_wikipedia_en_all_2016-02/A/Main_Page.html
method    : GET (0)
version   : HTTP/1.1
request#  : 51
headers   :
 - accept : 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
 - accept-charset : 'ISO-8859-1,utf-8;q=0.7,*;q=0.7'
 - accept-encoding : 'gzip'
 - accept-language : 'en-us,en;q=0.5'
 - connection : 'close'
 - host : 'example.com'
 - user-agent : 'yacybot (intranet-local; amd64 Linux 5.10.0-10-amd64; java 1.8.0_242; Etc/en) http://yacy.net/bot.html'
arguments :
Parsed : 
full_url: /2016_-_wikipedia_en_all_2016-02/A/Main_Page.html
url   : /2016_-_wikipedia_en_all_2016-02/A/Main_Page.html
acceptEncodingDeflate : 0
has_range : 0
is_valid_url : 1
.............
** running handle_content
Found A/Main_Page.html
mimeType: text/html
Response :
httpResponseCode : 200
headers :
 - Content-Type: 'text/html'
 - Access-Control-Allow-Origin: '*'
 - ETag: '"1643267520242259327/c"'
 - Cache-Control: 'max-age=2723040, public'
Request time : 0.000456s
----------------------
======================
Requesting : 
full_url  : /2016_-_wikipedia_en_all_2016-02/A/Main_Page.html
method    : GET (0)
version   : HTTP/1.1
request#  : 52
headers   :
 - accept : 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
 - accept-charset : 'ISO-8859-1,utf-8;q=0.7,*;q=0.7'
 - accept-encoding : 'gzip'
 - accept-language : 'en-us,en;q=0.5'
 - connection : 'close'
 - host : 'example.com'
 - user-agent : 'yacybot (intranet-local; amd64 Linux 5.10.0-10-amd64; java 1.8.0_242; Etc/en) http://yacy.net/bot.html'
arguments :
Parsed : 
full_url: /2016_-_wikipedia_en_all_2016-02/A/Main_Page.html
url   : /2016_-_wikipedia_en_all_2016-02/A/Main_Page.html
acceptEncodingDeflate : 0
has_range : 0
is_valid_url : 1
.............
** running handle_content
Found A/Main_Page.html
mimeType: text/html
Response :
httpResponseCode : 200
headers :
 - Content-Type: 'text/html'
 - Access-Control-Allow-Origin: '*'
 - ETag: '"1643267520242259327/c"'
 - Cache-Control: 'max-age=2723040, public'
Request time : 0.000472s
----------------------
======================
Requesting : 
full_url  : /robots.txt
method    : GET (0)
version   : HTTP/1.1
request#  : 53
headers   :
 - accept : 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
 - accept-charset : 'ISO-8859-1,utf-8;q=0.7,*;q=0.7'
 - accept-encoding : 'gzip'
 - accept-language : 'en-us,en;q=0.5'
 - connection : 'close'
 - host : 'example.com'
 - user-agent : 'yacybot (intranet-local; amd64 Linux 5.10.0-10-amd64; java 1.8.0_242; Etc/en) http://yacy.net/bot.html'
arguments :
Parsed : 
full_url: /robots.txt
url   : /robots.txt
acceptEncodingDeflate : 0
has_range : 0
is_valid_url : 1
.............
** running handle_content
Response :
httpResponseCode : 404
headers :
 - Content-Type: 'text/html'
 - Access-Control-Allow-Origin: '*'
 - Cache-Control: 'no-cache, no-store, must-revalidate'
Request time : 0.000521s
----------------------
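
The log above is long, so a small summary of each request (number, URL, user agent, response code) makes it easier to see whether kiwix-serve itself ever answered with a 503. The following is a minimal Python sketch, not part of kiwix-tools: `summarize_requests` and `SAMPLE_LOG` are hypothetical names, the parsing assumes the verbose log layout shown above, and the sample is an abbreviated copy of requests 51 and 53.

```python
import re

# Abbreviated copy of two request blocks from the log above.
SAMPLE_LOG = """\
======================
Requesting :
full_url  : /2016_-_wikipedia_en_all_2016-02/A/Main_Page.html
request#  : 51
headers   :
 - host : 'example.com'
 - user-agent : 'yacybot (intranet-local; amd64 Linux 5.10.0-10-amd64; java 1.8.0_242; Etc/en) http://yacy.net/bot.html'
Response :
httpResponseCode : 200
----------------------
======================
Requesting :
full_url  : /robots.txt
request#  : 53
headers   :
 - host : 'example.com'
 - user-agent : 'yacybot (intranet-local; amd64 Linux 5.10.0-10-amd64; java 1.8.0_242; Etc/en) http://yacy.net/bot.html'
Response :
httpResponseCode : 404
----------------------
"""


def summarize_requests(log_text):
    """Return one (request#, url, user_agent, status) tuple per logged request."""
    summaries = []
    # Assumption: every request block starts with a "Requesting :" line.
    for block in log_text.split("Requesting :")[1:]:
        number = re.search(r"request#\s*:\s*(\d+)", block)
        url = re.search(r"full_url\s*:\s*(\S+)", block)
        agent = re.search(r"user-agent : '([^']*)'", block)
        status = re.search(r"httpResponseCode : (\d+)", block)
        summaries.append((
            int(number.group(1)) if number else None,
            url.group(1) if url else None,
            agent.group(1) if agent else None,
            int(status.group(1)) if status else None,
        ))
    return summaries


if __name__ == "__main__":
    for number, url, agent, status in summarize_requests(SAMPLE_LOG):
        print(number, status, url, agent)
```

In the excerpt posted here, the yacybot requests received 200 and 404 responses; no 503 appears in this part of the kiwix-serve log.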

kelson42 commented 2 years ago

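@ISJ-439 Thank you for your bug report. Unfortunately, I don't understand it.

Please answer the following questions:

  1. What is the version of kiwix-serve you use? Which system?
  2. What is the detailed step-by-step procedure to reproduce the bug?
  3. What do you get?
  4. What do you expect?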

ISJ-439 commented 2 years ago

Hello,

Sorry it's not clear. I also wanted to say that the only reason I don't use the built-in search is the sheer volume of the library: it's just too large at 990GB and 16 separate ZIM files. YaCy is also a lot better at fuzzy searching.

>What is the version of kiwix-serve you use? Which system?
3.1.2

>What is the detailed step-by-step procedure to reproduce the bug?

If you were to do this yourself:

  1. On a Debian 11 system.
  2. 
    sudo apt install kiwix kiwix-tools docker.io
    sudo mkdir /opt/yacy_data
    sudo chmod 777 /opt/yacy_data

https://github.com/yacy/yacy_search_server/blob/master/docker/Readme.md

Default admin account

login: admin

password: yacy

You should modify this default password with page /ConfigAccounts_p.html when exposing publicly your YaCy container.

CONFIG

Use Case & Accounts

Basic Configuration

  1. Search portal for your own web pages
  2. Uncheck SSL and UPnP

Accounts

Admin Account

Select: Access only with qualified account

Peer User: adminuser

Set the passwords

Network Configuration

Distributed Computing Network for Domain

Select: Robinson Mode

Select: Private Peer

RAM/Disk Usage & Updates

Web Cache

HTCache Configuration

The maximum size of the cache: 50MB

Compression level: 0

Access Tracker

Server Access

Local Search access rate limitations

YaCy search

Max searches in 3s: 3

Max searches in 1mn: 30

Max searches in 10mn: 300

Portal Configuration

Generic Search Portal

Greeting Line: Search the archived copies of Wikipedia for removed or changed articles.

URL of Home Page: http://example.com:8090/

Index remote results: uncheck (this system is for searching the crawled pages only)

sudo docker run -d --name yacy -p 8090:8090 -p 8443:8443 -v /opt/yacy_data:/opt/yacy_search_server/DATA --log-opt max-size=200m --log-opt max-file=2 yacy/yacy_search_server:latest

KIWIX Server

sudo adduser kiwixuser

Use wget to download any of the ZIM files from https://dumps.wikimedia.org/other/kiwix/zim/wikipedia/

su - kiwixuser -c 'kiwix-serve --library /home/kiwixuser/example.zim --port 5555 --nosearchbar --daemon'



>What do you get?
An error 503 on a crawl but it works fine from 

>What do you expect?
I expect it to crawl the ZIM file and index the text within for later searches.

kelson42 commented 2 years ago

To me it sounds like this should be a bug report for YaCy, which I know nothing about. If everything works fine without YaCy, then it should be investigated in that direction. My best guess is that YaCy is flooding the server, which at some point can no longer answer properly for lack of resources. Full-text search in particular can be quite resource intensive. To start investigating, I would need a reproduction case without YaCy.

ISJ-439 commented 2 years ago

Thanks. The fact that it's returning a 503 is why I posted it here, as that's typically a server-related error.

kelson42 commented 2 years ago

@ISJ-439 Make sure that your crawler does not make more than one request per second. If you still get errors, please share the corresponding log.
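
For a scripted client, one way to stay at or below one request per second is to enforce a minimum interval between calls. The following is a minimal Python sketch under that assumption: `RateLimiter` is a hypothetical helper, not part of kiwix-tools or YaCy, and a crawler such as YaCy would need its own settings adjusted instead. The clock and sleep functions are injectable so the behaviour can be checked without real waiting.

```python
import time


class RateLimiter:
    """Make successive wait() calls return at least `min_interval` seconds apart."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time at which the previous call was released

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                # Too early: sleep for the rest of the interval.
                self._sleep(remaining)
                now = self._clock()
        self._last = now


if __name__ == "__main__":
    limiter = RateLimiter(min_interval=1.0)
    for page in ("A/Main_Page.html", "robots.txt"):
        limiter.wait()  # call this right before each HTTP request
        print("would fetch", page)
```

Calling `wait()` immediately before each request keeps the spacing independent of how long the request itself takes.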

ISJ-439 commented 2 years ago

@kelson42 That may be it. Are there any other anti-abuse settings to avoid?

kelson42 commented 2 years ago

No. You can use the "--threads" option to increase the throughput.

ISJ-439 commented 2 years ago

For future Googlers: this was an issue with other software parsing the pages without adjusting for the custom ports while doing so.
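
The log at the top of this issue is consistent with a port mismatch: the browser request carries `host : 'example.com:1234'`, while the yacybot requests carry `host : 'example.com'` with no port. The thread does not say exactly where the port was lost, so the following is only a hedged Python sketch of one possible check: `with_port` is a hypothetical helper that adds an explicit port to a crawl URL when none is present.

```python
from urllib.parse import urlsplit, urlunsplit


def with_port(url, port):
    """Return `url` with `port` set explicitly when the URL has none."""
    parts = urlsplit(url)
    if parts.port is not None:
        return url  # an explicit port is already present; leave it alone
    host = parts.hostname or ""
    if ":" in host:
        host = f"[{host}]"  # IPv6 literals need brackets in a URL
    netloc = f"{host}:{port}"
    # Preserve any "user:password@" prefix from the original URL.
    userinfo = parts.netloc.rpartition("@")[0]
    if userinfo:
        netloc = f"{userinfo}@{netloc}"
    return urlunsplit(parts._replace(netloc=netloc))


if __name__ == "__main__":
    start = "http://example.com/2016_-_wikipedia_en_all_2016-02/A/Main_Page.html"
    print(with_port(start, 1234))
```

URLs that already name a port, such as `http://example.com:8090/`, are returned unchanged.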

kelson42 commented 2 years ago

@ISJ-439 Without questioning your choice of YaCy for full-text searches, feel free to open a ticket or two if the current full-text search engine (Xapian) is missing a specific feature. We are not done improving Kiwix search capabilities, so clearly articulated user feedback is very valuable.

ISJ-439 commented 2 years ago

>@ISJ-439 Without questioning your choice of YaCy for full-text searches, feel free to open a ticket or two if the current full-text search engine (Xapian) is missing a specific feature. We are not done improving Kiwix search capabilities, so clearly articulated user feedback is very valuable.

Done sir/miss: https://github.com/kiwix/kiwix-tools/issues/528