gigablast / open-source-search-engine

Nov 20 2017 -- A distributed open source search engine and spider/crawler written in C/C++ for Linux on Intel/AMD. From gigablast dot com, which has binaries for download. See the README.md file at the very bottom of this page for instructions.
Apache License 2.0

About Gigablast Options #126

Open bubul01 opened 7 years ago

bubul01 commented 7 years ago

Gigablast is a good search engine, but it lacks options.

In fact, I've tried YaCy (a P2P search engine, http://yacy.net/), but its performance is really poor. It does have a lot of good options, though: adding sitemaps, adding RSS feeds directly with an option to reload them every X amount of time (and the same for websites), more crawl options (crawl linked websites, ...), a page to see all crawled websites and pages, an option to add OpenSearch websites so that every time a search is made, the OpenSearch websites are also queried to add more results, etc.

Gigablast with YaCy's options would make a very good search engine!

I don't know enough programming to add these options.

martinvahi commented 7 years ago

I'm not a Gigablast developer and I'm not sure whether my current comment helps, but the Gigablast has a really nice API. If my instance, which is the fruit of an experiment that is described at my blog (archival copy), is up at the time You try this link, then You should be able to experiment with the

&format=json&n=20

part of the link; such a query can take ~30 s to complete. It's a matter of taste, but I would try to modularize applications as much as possible, preferably keeping different parts of an application even in different operating system processes (archival copy).
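
To make it concrete, a whole such query as a tiny script might look something like the following sketch. The host and port here are just placeholder assumptions for whatever Gigablast instance You query (a local instance typically listens on port 8000); the parameters are the ones above:

```python
# Query a Gigablast instance and print the raw JSON reply.
# The host and port below are placeholder assumptions for a local
# instance; point them at the instance you actually want to query.
import urllib.parse
import urllib.request

base = "http://127.0.0.1:8000/search"
params = {"q": "free software", "format": "json", "n": 20}
url = base + "?" + urllib.parse.urlencode(params)

# Such queries can take tens of seconds, hence the generous timeout.
with urllib.request.urlopen(url, timeout=60) as response:
    print(response.read().decode("utf-8"))
```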

I was about to add a comment to #118 saying that the Gigablast probably needs a thorough overhaul: starting with the introduction of OpenMP, then maybe replacing the custom database implementation with some actual database, maybe the Firebird, because it is supposedly robust and has a closed-source-compatible license, and then the whole C++ code should also be formally verified and tested for memory errors. Inspiration: source_1, source_2.

Thank You for reading my comment :-)

lisawebcoder commented 4 years ago

I am trying to run a script that I successfully ran for a Wikipedia API endpoint call, but I can't make it work for the Gigablast API. Do you have a working webapp API script for Gigablast?

Thank you, Lisa

lisawebcoder commented 4 years ago

Hello, I am trying to run a webapp script with a call to the GB API, but it fails. I am able to run Wikipedia API calls in a webapp, but not this API so far. Does anyone have a basic script with user input that takes a query and calls the API for the data/answers?

LISA

martinvahi commented 4 years ago

@lisawebcoder

... Do you have a working webapp API script for Gigablast? ...

Let's just say that I am grateful to the http://gigablast.com/ author for making the Gigablast an open source project, but the history of the Gigablast project is such that the project started in an era when not even the MySQL was available. So the author, Matt Wells, had to get by with implementing some rudimentary database surrogates himself and learn about the whole thing at the same time, because the Internet was also in its infancy: no fancy database scientific papers like in the year 2020. Obviously, if the extra hard work of wrestling with character encodings, linguistics and all the rest is added on top of that hard work, then a 1-man project has to start cutting corners, even if the author were a superman. Ironically, we can say: isn't it great that he did not have to start by implementing his own programming language, like the 70s and 80s guys had to do. A year-2020 software developer has so many shoulders of giants to stand on, but it is also evident that by 2020 "standards" the giants are buried in the mud almost up to their necks, and we are aware of the locations of stones that can be used as replacements for the shoulders of the giants.

That is to say, the Gigablast had its place in history, but time has moved on and the Gigablast project should be seen as a museum piece, a lot like old planes at an aviation museum: interesting to watch, but not so good for flying, even if they have been fully renovated and restored to their original functionality to the point that some brave souls dare to offer demo flights with them.

Thank You for reading my comment.

lisawebcoder commented 4 years ago

Hello, thank you for your eloquent comment. Albeit you did not directly, or at all, address my question, it's OK; I do appreciate you having taken the time to respond.

I have the API running, I suppose in a partial way, but I can't directly parse the JSON to output on a webpage as text or marked-up code like HTML. I am only able to display the results when I put format=html, but that gives the gigablast.com results page.

I respectfully want my own URL page with the results.

Do you know how to parse the results from JSON? Do you have your API working? Is it open source? If you want to see how my API works, please let me know, and if you know how to output JSON to the webpage, please advise. Good day.

lisa

martinvahi commented 4 years ago

@lisawebcoder You may want to read

("Search Engines, Information Retrieval in Practice") https://ciir.cs.umass.edu/downloads/SEIRiP.pdf

It's a "legally" freely downloadable book about search engines. Related links:

https://www.lemurproject.org/

https://sourceforge.net/projects/lemur/ (seems to include the source of the Lemur project's successor, project Galago)

You may also consider the following project as a component of Your experiments: http://www.seg.rmit.edu.au/zettair/
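
As for the concrete question of turning the JSON reply into Your own results page, a minimal server-side sketch might look something like the following. Note that the "results", "title", "url" and "sum" field names are my assumptions about the shape of the JSON reply; inspect the actual reply of the instance You query before relying on them:

```python
# Render Gigablast JSON results as a standalone HTML page.
# NOTE: the "results", "title", "url" and "sum" field names are
# assumptions about the JSON shape; inspect the actual reply first.
import html
import json
import urllib.parse
import urllib.request

query = input("Search query: ")
url = ("http://127.0.0.1:8000/search?"
       + urllib.parse.urlencode({"q": query, "format": "json", "n": 20}))

with urllib.request.urlopen(url, timeout=60) as response:
    data = json.loads(response.read().decode("utf-8"))

items = []
for result in data.get("results", []):
    title = html.escape(result.get("title", "(no title)"))
    link = html.escape(result.get("url", ""))
    summary = html.escape(result.get("sum", ""))
    items.append('<li><a href="%s">%s</a><br>%s</li>' % (link, title, summary))

with open("results.html", "w", encoding="utf-8") as out:
    out.write("<!doctype html><meta charset='utf-8'><title>Results</title>"
              "<h1>Results for %s</h1><ul>%s</ul>"
              % (html.escape(query), "\n".join(items)))
print("Wrote results.html")
```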

martinvahi commented 4 years ago

@lisawebcoder A few very abstract but, software-architecture-wise, important observations:

Thank You for reading my comment(s).

lisawebcoder commented 4 years ago

Thank you for your valuable information; it's very kind of you.

I will work on this and study up on it.

Good day.

Lisa

lisawebcoder commented 4 years ago

Hello, yes, I agree with your point 4 about the Linux OS. Also, if I am not mistaken, the P2P-type system of decentralized search engine machine nodes, is that also how Yioop works? I don't know if Gigablast is decentralized; no, it seems central, but I don't know. I guess off-premise computers are an option if we can't get several computers for a project or if it's too much maintenance.

Good day.

martinvahi commented 4 years ago

@lisawebcoder

Also, if I am not mistaken, the P2P-type system of decentralized search engine machine nodes, is that also how Yioop works?

Thanks to Your comment here, I was able to add the Yioop to my wiki that contains references to various P2P systems. I wasn't aware of the Yioop before. I admit that my Silktorrent wiki, a Fossil repository, is currently still a mess. The Fossil repository is basically a ~40 GiB "document" that is part of my own learning process, my set of reusable notes about P2P systems, where I try to gather a lot of references, including generally offline academic works from the archive.org, and then I try to study the projects, their architectures and the theory behind them, so it's an ongoing process for me. The messy Fossil repository:

https://www.softf1.com/cgi-bin/tree1/technology/flaws/silktorrent.bash/wiki?name=List+of+Similar+Projects

A download link and local usage instructions MIGHT reside at:

https://www.softf1.com/cgi-bin/tree1/technology/flaws/silktorrent.bash/wiki?name=User+Guide

Actually, I update the online version only a few times a year, which explains why You won't find the Yioop reference in the current online version.

I might be repeating myself here, as I have lost track of what I have written to whom, but the issues with any P2P system include P2P-system-specific attacks (as with all systems), and one of those attacks is that the attacker adds its own modified, malicious nodes to the P2P network, and from that point onwards things can vary. For example, once upon a time the NSA tried to add a lot of nodes to the Tor network to conduct a deanonymization attack, but there might also be various forms of P2P-system-specific DoS attacks (read: DoS attacks can be a form of censorship, like radio jammers are for radio communications). That is to say, when constructing a P2P system that aims to combat censorship and deliver privacy, then regardless of whether it is some WiFi-based physical mesh network or some overlay network like the Tor, the security aspects need to be taken into account from the very start, at the core design.

As of 2020_06 I do not have my own P2P search engine or any other related code for a demo, but as part of learning about P2P systems I have been thinking about what the architecture of a P2P search engine MIGHT look like if the possible attacks (that I'm aware of) are taken into account, and what I have come up with, as of 2020_06, is described at

https://www.softf1.com/cgi-bin/tree1/technology/flaws/silktorrent.bash/wiki?name=Application+Example:+Distributed+Search+Engine

That spec draft of mine is probably full of nonsense and once I really start to write it, I might do it very differently, but as of 2020_06_08 the ideas of "artifact evaluation standardization" and a "personal reputation list" have prevailed in my mind.

A fundamental property of P2P systems is that as the data is stored among nodes, it's the nodes that have to upload the data to other nodes that want to see it. Network traffic can be throttled, and the BitTorrent idea still keeps the download speeds fast even if an individual node's upload traffic has been throttled, so the network traffic is not really a problem, but storage device (HDD, SSD, etc.) access is certainly a very serious problem even when there's network bandwidth to spare, because HDD/SSD access slows the system as a whole down substantially. Easy to test: run some BitTorrent client with some popular Linux ISO images or other popular files on Linux, and every time the HDD is accessed, the whole system hangs for a tiny fraction of a second. Supposedly that's a matter of the Linux I/O scheduler, and supposedly there are other schedulers to choose from than the standard one, but I haven't tested the alternatives yet, if the alternatives even exist at all.
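
For what it's worth, alternative I/O schedulers do exist in the mainline Linux kernel, and the ones available for a given block device can be listed through sysfs; a minimal sketch, assuming the usual /sys/block layout:

```python
# Print each block device's selectable I/O schedulers; the kernel
# marks the active one with [brackets], e.g. "[mq-deadline] kyber bfq none".
import glob

for path in sorted(glob.glob("/sys/block/*/queue/scheduler")):
    device = path.split("/")[3]  # /sys/block/<device>/queue/scheduler
    with open(path) as f:
        print(device + ": " + f.read().strip())
```

A different scheduler can be selected at runtime by writing its name into the same file as root; whether that removes the hiccups is something I have not measured.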

Storage space is also a serious problem. The SSD space of the fancy small laptops is quite limited, and unlike a desktop computer, a fancy, slim laptop offers no option to add extra HDD/SSD devices. That is to say, typical modern end-user computers are practically terminals, not workstations. iPads and other "Pads" might even be seen as "dumb terminals", because due to the lack of a proper keyboard they do not really allow the end user to express themselves. The end users of the various "Pads" can mostly just watch passively or type in dumbed-down Tweets from a touchscreen.

A solution to the storage space problem and the HDD/SSD-access-induced slowdown problem might be what I call "a book-shelf server". In the past, people had book-shelves for storing data carriers, paper books, at home. The book-shelves also held items like family photo albums, family history documentation and encyclopedias. Nowadays photos are digital, and the rest of the data is also in a form that is meant to be consumed with a computer, including home videos and podcasts. It makes perfect sense to have a book-shelf server that never shares the home videos anywhere outside of the home LAN, but contributes to the distributed hosting of the "encyclopedia", blog post, free/public book and censorship-resistant journalism categories of the classical book-shelf. (Think of "forbidden books" that some regime wants to burn.)

ThePirateBay has created exactly that kind of a system, a "new WWW", and it is called ZeroNet, with the exception that it does not contain the private data storage part, which shouldn't be part of the "new WWW" anyway, and it lacks the kind of multi-user support that is needed in a family setting. Let's just say the ZeroNet is still a work in progress: https://zeronet.io/ It's pretty hard to install, so I repackaged one very old version of it that I use myself. I managed to get it to work even on the Windows 10 Linux layer: https://osdn.net/projects/mmmv-repackaging-projects-t1/releases/72265 Another system very similar to it is the Beaker: https://beakerbrowser.com/ In my view the Beaker is easier for ordinary end users to install, but it is technically a failure in the sense that, due to being integrated with a web browser, it is far harder to maintain and port, which makes the ZeroNet a far more future-proof system than the Beaker. Historic examples of ZeroNet-like systems include the absolute classic of the P2P web, the https://freenetproject.org/ , and then a less popular system, the I2P: https://geti2p.net/ One of the problems with the Freenet and the I2P is that they have been written in Java, which is hard to port, and that in turn makes the Freenet and the I2P hard to port.

The core idea of the ZeroNet is that static web sites, "zites" in ZeroNet lingo, are folders with HTML, JavaScript, CSS, etc., and those are then distributed around the ZeroNet P2P network. The zites are versioned; only the creator of a zite can increment its version, and newer copies will overwrite older copies on the ZeroNet P2P network, unless a particular node has settings that tell it not to overwrite the old version. A forum is just a jointly editable zite, a pile of messages, and the zite author sets the access rights of its zites. The ZeroNet node is a Python program that runs a web server, which serves the zites, a lot like a WiFi router serves its administration GUI. Zites (HTML + browser-side JavaScript + CSS) can communicate with the local ZeroNet node through a ZeroNet JavaScript API.
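
To illustrate the point that the node is just a local web server: with a node running, fetching a zite is an ordinary HTTP request to the node's default port, 43110. A minimal sketch, using the forum address mentioned below as the example zite:

```python
# Fetch a zite through a locally running ZeroNet node; the node acts as
# an HTTP server on 127.0.0.1:43110 that serves the zite's wrapper page.
import urllib.request

zite = "Talk.ZeroNetwork.bit"  # any zite address or .bit name should work
url = "http://127.0.0.1:43110/" + zite + "/"

with urllib.request.urlopen(url, timeout=30) as response:
    print(response.read().decode("utf-8", errors="replace")[:500])
```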

Basically, the core architecture of the ZeroNet is exactly what the founder of the archive.org, Brewster Kahle, asked for: https://blog.archive.org/2015/02/11/locking-the-web-open-a-call-for-a-distributed-web/ The end-user view of the ZeroNet needs work, especially the multi-user support in terms of sharing the zite copies, but that might be done with wrapper scripts without changing anything about the current ZeroNet node implementation. In theory, placing all of the users on a single partition of a copy-on-write (COW) file system might be part of the solution.

There is a thread about search engines for the ZeroNet. A ZeroNet URL: http://127.0.0.1:43110/Talk.ZeroNetwork.bit/?Topic:1590565317_17AMQVbBa12XB3xFWKDnP2dCEx92TaiY7X/How+do+you+do+content+discovery As of 2020 the ZeroNet uses link collections that have some fancy JavaScript-based GUI that mimics a search line, a lot like the Ruby documentation project uses at https://ruby-doc.org/core-2.7.1/ , but the idea of a P2P search engine is that nodes do not download all data, only the data that is interesting to the owner of a node. That requires real-time communication like that offered by the Tor or the https://matrix.org/ or any of the real-time anonymous P2P chat applications. A search engine API might be implemented by using a chat bot. A citation from the paper

The Economics of Censorship Resistance 
by George Danezis and Ross Anderson

https://www.cl.cam.ac.uk/~rja14/Papers/redblue.pdf

"Early peer-to-peer systems, such as the Eternity Service, sought to achieve censorshop resistance by distributing content randomly over the whole Internet. An alternative approach is to encourage nodes to serve resources they are interested in."

Oh well, I did not plan to write such a long comment when I started writing this, but thank You for reading it.

lisawebcoder commented 4 years ago

Hello, you are welcome, and thank you for your awesome comments. Good day, Lisa