gigablast / open-source-search-engine

Nov 20 2017 -- A distributed open source search engine and spider/crawler written in C/C++ for Linux on Intel/AMD. From gigablast dot com, which has binaries for download. See the README.md file at the very bottom of this page for instructions.
Apache License 2.0
1.54k stars, 441 forks

Anyone know why there is a segmentation fault/core dump after submitting a URL to crawl? #204

Open tcreek opened 1 year ago

tcreek commented 1 year ago

I am getting that, and it seems others are as well:

https://github.com/gigablast/open-source-search-engine/issues/199#issue-1550056077

twistdroach commented 1 year ago

Hey I realize this is an old question, but the segfaults are due to a couple different issues. I fixed a few of them and am able to run a cluster of nodes doing lots of spidering (with and without proxies) and service random queries for 24 hours or so without a segfault so far.

My code is here: https://code.moldybits.net/Forks/open-source-search-engine/commits/branch/devel

The relevant commits are: 09a32168997dc4252ca252a97ae4dcfe8b07d186 35e793577990f8344d707a8169bad2efc87d3b2e d4c60a2086c7db0fde6f199c9291e8f335dcd19c

My plan is to get the optimization turned on across the Makefile, get it to run reliably, then publish some new packages and tune up some of the documentation. All in all, I think this is a cool project and deserves a little love to keep it alive.

tcreek commented 1 year ago

> Hey I realize this is an old question, but the segfaults are due to a couple different issues. I fixed a few of them and am able to run a cluster of nodes doing lots of spidering (with and without proxies) and service random queries for 24 hours or so without a segfault so far.
>
> My code is here: https://code.moldybits.net/Forks/open-source-search-engine/commits/branch/devel
>
> The relevant commits are: 09a32168997dc4252ca252a97ae4dcfe8b07d186 35e793577990f8344d707a8169bad2efc87d3b2e d4c60a2086c7db0fde6f199c9291e8f335dcd19c
>
> My plan is to get the optimization turned on across the Makefile, get it to run reliably, then publish some new packages and tune up some of the documentation. All in all, I think this is a cool project and deserves a little love to keep it alive.

Thanks for continuing the work. I cloned the devel branch and tried to compile it on Debian 12, but it throws some errors:

Mem.h:219:35: error: ISO C++17 does not allow dynamic exception specifications
  219 | void operator new (size_t size) throw (std::bad_alloc);
      |                                 ^~~~~

I changed it to void* operator new(size_t size) noexcept;
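For context, C++17 removed dynamic exception specifications outright, so the old throw(std::bad_alloc) form has to go. Note that noexcept is not quite a drop-in replacement: a noexcept operator new is expected to return nullptr on failure, while simply dropping the specification keeps the original "may throw std::bad_alloc" contract. A minimal sketch of the distinction, using a made-up class rather than anything from Mem.h:

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>

// Illustration only -- "Buf" is a hypothetical class, not one from the
// gigablast sources.
struct Buf {
    // Pre-C++17 declaration, rejected by C++17 compilers:
    //   void *operator new(size_t size) throw(std::bad_alloc);

    // C++17-compatible and semantically equivalent (may still throw):
    void *operator new(std::size_t size) {
        if (void *p = std::malloc(size))
            return p;
        throw std::bad_alloc();
    }
    void operator delete(void *p) noexcept { std::free(p); }
};

int main() {
    Buf *b = new Buf;  // routed through the class-specific operator new
    delete b;
    return 0;
}
```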

It is also throwing a lot of warnings.

Here is an example of one of the most common types of warning:

Rebalance.cpp:310:24: warning: invalid suffix on literal; C++11 requires a space between literal and string macro [-Wliteral-suffix]
  310 |     "numhostspershard: %"INT32"\n"
      |                          ^
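That warning is a side effect of C++11 user-defined literals: with no space, "%"INT32 is parsed as a string literal with an (unknown) INT32 suffix instead of as two tokens. A small sketch of the warning and the fix, with PRId32 from <cinttypes> standing in for whatever the project's INT32 macro actually expands to:

```cpp
#include <cinttypes>
#include <cstdio>

// Assumption: gigablast defines INT32 as a printf format macro; PRId32
// plays that role in this sketch.
#define INT32 PRId32

int main() {
    int32_t numHostsPerShard = 8;
    // Pre-C++11 style; warns under -std=c++11 and later (macro glued to
    // the literal):
    //   printf("numhostspershard: %"INT32"\n", numHostsPerShard);
    // With spaces, the macro is no longer parsed as a literal suffix:
    printf("numhostspershard: %" INT32 "\n", numHostsPerShard);
    return 0;
}
```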

tcreek commented 1 year ago

Update: after changing that one line, it compiled just fine.

However, just like all the other times, it crashes as soon as I press "submit" on the spider search:

1701529580212 000 db: Saving 128000 bytes of cache to /home/trent/open-source-search-engine//dns.cache
1701529580213 000 gb: calling abort from main process with pid of 1194 (main process)
1701529580213 000 gb: Joining with all threads
1701529580214 000 threads: killing all threads
1701529580215 000 gb: Dumping core after saving.
Segmentation fault

Better luck on this next time.

subhan109fx commented 11 months ago

If I download this and build it, am I still able to browse the internet with this search engine?

tcreek commented 11 months ago

> If I download this and build it, am I still able to browse the internet with this search engine?

Good luck on getting it running. It seems to be crashing for most people, and those who have managed to get it running just won't explain how they did it.

Anyway, to answer your question, it's a search engine, not a browser.

Overdrive5 commented 8 months ago

@tcreek, did you ever figure this out?

tcreek commented 8 months ago

No, I gave up. There is another search engine available called Qwazr, which was formerly called Open Search Server.

twistdroach commented 7 months ago

hey @Overdrive5 & @tcreek - I just saw these messages now...if either of you are still interested in trying to use this codebase I might fork it on GitHub to maintain (so people can open issues there). Let me know if this is the case. I was able to build it on almalinux & ubuntu (and fedora in the past), but I haven't tried debian recently. I will try it and let you know. Just out of curiosity - what were you going to use this for? I personally played with it for a bit & thought the search was better than I have gotten from the other open source engines I've found.

Let me know what OS's you are planning to use. I believe I had the RPM building a few months ago on alma & fedora but my memory of that is hazy now.

brianrasmusson commented 7 months ago

@twistdroach You will be wasting your life, trust me.

twistdroach commented 7 months ago

Negativity aside - I pushed a docker container for experimenting with this here: https://hub.docker.com/r/moldybits/open-source-search-engine

I also did fork this repo on GitHub & began mirroring my personal repo: https://github.com/twistdroach/open-source-search-engine

I'd be interested in hearing about anyone playing with this and their experiences. The code is old & dusty, but I really do like how well the search & "gigabits" feature seems to work (at least with the small amount of data I have fed it). Anyway, I wouldn't use it for anything important, but it's a fun toy at this point.

Next thing on my list is to revive a patch I had at one point that fixed the segfaults from setting optimization (-O3), but that is not applied in these changes as I haven't played with it in about 6 months and I don't remember where I left off.

subhan109fx commented 7 months ago

Thank you


brianrasmusson commented 7 months ago

We spent years working with the code at Findx, and it is among the biggest regrets of my life. It never became production ready, despite our major rewrites and stability fixes. But as you say, as a toy, interesting enough.

tcreek commented 7 months ago

> hey @Overdrive5 & @tcreek - I just saw these messages now...if either of you are still interested in trying to use this codebase I might fork it on GitHub to maintain (so people can open issues there). Let me know if this is the case. I was able to build it on almalinux & ubuntu (and fedora in the past), but I haven't tried debian recently. I will try it and let you know. Just out of curiosity - what were you going to use this for? I personally played with it for a bit & thought the search was better than I have gotten from the other open source engines I've found.
>
> Let me know what OS's you are planning to use. I believe I had the RPM building a few months ago on alma & fedora but my memory of that is hazy now.

Open Search Server has been an abandoned project for some years now:

https://sourceforge.net/p/opensearchserve/discussion/947147/thread/a2ef9cfb/?limit=25#3ba5

Seems OpenSearchServer 2.0 was supposed to come out, but instead they started calling it Qwazr for some reason.

https://github.com/qwazr

https://www.qwazr.com/

Now there has been no activity on Qwazr in a couple of years.

I plan on using Debian 12.

For this GigaBlast, I went back to Debian 9 to try to get it to work, and I still get the same result: a segmentation fault.

As for why I want to use it:

https://en.wikipedia.org/wiki/Search_engine_manipulation_effect

and censorship, of course.

Overdrive5 commented 7 months ago

@tcreek, I am using a lightweight derivative of Ubuntu Focal Fossa called FossaPup64 with a dev environment added in.

I got it to compile and work without a segfault by editing this project's Makefile, changing -O2 to -O0 on lines 86, 97, and 101.
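If it helps anyone following along, the same edit can be applied in one shot rather than by hunting line numbers (which are from this commenter's checkout and may drift between versions); this simply disables optimization everywhere in the Makefile:

```sh
sed -i 's/-O2/-O0/g' Makefile
```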

I have only tested it with 1 core.

It seems my ISP or DNS provider clamped down on my account after I scanned 500k+ websites, though.

I want censorship-free searching as well.

Overdrive5 commented 7 months ago

> We spent years working with the code at Findx, and it is among the biggest regrets of my life. It never became production ready, despite our major rewrites and stability fixes. But as you say, as a toy, interesting enough.

@brianrasmusson care to share where you left off so we don't have to reinvent the wheel?

twistdroach commented 7 months ago

@brianrasmusson - ah - Findx was previously Privacore, right? So then you were the source of the privacore repo? Sorry you had such a bad experience with it, but cool to see you still lurking :)

@Overdrive5 - I think their repo is here: https://github.com/privacore/open-source-search-engine

Somewhere I remember reading a comment from someone at Privacore - maybe it was you - about the gigablast crawler having a bug that would get you banned when crawling sites - I'm assuming for not honoring robots.txt or going crazy and request bombing due to some bug. Do you know what the bug was? I saw you guys rewrote the crawler portion, so I was never clear on what went wrong there, but it always makes me a little nervous when I play with it...

brianrasmusson commented 7 months ago

Several bugs, resulting in both scenarios you describe. Sometimes not respecting robots.txt, other times bombarding a site with requests. Got our crawl servers blocked by firewalls multiple times, and it was a pain. Yes, the privacore fork is ours, but I won't comment on anything in it. That is all behind me. I just got an email notification from GitHub about an update here, which is what triggered me. So my advice is still - run away.

twistdroach commented 7 months ago

@brianrasmusson understood - thanks for the response. Let me know if you have another open source web search/crawler that needs contributing to.

tcreek commented 7 months ago

> @tcreek, I am using a lightweight derivative of Ubuntu Focal Fossa called FossaPup64 with a dev environment added in.
>
> I got it to compile and work without a segfault by editing this project's Makefile, changing -O2 to -O0 on lines 86, 97, and 101.
>
> I have only tested it with 1 core.
>
> It seems my ISP or DNS provider clamped down on my account after I scanned 500k+ websites, though.
>
> I want censorship-free searching as well.

It worked! Thanks so much!

I guess the next update to it should be image search.

@Overdrive5 Any way to get it to use more than one core?

Overdrive5 commented 7 months ago

@tcreek, I did get a second core running after quite a bit of frustration. I tried all sorts of things trying to figure out why it would not scp to the extra core directory. I had to:

- follow the instructions in the FAQ
- install openssh-server
- open port 22 on my firewall (security concern)
- disable PAM in /etc/ssh/sshd_config (security concern)

amongst other things I can no longer remember.
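For anyone reproducing that setup, the PAM change mentioned above is a single sshd_config directive (shown here only as an illustration, with the security trade-off the commenter flags), followed by restarting the sshd service:

```
# /etc/ssh/sshd_config
UsePAM no    # security trade-off: SSH skips PAM account/session handling
```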

Overall, I'm not recommending it. Not worth the frustration. Do it at your own risk. I experimented with a fresh, clean Linux install instead of my production install.

Also, the multi-core process used here was developed 10-25 years ago, before other, better technologies came along.

It will populate a separate database for each core in use, so it will fill up a drive much faster, with some redundant replication I am guessing. I would love to figure out how to do multi-core, single-database spidering with single-core searching.

I am playing with this as my own personal censorship-free search engine for select subject matters I am interested in. I need to improve filtering. My rough calc is that "only" USA websites right now (~3.3 billion) would need 100 TB+ before mirroring. Maybe more. I have ZERO interest in sucking up the whole internet for general searching. I have built @twistdroach's fork and it compiles and works with "-O3" for me. So I will probably head in that direction.

Good luck in your endeavors!

twistdroach commented 7 months ago

I had a system set up about 6 months ago that had 8 or 16 cluster members... it is fiddly. I'll reproduce it and try to document it on my fork in a day or two. I find it a shame that this codebase is left mostly unusable due to a lack of clear docs and a few good builds.

Going to update my fork shortly to default to -O3; I have done minimal testing with it this way and it seemed mostly stable.

A note about the segfaults: the technique of aborting when reaching some unrecoverable scenario in a server app is common, but it is taken to the extreme here - a missing config file and many other circumstances will result in a segfault. On top of that, there are certainly many real issues left that will cause legitimate segfaults (I fixed the ones I ran into using the system lightly). Anyway, just wanted to say: don't be dismayed if you get the occasional crash. The app is built to do that and to restart to recover from "unrecoverable" situations. If you get one that is reproducible, feel free to patch and submit a fix if you are able, or file an issue on my fork.
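To make that pattern concrete, here is a minimal sketch of the crash-to-recover idea described above (the names are illustrative, not from the gigablast code): the process treats an unrecoverable condition as fatal, dumps core, and relies on a supervisor or wrapper script to restart it in a clean state.

```cpp
#include <cstdio>
#include <cstdlib>

// Hypothetical helper in the spirit of gb's "calling abort from main
// process" log line: log, then abort so a supervisor restarts the process.
[[noreturn]] static void fatal(const char *msg) {
    std::fprintf(stderr, "gb: %s -- dumping core after saving\n", msg);
    // A real server would flush caches and save state here first.
    std::abort();  // raises SIGABRT; produces a core dump if ulimits allow
}

int main() {
    std::FILE *cfg = std::fopen("gb.conf", "r");
    if (!cfg)
        fatal("missing config file");  // "unrecoverable" -> crash and restart
    std::fclose(cfg);
    return 0;
}
```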

@Overdrive5 your use case is exactly what brought me to this codebase!

tcreek commented 7 months ago

> @tcreek, I did get a second core running after quite a bit of frustration. I tried all sorts of things trying to figure out why it would not scp to the extra core directory. I had to:
>
> - follow the instructions in the FAQ
> - install openssh-server
> - open port 22 on my firewall (security concern)
> - disable PAM in /etc/ssh/sshd_config (security concern)
>
> amongst other things I can no longer remember.
>
> Overall, I'm not recommending it. Not worth the frustration. Do it at your own risk. I experimented with a fresh, clean Linux install instead of my production install.
>
> Also, the multi-core process used here was developed 10-25 years ago, before other, better technologies came along.
>
> It will populate a separate database for each core in use, so it will fill up a drive much faster, with some redundant replication I am guessing. I would love to figure out how to do multi-core, single-database spidering with single-core searching.
>
> I am playing with this as my own personal censorship-free search engine for select subject matters I am interested in. I need to improve filtering. My rough calc is that "only" USA websites right now (~3.3 billion) would need 100 TB+ before mirroring. Maybe more. I have ZERO interest in sucking up the whole internet for general searching. I have built @twistdroach's fork and it compiles and works with "-O3" for me. So I will probably head in that direction.
>
> Good luck in your endeavors!

Have you or @twistdroach tried Qwazr, aka Open Search Server?

https://github.com/qwazr

https://sourceforge.net/p/opensearchserve/discussion/947147/thread/a2ef9cfb/?limit=25#3ba5

Overdrive5 commented 7 months ago

@tcreek

Qwazr is Java-based, and I have near-zero Java experience.

I am slightly knowledgeable in C.

So currently, I favor this codebase, unless I find another C-based crawler/search engine.

tcreek commented 7 months ago

Java is based on C++, so it should not be that hard to adapt to.