medialab / hyphe

Websites crawler with built-in exploration and control web interface
http://hyphe.medialab.sciences-po.fr/demo/
GNU Affero General Public License v3.0

Error Loading Status #403

Closed: mere13 closed this issue 3 years ago

mere13 commented 3 years ago

Hi there!

I'm working through the Discovered web entities (removing the social media outlets en masse) and the system is removing them; however, it keeps disconnecting me each time I set a group to OUT. I then have to close my terminal and Docker and relaunch everything. I'm not sure whether I'm doing something wrong, whether it's a temporary glitch, or something else? Any advice would be appreciated.

Also, the most recent crawl job doesn't seem to have logged the cited statistics?

Sorry, another update... I've also just realized that it still appears to be indexing, but when I first logged on today all the jobs showed as completed. I'm wondering if I've somehow caused an error in the corpus.

As always, thank you! MP

boogheta commented 3 years ago

Hello @mere13, that is quite surprising and definitely unexpected behavior!... Could you paste any logs from the backend container before it gets closed down? Does the behavior still happen now that the corpus has indexed the whole crawl? It could also be that you're spending too long on the PROSPECT page without taking any action, so the corpus gets automatically closed after an idle period, which then disconnects you the next time you act in the web interface.
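In case it helps, here is a minimal sketch of how to capture those logs on a Docker install before the container goes away (the container name hyphe_backend_1 is an assumption; check docker ps for the real one on your setup):

    # list running containers to find the backend's actual name
    docker ps --format '{{.Names}}'
    # dump the last 200 log lines from the backend container
    docker logs --tail 200 hyphe_backend_1
    # or follow the logs live while reproducing the disconnect
    docker logs -f hyphe_backend_1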

mere13 commented 3 years ago

Thanks for getting back with me. :)

Weirdly, it showed no crawl jobs still running yesterday morning, and then when I logged back in, it showed 5 jobs still indexing. These jobs are still indexing today. The log is moving, so it appears to be working, but the numbers aren't changing at all, and it still shows 0 for all cites.

I have had it happen where I take too long prospecting and the corpus auto-closes. What's happening now is not that; it's definitely something else. I'm attaching the most recent log for reference; I think I captured everything in it. RecentLog.txt

mere13 commented 3 years ago

Quick update: these last 5 sites are still indexing two weeks later, at a rate of about 200 pages a day. Is that normal? The first two rounds didn't take nearly this long.

Thank you, as always!

boogheta commented 3 years ago

It is indexing only 200 crawled pages per day? That is indeed not normal. Could you share a few links to the websites and the depth of the crawls, so I can try to understand why it behaves like this?

mere13 commented 3 years ago

Yes, roughly 200 a day. Today it's more like 400, but most days it only changes by about 200.

Sure! It's indexing only the 5 remaining sites: sinnsofattraction.blogspot.com; aaronsleazy.blogspot.com; Kshatriyas-anglobitch.blogspot.com; fastseduction.com; angloamerica101.wordpress.com. The depth was set to 3. These are big sites with a lot of links, but earlier rounds had similar sites and didn't take this long, so I'm wondering if something else is going on.

boogheta commented 3 years ago

I tried running crawls at depth 2 on these 5 websites (note that kshatriyas-anglobitch actually appears to be gone, by the way) on one of our Hyphe instances, and it seems to behave normally: 7750 pages crawled already, 7725 of them indexed, and it keeps running. So it looks to me more like a problem with your setup than with Hyphe itself. Is your corpus very big? Can you report the numbers from the 4 "Current activity" blocks at the top of the Overview tab? Also, how fast is your server? How much RAM and how many CPUs do you have available?
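If the specs aren't obvious, a quick way to check is from a shell on the server, using standard Linux commands (assuming you have shell access):

    # total and available memory, human-readable
    free -h
    # number of CPU cores
    nproc
    # load averages, to see whether the machine is already saturated
    uptime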

mere13 commented 3 years ago

That's so weird. The corpus is pretty big (roughly 90k web entities), but more than half of that is from rounds 1 and 2.

sinnsofattraction.blogspot.com: PI 8250, PC 8392, pages 60908, links 12636362
aaronsleazy.blogspot.com: PI 5500, PC 8714, pages 32730, links 11660242
Kshatriyas-anglobitch.blogspot.com: PI 5500, PC 8052, pages 38188, links 14108582 (I just checked and it's still up for me)
fastseduction.com: PI 7750, PC 13381, pages 13332, links 251649
angloamerica101.wordpress.com: PI 25750, PC 27724, pages 171669, links 43024582

I'm on the university servers, and I'm not actually sure of the RAM and CPUs. In hindsight, I should have run it on our team's dedicated server, because I know the RAM etc. on that one, but I didn't want to tie it up for other researchers, and the first round went super fast so I didn't think it would be an issue.

I definitely don't think it's a Hyphe issue. I'm actually wondering if (1) it's taking this long because these particular sites are so big, and/or (2) one of these sites is blocking it and causing a snag?

mere13 commented 3 years ago

Also, a huge THANK YOU for checking that on your end!

boogheta commented 3 years ago

The Kshatriyas-anglobitch.blogspot.com thing is really weird: on my side it even tells me the domain is available and that I could register to take it over. But anyway, this shouldn't be related to your issue.

I would have preferred the global numbers for the corpus from the OVERVIEW menu, but the figures you give make me think the problem might come from a crazy number of links found in the webpages of these blogs (11 million links within only 5500 pages for aaronsleazy, for instance). So I guess the indexing part takes a very long time to register all these links inside our memory structure, and is maybe also slowed down by a heavily loaded server.

I don't really know what to suggest apart from waiting, but that does seem quite long. We did encounter something similar with other blogspots in the past; it seems some of their pages include thousands of links for whatever reason. Sorry, it looks like you've unfortunately hit one of Hyphe's limitations :(

mere13 commented 3 years ago

That is so strange. I wonder if I'm seeing a cached version when I look at Kshatriyas-anglobitch.blogspot.com, but that seems highly unlikely.

Ah, sorry about that. The global numbers are 1,672,147 crawled pages; 1,658,884 indexed pages; 13,263 unindexed pages; 13,766,136 pages found; 234,626,600 links found.

So far, I have 595 set to IN, 886 UNDECIDED, 50,596 set to OUT, and 38,817 DISCOVERED and counting.

Yeah, the waiting does seem long. I'm also not sure how to resolve this in a more timely way. I don't think I can justify just stopping the indexing, but I'm also stalled on my work and not sure of the best path.

Honestly, I don't think it's a limitation of Hyphe. It's a great tool. Once I'm no longer living on a grad student stipend, I'll definitely donate to the project. :)

boogheta commented 3 years ago

All right, so yes, this confirms my fear: 250 million links found is really big for only ~600 crawled entities. I think what happens is that, because each batch takes a long time, it tries to rebuild the links every time it has indexed a batch of 200 pages, and the link building takes even more time, slowing everything down completely. If you have access to the server and can edit the Python code running it, I can point you in a direction to bypass link calculation for the time being and run it only once everything has finished indexing, which would speed things up a bit; but if you are on a Docker install I can't guarantee you can do that easily.

mere13 commented 3 years ago

Ah ha, that makes complete sense. I wondered about the massive number of links for a relatively small number of pages, but this is my first HNA, so it didn't raise any major red flags for me. Lesson learned for the future, I suppose. :)

Unfortunately, I'm on Docker this time. Am I stuck in this case?

boogheta commented 3 years ago

If you know a bit about playing around with Docker volumes and the unix command line, it is doable but not trivial: the idea is to open a shell inside the hyphe_backend container, then open the core.tac file with a text editor like nano or vim and make the following change: comment out lines 2200 to 2205 of the file (add a # at the beginning of each line) and add an "if False:" instead, so that the code looks like this:

#        if pages_crawled and self.corpora[corpus]['recent_changes'] and self.corpora[corpus]['pages_queued'] < 25000 and (
#            # pagesqueue is empty
#            not self.corpora[corpus]['pages_queued'] or
#            # links were not built since more than 8 times the time it takes
#            (s - self.corpora[corpus]['last_links_loop'] > 8 * self.corpora[corpus]['links_duration'])
#          ):
        if False:

Then reboot the container and restart the corpus. That should speed up the indexing quite a bit; when it's over, redo the same procedure to revert to the original code and restart again, and it should spend a long while building the links and then finally be done.
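For reference, a rough sketch of that whole procedure from the host, assuming the backend container is named hyphe_backend_1 and core.tac lives under /app/hyphe_backend/ inside it (both names are assumptions; adjust to your install):

    # open a shell inside the backend container (check docker ps for the real name)
    docker exec -it hyphe_backend_1 bash
    # inside the container: locate core.tac if the path differs
    find / -name core.tac 2>/dev/null
    # edit lines 2200-2205 as described above, with whichever editor is available
    nano /app/hyphe_backend/core.tac
    exit
    # back on the host: restart (not recreate) the container so the edit survives
    docker restart hyphe_backend_1

Note that docker restart keeps the container's writable layer, so the edited file persists across the reboot; recreating the container from the image (e.g. with docker-compose up --force-recreate) would discard the change.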

mere13 commented 3 years ago

Okay, thank you. I'll give this a go later today. I appreciate your help!