medialab / hyphe

Websites crawler with built-in exploration and control web interface
http://hyphe.medialab.sciences-po.fr/demo/
GNU Affero General Public License v3.0

API is sometimes really slow #428

Closed stijn-uva closed 2 years ago

stijn-uva commented 2 years ago

Apologies for the vague issue, but we have a problem where the API sometimes takes over a second per call, and sometimes (for the same corpus) responds in a fraction of a second. This concerns the store.get_webentities endpoint in particular, but also others. It does not seem to depend on whether Hyphe is crawling anything; in fact, as far as I can see, nothing particularly demanding is running in Hyphe when this occurs. It usually goes away after a restart, but that's not ideal...

Do you have any ideas about what could be causing this and where we might look to address this?

boogheta commented 2 years ago

What's strange imho is that it goes away after a restart. It would make sense for some routes, such as specific calls to get_webentity_pagelinks_network or get_webentity_pages, to take some time in the traph depending on the corpus and its content, but then the slowness should be consistent. I see you changed the issue to point to get_webentities instead, in which case I can think of another potential source of the problem: collecting DISCOVERED webentities and trying to automatically set a homepage for them. Depending on the fields you need, you can try to bypass this slow operation by using the light or semilight arguments available in most get_webentities routes. Could you be more precise about the corresponding calls?

Could you enable DEBUG in your config, set it to 2, and paste the full logs from the query to the answer when you encounter it again?

stijn-uva commented 2 years ago

Thanks, will log for a while and keep an eye on them when this occurs, more news later! :-)

stijn-uva commented 2 years ago

Oh, and the specific call parameters here are:

call = ["store.get_webentities", [], 0, 100, "::page::", False, False, False, self.corpus_id]

Where ::page:: is replaced by an incrementing number until all web entities are collected. So we are indeed requesting the full web entity details; that's something I can also experiment with.

boogheta commented 2 years ago

But do you actually need the homepage field? Otherwise, switching to False, True, False might help. Also note that this is not the proper way to paginate with this API: you should collect a token from the first request, then call get_webentities_page(token, page_number, False, corpus) for the following pages, as explained in the documentation.
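
A minimal sketch of the token-based pagination described above, not a definitive client. It makes several assumptions that are not confirmed in this thread: `rpc` is a hypothetical callable that sends one API call and returns the decoded result as a dict, the paging route is exposed as `store.get_webentities_page`, and the result carries `webentities`, `token` and `last_page` fields. The positional parameters of the first call are copied from the call posted above, with the flags switched to False, True, False. Check the names against the API documentation before relying on them.

```python
from typing import Any, Callable, Dict, List

# Hypothetical transport: any callable taking a method name plus positional
# parameters and returning the decoded API result. Not part of Hyphe itself.
RpcCall = Callable[..., Dict[str, Any]]


def collect_webentities(rpc: RpcCall, corpus_id: str, count: int = 100) -> List[Dict[str, Any]]:
    """Collect all web entities using a pagination token instead of
    re-running the full query for every page."""
    # First request: same positional parameters as the call posted in this
    # thread (page 0), with the suggested False, True, False flags.
    first = rpc("store.get_webentities", [], 0, count, 0, False, True, False, corpus_id)
    entities = list(first["webentities"])

    # Assumed field names: "token" identifies the stored query,
    # "last_page" is the index of the final page.
    token = first.get("token")
    last_page = first.get("last_page") or 0
    if not token:
        # Everything fit in the first page, nothing more to fetch.
        return entities

    # Following pages: reuse the token rather than repeating the query.
    for page in range(1, last_page + 1):
        result = rpc("store.get_webentities_page", token, page, False, corpus_id)
        entities.extend(result["webentities"])
    return entities
```

Passing the transport in as `rpc` keeps the sketch independent of how the API is actually reached, so the paging loop can be tried against a stub before pointing it at a real instance.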

boogheta commented 2 years ago

Hey @stijn-uva, I'm closing this one for now, but please reopen it if you still encounter similar problems.

stijn-uva commented 2 years ago

Hey @boogheta, sorry for never following up on this. We're not having this problem anymore since we started using the pagination feature added to the API a while ago, so it seems safe to close it indeed!