medialab / hyphe

Websites crawler with built-in exploration and control web interface
http://hyphe.medialab.sciences-po.fr/demo/
GNU Affero General Public License v3.0
329 stars, 59 forks

"Node has no left sibling" when calling `paginate_webentity_pagelinks` #462

Open dale-wahl opened 2 years ago

dale-wahl commented 2 years ago

I'm running into an issue getting the links from a particular webentity. I keep receiving a "Node has no left sibling" message instead. I'm assuming it has to do with a particular link since I'm able to get the first set of links from the webentity. Is there any way I can go find the culprit to remove it and collect the rest of the links for the network? Thanks!

Backend Docker container logs:

2022-06-16 08:02:53+0000 [DEBUG - QUERY from MYIP, 192.168.0.3] {u'params': [10393, 10, u'31607|230|0#2Zm', False, u'ic-2-356581'], u'method': u'store.paginate_webentity_pagelinks_network'}
2022-06-16 08:02:53+0000 [INFO - ic-2-356581] Traph client query: paginate_webentity_pagelinks [10393, ["s:https|h:org|h:immunize|h:www|", "s:http|h:org|h:immunize|", "s:http|h:org|h:immunize|h:www|"]] {"include_outbound": false, "pagination_token": "0#2Zm", "source_page_count": 10}
2022-06-16 08:02:53+0000 [INFO - ic-2-356581] Traph server answer: {"query": "paginate_webentity_pagelinks", "code": "success", "result": {"done": false, "token": "0#9pv", "count_pagelinks": 1187, "count_sourcepages": 10, "pagelinks": [["s:https|h:org|h:immunize|h:www|p:vw|", "s:https|h:org|h:immunize|h:www|p:vw|p:|", 1], ["s:https|h:org|h:immunize|h:www|p:vw|p:archive.asp|", "s:https|h:org|h:immunize|h:www|p:vw|", 1], ["s:https|h:org|h:immunize|h:www|p:vw|p:|", "s:https|h:org|h:immunize|h:www|p:vax-and-covid-19|p:|", 1], ["s:https|h:org|h:immunize|h:www|p:vw|p ... [132994 cars truncated]
2022-06-16 08:02:53+0000 [DEBUG - ANSWER] store.paginate_webentity_pagelinks_network: "{\"jsonrpc\": \"2.0\", \"result\": {\"code\": \"success\", \"result\": {\"token\": \"32794|240|0#9pv\", \"links\": [[\"s:https|h:org|h:immunize|h:www|p:vw|\", \"s:https|h:org|h:immunize|h:www|p:vw|p:|\", 1], [\"s:https|h:org|h:immunize|h:www|p:vw|p:archive.asp|\", \"s:https|h:org|h:immunize|h:www|p:vw|\", 1], [\"s:https|h:org|h:immunize|h:www|p:vw|p:|\", \"s:https|h:org|h:immunize|h:www|p:vax-and-covid-19|p:|\", 1], [\"s:https|h:org|h:immunize|h:www|p:vw|p:|\", \"s:https|h:org|h:immunize|h:www|p:laws|p:exemptions.asp|\", 1], [\"s:https|h:org|h:immunize|h:www|p:vw|p:|\", \"s:https|h:org|h:immunize|h:www|p:news|\", 1], [\"s:https|h:org|h:immunize|h:www|p:vw|p:|\", \"s:https|h:org|h:immunize|h:www|p:subscribe|\", 1], [\"s:https|h:org|h:immunize|h:www|p:vw|p:|\", \"s:https|h:org|h:immunize|h:www|p:acip|\", 1], [\"s:https|h:org|h:immunize|h:www|p:vw|p:|\", \"s:https|h:org|h:immunize|h:www|p:shop|\", 1], [\"s:https|h:org|h:immunize|h:www|p:vw|p:|\", \"s:https|h:org|h:immunize|h:www|p:mening ... [137206 cars truncated]
2022-06-16 08:02:53+0000 [DEBUG - QUERY from MYIP, 192.168.0.3] {u'params': [10393, 10, u'32794|240|0#9pv', False, u'ic-2-356581'], u'method': u'store.paginate_webentity_pagelinks_network'}
2022-06-16 08:02:53+0000 [INFO - ic-2-356581] Traph client query: paginate_webentity_pagelinks [10393, ["s:https|h:org|h:immunize|h:www|", "s:http|h:org|h:immunize|", "s:http|h:org|h:immunize|h:www|"]] {"include_outbound": false, "pagination_token": "0#9pv", "source_page_count": 10}
2022-06-16 08:02:53+0000 [INFO - ic-2-356581] Traph server answer: {"query": {"args": [10393, ["s:https|h:org|h:immunize|h:www|", "s:http|h:org|h:immunize|", "s:http|h:org|h:immunize|h:www|"]], "method": "paginate_webentity_pagelinks", "kwargs": {"include_outbound": false, "pagination_token": "0#9pv", "source_page_count": 10}}, "message": "Node has no left sibling.", "code": "fail"}
2022-06-16 08:02:53+0000 [DEBUG - ANSWER] store.paginate_webentity_pagelinks_network: "{\"jsonrpc\": \"2.0\", \"result\": {\"query\": {\"args\": [10393, [\"s:https|h:org|h:immunize|h:www|\", \"s:http|h:org|h:immunize|\", \"s:http|h:org|h:immunize|h:www|\"]], \"method\": \"paginate_webentity_pagelinks\", \"kwargs\": {\"include_outbound\": false, \"pagination_token\": \"0#9pv\", \"source_page_count\": 10}}, \"message\": \"Node has no left sibling.\", \"code\": \"fail\"}, \"id\": null}"
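
To make the page links in these logs easier to scan for a suspicious entry, the LRU strings can be turned back into ordinary URLs. The following is a minimal sketch, not Hyphe's own conversion code. The `s:`, `h:` and `p:` stems are taken from the logs above; handling of `t:` (port), `q:` (query) and `f:` (fragment) is an assumption about the LRU convention, and `lru_to_url` is a hypothetical helper name.

```python
def lru_to_url(lru):
    """Rebuild a readable URL from an LRU string like those in the logs.

    Assumed stem meanings: s=scheme, h=host part (stored in reverse order),
    p=path segment, plus t=port, q=query, f=fragment (the last three are
    assumptions, they do not appear in the logs above).
    """
    scheme, host, port, path, query, fragment = "", [], "", [], "", ""
    # Drop the trailing separator, then read each "kind:value" stem in order.
    for stem in lru.rstrip("|").split("|"):
        kind, _, value = stem.partition(":")
        if kind == "s":
            scheme = value
        elif kind == "h":
            host.insert(0, value)  # host parts are stored from TLD to subdomain
        elif kind == "t":
            port = value
        elif kind == "p":
            path.append(value)  # an empty "p:" stem stands for a trailing slash
        elif kind == "q":
            query = value
        elif kind == "f":
            fragment = value
    url = scheme + "://" + ".".join(host)
    if port:
        url += ":" + port
    if path:
        url += "/" + "/".join(path)
    if query:
        url += "?" + query
    if fragment:
        url += "#" + fragment
    return url


print(lru_to_url("s:https|h:org|h:immunize|h:www|p:vw|p:archive.asp|"))
```

This ignores escaping, so a stem whose value itself contains a `|` would not be split correctly.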

boogheta commented 2 years ago

Hello Dale, sorry, that's a first. We would need to investigate a bit to understand what's happening. Is your corpus big? Could you try to share its traph data with us? You should have a traph-data directory either in your hyphe directory or under the DATA_PATH you might have set in your .env file, and it should contain one directory per corpus id. An alternative would be to share a dump of your corpus' pages collection from the mongodb container, using mongodump -d "hyphe_CORPUSID" -c pages within the container.

dale-wahl commented 2 years ago

It's not my biggest network... But it's the first time I've seen this! 25k webentities.

Here is the traph-data directory. Let me know if you want the mongo dump as well.

boogheta commented 2 years ago

Hello Dale, the file requires access authorization. I requested it yesterday but haven't received it yet.

dale-wahl commented 2 years ago

Hey @boogheta, I accepted your request for access a while ago. Just commenting here in case you didn't see the notification.

boogheta commented 2 years ago

Hello Dale, yes, I got it. @Yomguithereal started looking at it, but we don't have many leads yet. If you're in a hurry, I'm afraid you should probably restart the corpus from scratch (which is a bit of a pain, I know... :/ ).

dale-wahl commented 2 years ago

No rush. I'll look at load and see if I can re-collect.

I do wish I could skip the one link (or page) and collect the rest of the network, but am unsure how to do that.
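
For reference, here is a rough sketch of a client loop that at least keeps every link collected before the failure and records the token on which the error occurs. It does not skip the failing page. The method name, the parameter order and the `code` / `result` / `message` / `token` / `links` fields are taken from the logs above; starting with a `None` token, treating an empty token as the end of pagination, and the endpoint URL are assumptions, and `jsonrpc_call` and `collect_links` are hypothetical helper names.

```python
import json
import urllib.request


def jsonrpc_call(endpoint, method, params):
    """POST one JSON-RPC 2.0 request and return its decoded 'result' member."""
    payload = json.dumps(
        {"jsonrpc": "2.0", "id": 1, "method": method, "params": params}
    ).encode("utf-8")
    request = urllib.request.Request(
        endpoint, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))["result"]


def collect_links(call, webentity_id, corpus, page_count=10, token=None):
    """Page through a webentity's links until the end or the first failure.

    `call(method, params)` must return the API answer as a dict.
    Returns (links, failing_token, error_message); the last two are None
    when pagination finished without error.
    """
    links = []
    while True:
        # Parameter order as seen in the logs:
        # [webentity_id, source_page_count, token, include_outbound, corpus]
        answer = call(
            "store.paginate_webentity_pagelinks_network",
            [webentity_id, page_count, token, False, corpus],
        )
        if answer.get("code") != "success":
            # Keep what was collected and report where it broke.
            return links, token, answer.get("message")
        result = answer["result"]
        links.extend(result.get("links", []))
        token = result.get("token")
        if not token:  # assumption: no token means no more pages
            return links, None, None


# Example wiring (the URL is a placeholder, adjust to your own instance):
# call = lambda method, params: jsonrpc_call("http://localhost/api/", method, params)
# links, failing_token, error = collect_links(call, 10393, "ic-2-356581")
```

Passing the transport in as `call` keeps the loop testable without a running Hyphe instance, and the returned token shows exactly which request triggers "Node has no left sibling."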