medialab / hyphe

Websites crawler with built-in exploration and control web interface
http://hyphe.medialab.sciences-po.fr/demo/
GNU Affero General Public License v3.0

'DISCOVERED' Web entities can exist without pages, according to API #354

Closed by stijn-uva 5 years ago

stijn-uva commented 5 years ago

After creating and starting a corpus with the following API calls:

CALL: '{"jsonrpc":"2.0","id":"5db180ffb0d8e5.14937038","method":"list_corpus","params":[]}'
CALL: '{"jsonrpc":"2.0","id":"5db180ffb1f155.75289802","method":"create_corpus","params":["ic-1-36"]}'
CALL: '{"jsonrpc":"2.0","id":"5db181072fd3a0.99407688","method":"start_corpus","params":["ic-1-36"]}'
CALL: '{"jsonrpc":"2.0","id":"5db1810734c7a8.22095331","method":"ping","params":["ic-1-36",5]}'
CALL: '{"jsonrpc":"2.0","id":"5db18108affd16.10274562","method":"get_corpus_options","params":["ic-1-36"]}'
CALL: '{"jsonrpc":"2.0","id":"5db18108b099d6.78391921","method":"set_corpus_options","params":["ic-1-36",{"defaultCreationRule":"subdomain"}]}'
CALL: '{"jsonrpc":"2.0","id":"5db1810ec02b75.95668205","method":"start_corpus","params":["ic-1-36"]}'
CALL: '{"jsonrpc":"2.0","id":"5db1810eeab977.75702241","method":"ping","params":["ic-1-36",5]}'
CALL: '{"jsonrpc":"2.0","id":"5db18110718737.11728454","method":"get_corpus_options","params":["ic-1-36"]}'
CALL: '{"jsonrpc":"2.0","id":"5db1811071fe34.60086515","method":"set_corpus_options","params":["ic-1-36",{"max_depth":1}]}'
CALL: '{"jsonrpc":"2.0","id":"5db18110737b00.27223262","method":"start_corpus","params":["ic-1-36"]}'
CALL: '{"jsonrpc":"2.0","id":"5db1811077a4f5.87714785","method":"ping","params":["ic-1-36",5]}'
CALL: '{"jsonrpc":"2.0","id":"5db181117835e8.67996205","method":"get_corpus_options","params":["ic-1-36"]}'
CALL: '{"jsonrpc":"2.0","id":"5db18111799964.37778059","method":"declare_pages","params":[["http:\\/\\/test1.issuecrawler.net","http:\\/\\/test2.issuecrawler.net","http:\\/\\/test3.issuecrawler.net"],"ic-1-36"]}'
CALL: '{"jsonrpc":"2.0","id":"5db18111b0d627.45607864","method":"crawl_webentity","params":[1,1,false,"IN",{},"ic-1-36"]}'
CALL: '{"jsonrpc":"2.0","id":"5db18111cac4d7.09084135","method":"crawl_webentity","params":[2,1,false,"IN",{},"ic-1-36"]}'
CALL: '{"jsonrpc":"2.0","id":"5db18111dab7e8.70411405","method":"crawl_webentity","params":[3,1,false,"IN",{},"ic-1-36"]}'
CALL: '{"jsonrpc":"2.0","id":"5db18111ea10a9.68664525","method":"start_corpus","params":["ic-1-36"]}'
CALL: '{"jsonrpc":"2.0","id":"5db18111ece899.61361201","method":"ping","params":["ic-1-36",5]}'
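The sequence above can be sketched as payload construction in Python. Only the method names and parameter order are taken from the logged calls; the helper name and the use of `uuid` for request ids are illustrative assumptions, and the sketch only builds the request bodies without sending them.

```python
import json
import uuid

def jsonrpc_payload(method, params):
    """Build a JSON-RPC 2.0 request body for the Hyphe core API.

    Hypothetical helper: the id scheme (a uuid hex string) is an
    assumption; Hyphe only requires that ids be unique per request.
    """
    return json.dumps({
        "jsonrpc": "2.0",
        "id": uuid.uuid4().hex,
        "method": method,
        "params": params,
    })

# Reproducing the essential steps of the sequence logged above:
corpus = "ic-1-36"
calls = [
    jsonrpc_payload("create_corpus", [corpus]),
    jsonrpc_payload("start_corpus", [corpus]),
    jsonrpc_payload("set_corpus_options", [corpus, {"defaultCreationRule": "subdomain"}]),
    jsonrpc_payload("declare_pages", [["http://test1.issuecrawler.net"], corpus]),
    jsonrpc_payload("crawl_webentity", [1, 1, False, "IN", {}, corpus]),
]
```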

After waiting for all crawls to complete, web entities exist that have no pages associated with them:

CALL: '{"jsonrpc":"2.0","id":"5db181919c1648.55797009","method":"store.get_webentity_pages","params":[4,true,"ic-1-36"]}'
string(43) "{
    "code": "success",
    "result": []
}"

This is the data for the web entity itself, according to the API:

CALL: '{"jsonrpc":"2.0","id":"5db181d05e0280.83722695","method":"store.get_webentity","params":[4,"ic-1-36"]}'
string(958) "{
    "code": "success",
    "result": [
        {
            "status": "DISCOVERED",
            "pages_total": 0,
            "undirected_degree": 3,
            "crawling_status": "UNCRAWLED",
            "indegree": 3,
            "tags": [],
            "outdegree": 0,
            "startpages": [],
            "creation_date": 1571914006623,
            "prefixes": [
                "s:http|h:net|h:digitalmethods|h:wiki|",
                "s:http|h:net|h:digitalmethods|h:wiki|h:www|",
                "s:https|h:net|h:digitalmethods|h:wiki|",
                "s:https|h:net|h:digitalmethods|h:wiki|h:www|"
            ],
            "indexing_status": "UNINDEXED",
            "crawled": false,
            "last_modification_date": 1571914006623,
            "pages_crawled": 0,
            "_id": 4,
            "homepage": "https:\/\/wiki.digitalmethods.net",
            "id": 4,
            "name": "wiki.Digitalmethods.net"
        }
    ]
}"
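The `prefixes` values above use Hyphe's LRU notation, in which the scheme comes first and the host labels are reversed so that entities group by domain hierarchy. The following reimplementation is an assumption inferred from the values in this response, not Hyphe's actual code, and ignores ports and paths.

```python
from urllib.parse import urlparse

def url_to_lru_prefix(url):
    """Convert a URL to an LRU-style prefix as seen in the "prefixes"
    field above (inferred format: "s:<scheme>|h:<tld>|h:<domain>|...|")."""
    parts = urlparse(url)
    pieces = ["s:" + parts.scheme]
    # Reverse the host labels: "wiki.digitalmethods.net" -> net, digitalmethods, wiki
    pieces += ["h:" + label for label in reversed(parts.hostname.split("."))]
    return "|".join(pieces) + "|"

url_to_lru_prefix("https://wiki.digitalmethods.net")
# -> "s:https|h:net|h:digitalmethods|h:wiki|"
```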
boogheta commented 5 years ago

It does not return any page because the onlyCrawled argument of store.get_webentity_pages is set to True, while this web entity has not been crawled yet. So there are indeed, so far, no crawled pages for the wiki web entity (which comes from the redirection of the https prefix of the other ones).
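The distinction described here can be sketched by comparing the two request bodies. The parameter order `[webentity_id, onlyCrawled, corpus]` is taken from the call logged earlier in this issue; the variable names and fixed request id are illustrative.

```python
def get_webentity_pages_payload(webentity_id, only_crawled, corpus):
    """Build the store.get_webentity_pages request body.

    only_crawled=True restricts the result to pages actually fetched by
    a crawl; for a DISCOVERED/UNCRAWLED entity that is always [].
    """
    return {
        "jsonrpc": "2.0",
        "id": "example",
        "method": "store.get_webentity_pages",
        "params": [webentity_id, only_crawled, corpus],
    }

# The call from the issue: onlyCrawled=True on an UNCRAWLED entity -> empty result.
crawled_only = get_webentity_pages_payload(4, True, "ic-1-36")
# Passing False instead would ask for every page known for the entity.
all_pages = get_webentity_pages_payload(4, False, "ic-1-36")
```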