medialab / hyphe

Websites crawler with built-in exploration and control web interface
http://hyphe.medialab.sciences-po.fr/demo/
GNU Affero General Public License v3.0

Delete Crawled Entity #398

Closed mere13 closed 3 years ago

mere13 commented 3 years ago

Is it possible to delete a crawled entity (thus, also removing all its associated links/entities)? I don't see this option, but I'm new to Hyphe. For context, I inadvertently set a major blogging site to IN and only realized once all the crawls (n=88) were complete. I don't want this or the associated links that arose from it in my corpus. It doesn't seem that changing its status to OUT removes the links from Discovered?

Thanks for your help!

boogheta commented 3 years ago

Hi @mere13. Hyphe's structure is such that deleting a webentity does not make much sense: it exists, and there are links to it or from it which contribute to the structure of the others. Setting it to OUT is the proper way of working: when you do so, links coming from it or to it will still exist and be visible when enabling OUT entities inside the network view, but they won't be counted in the degree totals anymore (it might take a few seconds to propagate). See for example here in the demo: I crawled medialab.sciencespo.fr then set it as OUT, and in the PROSPECT view everything has a cited count of 0 https://hyphe.medialab.sciences-po.fr/demo/#/project/test-1/prospect Although it makes me realise there might be a bug in the network view since it won't display the discovered entities there... I'll investigate this. (edit: nope, it's normal actually, it's just because I forgot to remove the filter on disconnected entities in the dropdown menu on the right) (second edit: still, they look disconnected, which is not appropriate; I will open a separate issue for this)

mere13 commented 3 years ago

Thank you for the quick reply! Yes, agreed... deleting it doesn't make much sense. However, when I set the entity to OUT, it doesn't seem to impact the number of DISCOVERED prospects, which is strange because it crawled a lot of pages (tens of thousands). I would have expected the number of DISCOVERED prospects to be reduced in this case; maybe it is the bug you mention?

This actually brings up a related question. I am currently on round two of crawling (so, started from seed sites A and B; set all of the DISCOVERED entities to IN, OUT or UNDECIDED; crawled all those set to IN). Now, I want to set the newly discovered round 2 entities. I have a list of almost 36K, which seems to include many entities I'd already set in round one but doesn't note them as being IN, OUT, etc. Is that normal? Am I missing a filter I should have on? It would be helpful if the 2K entities I've already coded didn't have to be coded again.

Thank you, again.

boogheta commented 3 years ago

Ha ok, I understand the problem: you're not using Hyphe the way it was conceived ;) The idea of Hyphe is to selectively curate your corpus using the PROSPECT tab by only picking entities that make sense for your study, so you should never include all discovered entities. If you want to do an unfiltered automatic crawl, you should rather use other tools designed for this, such as IssueCrawler or VOSON. This is why it is normal that the entities which were discovered by entities set to OUT are not removed: they still exist within Hyphe's memory, but their degree is 0, and so in the prospection, which ranks entities by degree, they will only appear at the bottom of the list.
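To make the ranking behaviour described above concrete, here is a toy model in Python. It is only a sketch and not Hyphe's actual implementation: the function names `cited_counts` and `rank_discovered`, the status strings, and the `*.example` entities are all hypothetical, and it assumes that a link simply stops being counted once its source entity is set to OUT.

```python
from collections import Counter


def cited_counts(links, status):
    """Count incoming links per target, skipping links whose source is OUT.

    `links` is a list of (source, target) pairs and `status` maps each
    entity name to a status string. Both shapes are assumptions made for
    this illustration, not Hyphe's data model.
    """
    counts = Counter()
    for source, target in links:
        if status.get(source) == "OUT":
            continue  # links from OUT entities still exist but are not counted
        counts[target] += 1
    return counts


def rank_discovered(links, status):
    """Return DISCOVERED entities sorted by descending count, then by name."""
    counts = cited_counts(links, status)
    discovered = [name for name, s in status.items() if s == "DISCOVERED"]
    return sorted(discovered, key=lambda name: (-counts[name], name))


if __name__ == "__main__":
    links = [
        ("blog.example", "a.example"),
        ("blog.example", "b.example"),
        ("seed.example", "a.example"),
    ]
    status = {
        "seed.example": "IN",
        "blog.example": "IN",
        "a.example": "DISCOVERED",
        "b.example": "DISCOVERED",
    }
    print(rank_discovered(links, status))

    # Setting the blog to OUT keeps b.example in the list, with a count of 0.
    status["blog.example"] = "OUT"
    print(rank_discovered(links, status), cited_counts(links, status)["b.example"])
```

In this model nothing is ever deleted: an entity discovered only through an OUT entity stays in the list, but with a count of 0 it sorts to the bottom.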

edit: I encourage you to read/watch the tutorials listed at the bottom of the readme to properly understand the principles behind Hyphe and how to build corpora with it :)

mere13 commented 3 years ago

I have watched the videos and gone through the wiki, and I'm pretty sure I'm using it correctly (I think, although now you've got me concerned). I may have misstated and caused some confusion. For example, I had seed sites A and B. Let's say CDC and WebMD (as an example). From those two seed sites, I ended up with 2000 DISCOVERED entities. Of those entities, I went through and decided 88 of them were relevant to my study related to vaccines. I then crawled those 88 vaccine-related entities but not the other 1912. That second crawl resulted in 35,000 DISCOVERED entities. Now, I need to go through those and decide which are relevant to my study on vaccines. I don't mind the OUT still being in the corpus-- that's actually great. I was just wondering if there's a way to not have to "remember" the 1912 already set to OUT in the new (now, round three) crawl. [edit]: I'll keep playing with it. Thanks for your help, and have a great day!

boogheta commented 3 years ago

All right, then what you describe is good!

I first understood you wanted to include all discovered entities at each step, which would be humanly impossible using Hyphe, but if you do the iterative selection step by step then it's perfect. You should just know that you don't have to go through and review absolutely all discovered entities. A methodological approach to this would be to fix a limit and browse all discovered entities with a degree above this value, and maybe use the search input to look among the rest for specific words in their URLs, but that's it.
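The threshold-plus-search approach above can be sketched outside Hyphe, for instance on an exported list of discovered entities. This is a minimal illustration under stated assumptions: the function `select_for_review` and the `url` and `degree` keys are hypothetical and do not correspond to Hyphe's API or export format.

```python
def select_for_review(entities, min_degree, keywords=()):
    """Pick the discovered entities worth reviewing by hand.

    Keeps every entity whose degree is strictly above `min_degree`, plus
    any remaining entity whose URL contains one of `keywords`
    (case-insensitive). Input order is preserved.
    """
    lowered = [keyword.lower() for keyword in keywords]
    selected = []
    for entity in entities:
        if entity["degree"] > min_degree:
            selected.append(entity)
        elif any(keyword in entity["url"].lower() for keyword in lowered):
            selected.append(entity)
    return selected


if __name__ == "__main__":
    discovered = [
        {"url": "http://a.example/vaccine-info", "degree": 1},
        {"url": "http://b.example/", "degree": 12},
        {"url": "http://c.example/", "degree": 2},
    ]
    for entity in select_for_review(discovered, min_degree=5, keywords=["vaccine"]):
        print(entity["url"], entity["degree"])
```

Where to set the limit is a judgment call for each study; the point is only that the low-degree tail is searched by keyword rather than read in full.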

And regarding OUT, like I said: no worries. Setting your entity to OUT stopped its links to the corresponding discovered entities from being counted, which lowers them in the prospection list. Even though they still exist, you will most probably not explore them.

Have a good time with Hyphe!

Best,

(I'm closing this issue but don't hesitate and reopen it if you encounter more problems)