medialab / hyphe

Websites crawler with built-in exploration and control web interface
http://hyphe.medialab.sciences-po.fr/demo/
GNU Affero General Public License v3.0

Web entities and crawl limits #8

Closed tommv closed 9 years ago

tommv commented 11 years ago

This is less of a bug report and more of an attempt to open a discussion. Currently the limits of a web entity and the limits of its crawl coincide. This is probably a good idea in most cases, but not necessarily in all of them.

Example: in our cartography of the climate adaptation debate, we have to deal with the website of the Food and Agriculture Organisation. Of course, we don't want to crawl this entire website, because it is too big and only a portion of it directly concerns climate adaptation. In fact, we are lucky, because it has a sub-directory dedicated to climate change (http://www.fao.org/climatechange/). Great! So we only want to crawl this directory. Still, this does not necessarily imply that we want to limit the entity to this folder. The FAO is a relatively unitary institution: someone who wants to cite an FAO study, for example, may as well cite the homepage of the FAO website and not necessarily the pages in the sub-directory.

What this example tries to illustrate is that sometimes we might want to define a larger web entity but only crawl a smaller portion of it (without necessarily reducing the size of the web entity). Could we think of a way to do this?
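To make the request concrete, here is a minimal sketch of what such a decoupling could look like. This is not Hyphe's actual data model — the `WebEntity` class, its `boundary_prefixes`/`crawl_prefixes` fields, and both methods are hypothetical names introduced purely for illustration:

```python
# Hypothetical model (NOT Hyphe's implementation): a web entity whose
# citation boundary is wider than its crawl scope, as in the FAO example.

class WebEntity:
    def __init__(self, name, boundary_prefixes, crawl_prefixes):
        self.name = name
        # URL prefixes that count as "this entity" (for citations/links)
        self.boundary_prefixes = boundary_prefixes
        # Narrower prefixes the crawler is actually allowed to fetch
        self.crawl_prefixes = crawl_prefixes

    def owns(self, url):
        # A link to this URL is attributed to the entity.
        return any(url.startswith(p) for p in self.boundary_prefixes)

    def should_crawl(self, url):
        # The crawler only fetches pages under the crawl prefixes.
        return any(url.startswith(p) for p in self.crawl_prefixes)

fao = WebEntity(
    "FAO",
    boundary_prefixes=["http://www.fao.org/"],
    crawl_prefixes=["http://www.fao.org/climatechange/"],
)

# A citation of the homepage still counts as FAO...
print(fao.owns("http://www.fao.org/"))  # True
# ...but only the climate-change sub-directory would be fetched.
print(fao.should_crawl("http://www.fao.org/"))  # False
print(fao.should_crawl("http://www.fao.org/climatechange/adaptation"))  # True
```

In this reading, links pointing anywhere under the boundary are aggregated into the entity, while the crawl frontier is confined to the narrower prefix list.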

boogheta commented 11 years ago

This is kind of a tricky case. It is somehow possible already: first define the sub-website and crawl it, which will generate a second, so-called "parent" web entity that won't be crawled; then merge the parent into the sub-entity, making them a single one (while only the sub-entity will have been crawled). Features to redefine and merge web entities aren't completely offered yet in the web interface, but these are possible functionalities.
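The workaround above can be simulated step by step. This sketch does not use Hyphe's real API — the `WebEntity` class and `merge` helper are hypothetical stand-ins for the declare/crawl/merge sequence being described:

```python
# Hypothetical simulation (NOT Hyphe's API) of the workaround:
# declare the sub-website, crawl it, then merge the auto-created
# parent entity into it.

class WebEntity:
    def __init__(self, prefixes):
        self.prefixes = list(prefixes)
        self.crawled = False

def merge(target, other, entities):
    # Absorb `other`'s prefixes into `target` and drop `other`;
    # `target` keeps its own crawl status.
    target.prefixes.extend(other.prefixes)
    entities.remove(other)

entities = []

# 1. Define the sub-website and crawl it.
sub = WebEntity(["http://www.fao.org/climatechange/"])
entities.append(sub)
sub.crawled = True

# 2. The crawl discovers links to the rest of fao.org, which ends up
#    as a separate, uncrawled "parent" entity.
parent = WebEntity(["http://www.fao.org/"])
entities.append(parent)

# 3. Merge the parent into the sub-entity: one entity remains,
#    covering the whole site, though only the sub-directory was fetched.
merge(sub, parent, entities)

print(len(entities))  # 1
print(sub.crawled)    # True
```

The end state matches what tommv asked for — a single entity spanning the whole site whose crawl covered only the sub-directory — at the cost of doing the steps in that particular order.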

jacomyma commented 11 years ago

I think that there is a non-technical discussion here. I will reopen this issue so that we have this discussion if needed.

The coincidence of a web entity and the limits of a crawl is intentional. We want to crawl the web, and we need to define the limits of a crawl. We tried to stick to users' needs, and users think in terms of websites (most of the time). To fit that need we implemented web entities. They are what you have crawled. Of course you can edit a web entity and then reach a state where a web entity is only partially crawled. But this is a side effect, and we want users to fix that situation so that every web entity is crawled. In other terms, web entities serve the purpose of helping users to manage their crawl.

Web entities are good because they are a simple way to cope with a difficult problem. This problem is to define the limits of a crawl so that we have meaningful entities even if the web is large, heterogeneous, and full of singularities (such as redirects). Web entities are the incarnation of a design strategy. We aim at presenting features in terms of results for the user. The user comes with a need: "I want to have a website in my data." We would rather say "Let's define this website (we call that a web entity) and then harvest it, knowing that it requires several steps" than "We have a harvesting feature requiring several steps, starting with the definition of what you call a website". The user searches for a way to achieve goals, and features must appear as answers to these goals. Web entities are our concept for leading the user to cope with the issues of crawling. As a design trick I find it quite efficient, since users seem to quickly understand the concept, while we are able to use it as a solution to different hard-to-design features. We just ask the user to keep believing that web entities are the result of the crawl, and then we lead the user to the different methodological questions of the crawl.

How is it that some users like you want to separate the crawl from a web entity? Maybe the concept of web entity is so transparent that people see different things in it. This is somehow a design success, since you accept the concept of web entity while discussing the issues of crawling. But you probably understand now that if we separate crawl settings from web entities, it leads us to a bigger issue about how to explain the issue of crawling to users. We can nevertheless explore this design space if you have ideas. Feel free to detail the system you would like to use!