Namespace Crawls and Datasets by Project

nasa-jpl-memex / memex-explorer

Viewers for statistics and dashboarding of Domain Search Engine data

BSD 2-Clause "Simplified" License

Namespace Crawls and Datasets by Project #623

Open brittainhard opened 9 years ago

brittainhard commented 9 years ago

Right now every crawl and dataset name must be unique, because names are not associated with any project. We could allow crawls and datasets with the same name in different projects by basing the namespace on the project name.

tonyfast commented 9 years ago

I would suggest namespacing the URL too.

If the crawl space is foo and the crawler is called bar, then you could access it as

http://explorer.continuum.io/explore/foo/bar, or something like that.
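A minimal sketch of how such a namespaced URL could be assembled, assuming the project and crawl are both identified by slugs. The helper name `explore_url` is hypothetical, and the base URL is simply the one from the example above.

```python
from urllib.parse import quote

# Base URL copied from the example above; treat it as a placeholder.
BASE_URL = "http://explorer.continuum.io/explore"


def explore_url(project_slug, crawl_slug, base_url=BASE_URL):
    """Build a project-namespaced URL of the form <base>/<project>/<crawl>."""
    # quote(..., safe="") also escapes "/", so a slug cannot add path segments.
    parts = [quote(project_slug, safe=""), quote(crawl_slug, safe="")]
    return "/".join([base_url.rstrip("/")] + parts)


print(explore_url("foo", "bar"))
# http://explorer.continuum.io/explore/foo/bar
```

With this shape, two crawls called bar in different projects resolve to different URLs because the project slug comes first in the path.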

brittainhard commented 9 years ago

The easiest thing here is to control the name of the index as it's created. For both nutch and ache we can supply a custom name, so the name of the index can reflect its related project.

After this, it seems to me that I can append some key/values to the index indicating creation date, index type (crawl or dataset), and crawler type. This seems like a fairly straightforward and quick way to do it. What do you guys think, @tonyfast @ahmadia @kriehl?
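A rough sketch of what those key/values might look like, assuming they are collected in a plain dictionary before being attached to the index. The function name `index_metadata` and the field names are hypothetical, and how the dictionary would actually be stored in Elasticsearch is left open here.

```python
from datetime import datetime, timezone

INDEX_TYPES = {"crawl", "dataset"}
CRAWLER_TYPES = {"nutch", "ache"}


def index_metadata(index_type, crawler=None, created=None):
    """Return the key/values describing one index (hypothetical field names)."""
    if index_type not in INDEX_TYPES:
        raise ValueError("index_type must be one of %s" % sorted(INDEX_TYPES))
    # Only crawls have a crawler type; datasets leave it as None.
    if index_type == "crawl" and crawler not in CRAWLER_TYPES:
        raise ValueError("crawler must be one of %s" % sorted(CRAWLER_TYPES))
    created = created or datetime.now(timezone.utc)
    return {
        "created": created.isoformat(),
        "index_type": index_type,
        "crawler": crawler,
    }


print(index_metadata("crawl", crawler="nutch"))
```

Keeping these values as separate fields, rather than packing them into the index name, means each one can be queried on its own.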

As for changing the foreign key relations of crawl, crawlmodel, etc, that seems like a separate issue.

brittainhard commented 9 years ago

@kriehl @ahmadia

So I’m working on this PR: https://github.com/memex-explorer/memex-explorer/pull/647

I realized that the simplest solution would be to just add the crawler type into the name, as well as the project name. Right now it looks like this:

@property
def index_name(self):
    # <crawl slug>_<project slug>_<crawler type>
    return "%s_%s_%s" % (self.slug, self.project.slug, self.crawler)

This is added when the index is created. Aron had the idea of creating a separate index that contains info about each index we create and its associated project and crawl.

This is really the simplest fix I can come up with to this problem. Let me know if this is sufficient.
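To show the effect of the naming scheme outside of Django, here is a self-contained sketch. The `Project` and `Crawl` classes are plain stand-ins for the real models, and the attribute names `slug`, `project`, and `crawler` are assumed from the snippet above.

```python
class Project:
    """Stand-in for the project model; only the slug matters here."""

    def __init__(self, slug):
        self.slug = slug


class Crawl:
    """Stand-in for the crawl model with the proposed index_name property."""

    def __init__(self, slug, project, crawler):
        self.slug = slug
        self.project = project
        self.crawler = crawler  # assumed to be "nutch" or "ache"

    @property
    def index_name(self):
        # <crawl slug>_<project slug>_<crawler type>
        return "%s_%s_%s" % (self.slug, self.project.slug, self.crawler)


a = Crawl("news", Project("alpha"), "nutch")
b = Crawl("news", Project("beta"), "ache")
print(a.index_name)  # news_alpha_nutch
print(b.index_name)  # news_beta_ache
```

Both crawls are called news, but their index names differ because the project slug is part of the name.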

ahmadia commented 9 years ago

My only comment on this was the danger of being unable to filter properly due to an incomplete separation of fields.

I don't fully understand how filters work in ES, but my initial idea would be that it would be easier to add this information into a separate "meta-index" of crawl information. If the information is only contained in the index name, then filtering becomes a bit sloppier, since projects and crawls can be named anything. If somebody had "ache", "nutch", or "dataset" in their project or crawl name, it would be harder to filter these types of indices.
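The concern can be illustrated with a small sketch. It assumes the `"%s_%s_%s"` naming scheme from earlier in the thread and that slugs may contain underscores. The meta-index is modelled as a plain list of dictionaries with hypothetical field names, not as an actual Elasticsearch index.

```python
def index_name(crawl_slug, project_slug, crawler):
    # Same scheme as proposed above: <crawl>_<project>_<crawler>
    return "%s_%s_%s" % (crawl_slug, project_slug, crawler)


# Problem 1: different (crawl, project) pairs can produce the same name
# when slugs themselves contain the separator.
assert index_name("a_b", "c", "nutch") == index_name("a", "b_c", "nutch")

# Problem 2: substring filtering on the name picks up unrelated indices.
names = [
    index_name("cache-test", "demo", "nutch"),  # contains "ache" inside "cache"
    index_name("news", "demo", "ache"),
]
by_substring = [n for n in names if "ache" in n]
print(len(by_substring))  # 2, although only one crawl used ache

# A meta-index keeps each field separate, so filtering is exact.
meta_index = [
    {"index": names[0], "project": "demo", "crawl": "cache-test",
     "index_type": "crawl", "crawler": "nutch"},
    {"index": names[1], "project": "demo", "crawl": "news",
     "index_type": "crawl", "crawler": "ache"},
]
by_field = [m["index"] for m in meta_index if m["crawler"] == "ache"]
print(by_field)  # ['news_demo_ache']
```

The index name can still carry the project and crawler for readability. The point of the separate record is that filters never have to parse it.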