perkeep / perkeep

Perkeep (née Camlistore) is your personal storage system for life: a way of storing, syncing, sharing, modelling and backing up content.
https://perkeep.org/
Apache License 2.0
6.49k stars 447 forks

A pipeline (or fanout) architecture for creating index annotations #734

Open 9nut opened 8 years ago

9nut commented 8 years ago

Based on my limited understanding (mainly from reading through the index package), a mechanism for applying several indexers, in sequence or in parallel, to the same file type would be useful. Here's my concrete example: suppose I want to use a computer vision library or the Google Vision API to automatically create "what's in this picture" tags, in addition to what is currently being indexed.

edrex commented 8 years ago

If I understand correctly, this TODO at /pkg/index/corpus.go#L1173:

// TODO: make these pluggable, e.g. registered from an importer or something?

is about allowing importers to register extractors for the types they create (but maybe it's only about location info). Seems related. Update: Nope, that comment is about allowing importers to indirect location info to an associated permanode (via the foursquareVenuePermanode attr in the case of camliType: "foursquare.com:checkin" nodes).

One thing to keep in mind is that there are two ways to add annotations: as fields in the indexer, which is done for a small number of core fields (location, time, etc.), or as permanode attributes. You could implement an annotation pipeline out of process, as a client that writes permanode attributes. The main question is how you would keep track of which annotations came from which feature extractor, so that you could rebuild them when the feature extractor changes. The extractor name and version could be added as JSON fields on the attribute claims, so that you could round them up and delete them when you want to rebuild. This is what importers do currently, but at the level of permanodes rather than attribute claims.
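A minimal sketch of that bookkeeping, in Go. It assumes a simplified claim shape: `AttrClaim` is a hypothetical struct and not Perkeep's actual schema type, the `extractorName` and `extractorVersion` JSON fields are invented names for the idea above, and the blob refs are placeholders.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// AttrClaim is a simplified, hypothetical stand-in for an attribute claim.
// The extractorName and extractorVersion fields are assumptions; they are
// left empty for attributes set by a person rather than an extractor.
type AttrClaim struct {
	PermaNode        string `json:"permaNode"`
	Attribute        string `json:"attribute"`
	Value            string `json:"value"`
	ExtractorName    string `json:"extractorName,omitempty"`
	ExtractorVersion string `json:"extractorVersion,omitempty"`
}

// Stale returns the claims written by the named extractor at any version
// other than currentVersion. These are the ones to delete and regenerate.
func Stale(claims []AttrClaim, name, currentVersion string) []AttrClaim {
	var out []AttrClaim
	for _, c := range claims {
		if c.ExtractorName == name && c.ExtractorVersion != currentVersion {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	claims := []AttrClaim{
		{PermaNode: "sha224-aaa", Attribute: "tag", Value: "cat", ExtractorName: "vision", ExtractorVersion: "1"},
		{PermaNode: "sha224-aaa", Attribute: "tag", Value: "holiday"}, // added by hand, never touched
		{PermaNode: "sha224-bbb", Attribute: "tag", Value: "dog", ExtractorName: "vision", ExtractorVersion: "2"},
	}

	// Show what a tagged claim could look like on the wire.
	b, err := json.Marshal(claims[0])
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))

	// After upgrading the extractor to version "2", find what to rebuild.
	for _, c := range Stale(claims, "vision", "2") {
		fmt.Printf("rebuild %s=%s on %s\n", c.Attribute, c.Value, c.PermaNode)
	}
}
```

Keeping the provenance on each claim, rather than on the permanode, lets hand-added attributes and extractor-generated ones coexist on the same permanode without the rebuild touching the former.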