Description
In order to create a fully automatic tagging pipeline, we would like to automatically collect tags for various ecosystems. Some of these collectors are already implemented in tagger, others are planned (e.g. https://github.com/openshiftio/openshift.io/issues/712, https://github.com/openshiftio/openshift.io/issues/710).
These collectors should run periodically and automatically gather tags from external resources. A good starting point would be to create tag-gathering tasks that use the tagger library API to call the tag/keyword collectors. These tasks could then be grouped into a Selinon flow that is run periodically from the jobs service based on YAML configuration.
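A minimal sketch of what one such collector task could look like. It uses the real Selinon `SelinonTask` base class, but the tagger entry point (`tagger.collectors.get_collector`) and the task/collector names are assumptions; the actual tagger API call will likely differ:

```python
from selinon import SelinonTask


class StackOverflowTagsCollectorTask(SelinonTask):
    """Collect tags/keywords from one external resource via the tagger library."""

    COLLECTOR_NAME = 'StackOverflow'  # one task per collector type

    def run(self, node_args):
        # hypothetical tagger entry point -- substitute the real collector API
        from tagger.collectors import get_collector

        collector = get_collector(self.COLLECTOR_NAME)
        keywords = collector.execute(**(node_args or {}))

        # the returned dict is persisted by the storage adapter configured
        # for this task in the Selinon YAML configuration
        return {'collector': self.COLLECTOR_NAME, 'keywords': keywords}
```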
Gathered topics should be placed in the GitHub tags repo or on S3 so they can be used automatically to feed the PGM model. As of now topics are stored on GitHub, but that is probably not well suited to the tagging pipeline - we should discuss moving them to S3 (with implementation of an appropriate topics S3 adapter in the analytics core).
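A rough sketch of what such a topics S3 adapter could look like, assuming the existing S3 adapter in the analytics core is importable as `f8a_worker.storages.AmazonS3` and offers a `store_dict` helper; the class name, import path and helper method are assumptions to be checked against the real code:

```python
from f8a_worker.storages import AmazonS3  # assumed location of the existing S3 adapter


class S3CollectedTopics(AmazonS3):
    """Selinon storage adapter that keeps gathered/aggregated topics on S3."""

    def store(self, node_args, flow_name, task_name, task_id, result):
        # key topics by ecosystem so the PGM model can consume them directly
        object_key = '{ecosystem}/topics.json'.format(
            ecosystem=(node_args or {}).get('ecosystem', 'all'))
        self.store_dict(result, object_key)  # assumed helper on the base adapter
        return object_key

    def retrieve(self, flow_name, task_name, task_id):
        # not needed by the tagging pipeline; topics are only written here
        raise NotImplementedError()
```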
One of the last tasks in the topics gathering job should aggregate topics to compute synonyms and drop less relevant keywords. This should be done via a tagger library API call and should be transparently configurable at the source code level, as we expect to tweak parameters based on overall tagging results. Topic aggregation should be done per ecosystem (ecosystem-specific tags), and each ecosystem-specific aggregation should run in a separate task.
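For illustration, a per-ecosystem aggregation task might look roughly like this, with the tunable parameters kept as module-level constants so they can be tweaked in source. `tagger.aggregate_topics` and its parameters are assumptions, not the actual tagger API:

```python
from selinon import SelinonTask

# tweak these constants in source code based on overall tagging results
SYNONYM_THRESHOLD = 0.8
MIN_KEYWORD_OCCURRENCE = 3


class MavenTopicsAggregationTask(SelinonTask):
    """Aggregate maven-specific topics: compute synonyms, drop rare keywords."""

    ECOSYSTEM = 'maven'

    def run(self, node_args):
        # hypothetical aggregation entry point in the tagger library
        from tagger import aggregate_topics

        return aggregate_topics(
            ecosystem=self.ECOSYSTEM,
            synonym_threshold=SYNONYM_THRESHOLD,
            min_occurrence=MIN_KEYWORD_OCCURRENCE,
        )
```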
Acceptance criteria
[ ] group each tagger collector library call into a Selinon task - one task per collector type
[ ] create one Selinon flow with all collectors - e.g. collectingTagsFlow (see the sketch of such a flow configuration after this list)
[ ] if more suitable, move gathered and aggregated topics to S3, implementing an appropriate Selinon storage adapter by deriving from the existing S3 adapter
[ ] make sure topics are aggregated per ecosystem
[ ] make sure the tags gathering job runs periodically in production and is fully functional
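To make the flow item above more concrete, here is a sketch of what the Selinon YAML configuration for collectingTagsFlow could look like. Task and storage names plus import paths are purely illustrative, and the periodic scheduling itself would stay in the jobs service configuration:

```yaml
tasks:
  - name: 'StackOverflowTagsCollectorTask'
    import: 'f8a_jobs.tagging_tasks'      # illustrative module path
    storage: 'S3CollectedTopics'
  - name: 'MavenTopicsAggregationTask'
    import: 'f8a_jobs.tagging_tasks'
    storage: 'S3CollectedTopics'

storages:
  - name: 'S3CollectedTopics'
    import: 'f8a_jobs.storages'

flows:
  - 'collectingTagsFlow'

flow-definitions:
  - name: 'collectingTagsFlow'
    edges:
      # run the collector first, then the ecosystem-specific aggregation
      - from:
        to: 'StackOverflowTagsCollectorTask'
      - from: 'StackOverflowTagsCollectorTask'
        to: 'MavenTopicsAggregationTask'
```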