Closed: fridex closed this issue 7 years ago
What are the current data sources being used for gathering NPM-based tags?
> What are the current data sources being used for gathering NPM-based tags?
Generic tags gathered from StackOverflow and GitHub can be used. The plan is to implement a collector for the NPM tags that are present at npmjs.com (e.g. see "keywords" at https://www.npmjs.com/package/express). The work is described in https://github.com/openshiftio/openshift.io/issues/712
AFAIK, the keywords for NPM packages are already present in S3.
Here is an example:
https://s3.amazonaws.com/bayesian-bayesian-core-data/npm/argparse/1.0.9/metadata.json
"keywords": [
    "cli",
    "parser",
    "argparse",
    "option",
    "args"
],
We can still process free text (READMEs, description, about, etc. fields) to gather more keywords.
> AFAIK, the keywords for NPM packages are already present in S3.
Yes, that's true. We extract them using mercator for a single package. The main purpose of this work item is to collect all keywords present in the NPM ecosystem to date, so that we can tag packages that do not have explicitly associated tags.
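As a rough illustration of that idea (not the actual collector), the sketch below counts keywords across per-package metadata records shaped like the `metadata.json` example above, then looks those known keywords up in the free text of a package that has none. The function names `collect_keywords` and `tag_from_text` are hypothetical, and the whole-word matching is an assumption made for the example; it is much simpler than the data mining done in tagger.

```python
import re
from collections import Counter


def collect_keywords(metadata_records):
    """Count in how many packages each keyword appears.

    Each record is assumed to be a dict with an optional "keywords" list,
    as in the metadata.json example.
    """
    counts = Counter()
    for record in metadata_records:
        keywords = record.get("keywords")
        if not isinstance(keywords, list):
            continue  # missing or malformed "keywords" field
        # Normalize and de-duplicate within one package before counting.
        counts.update(
            {kw.strip().lower() for kw in keywords
             if isinstance(kw, str) and kw.strip()}
        )
    return counts


def tag_from_text(text, known_keywords):
    """Return the known keywords that occur as whole words in free text."""
    words = set(re.findall(r"[a-z0-9]+(?:[-.][a-z0-9]+)*", text.lower()))
    return sorted(kw for kw in known_keywords if kw in words)


if __name__ == "__main__":
    records = [
        {"keywords": ["cli", "parser", "argparse", "option", "args"]},
        {"keywords": ["CLI", "parser "]},
        {"name": "package-without-keywords"},
    ]
    known = collect_keywords(records)
    print(known.most_common(2))
    print(tag_from_text("A small parser for CLI tools.", known))
```

Counting each keyword once per package keeps a single package with repeated keywords from skewing the ecosystem-wide frequencies.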
> We can still process free text (READMEs, description, about, etc. fields) to gather more keywords.
This task is about collecting the descriptions present on npmjs.com, which we do not do right now. We already collect README files and descriptions (if available).
The implementation sits in f8a_worker/workers/repository_description.py (some bits are shared with https://github.com/openshiftio/openshift.io/issues/727); tests are ready to be merged in https://github.com/fabric8-analytics/fabric8-analytics-worker/pull/333. The code already runs in production.
Closing, as done.
Description
README files are not necessarily present on GitHub for each project (moreover, we do not have a GitHub URL for each project, as it is optional). It would therefore be good to gather as many free-text sources as possible and run keyword extraction on them using the existing data mining algorithm implemented in tagger.
One candidate for the NPM ecosystem is NPM's project description page (see serve-static as an example). We should aggregate only the text, without any HTML markup, so that we change the code only once if the page changes over time and do not have to maintain different text "parsers".
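A minimal sketch of the "text only, no HTML markup" step, using only Python's standard `html.parser` module. This is an illustration under assumptions, not the code in the worker: the names `_TextExtractor` and `html_to_text` are hypothetical, and fetching the page from npmjs.com is left out.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text and ignore tags, scripts and styles."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside <script> or <style>.
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Join the chunks and collapse all runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Return the plain text of an HTML fragment, without markup."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    page = "<div><h1>serve-static</h1><p>Serve <b>static</b> files.</p></div>"
    print(html_to_text(page))
```

Because the extractor does not depend on specific element names or CSS classes, a layout change on the page should not require a different parser. The trade-off is that navigation and footer text end up in the output as well, so the keyword extraction has to tolerate some noise.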
Acceptance criteria
The keywords_tagging task is adjusted to also extract keywords from the gathered NPM description, and the result is placed on S3 (as now).