fridex commented 7 years ago

Description

README files are not guaranteed to be present on GitHub for every project (moreover, we do not have a GitHub URL for every project, as it is optional). It would therefore be good to gather as many free-text sources as possible and run keyword extraction on them using the existing data mining algorithm implemented in tagger.

One candidate for the NPM ecosystem is NPM's project description page (see serve-static as an example). We should aggregate only the text, without any HTML markup, so that we change the code only once if the page changes over time and do not have to maintain different text "parsers".
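A minimal sketch of the "text only, no HTML markup" idea, using just the Python standard library. The names `TextExtractor` and `html_to_text` are hypothetical and are not taken from the worker code; the sketch only assumes that the page arrives as an HTML string.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Text inside skipped elements is dropped; everything else is kept.
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace left behind by the removed markup.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Return the text of an HTML document with all markup removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Because the function does not look for specific elements or CSS classes, a layout change on the page should not require a code change. The trade-off is that navigation and footer text end up in the output as well, and text split by inline tags is joined with a space.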

Acceptance criteria

[x] a new Selinon task is created that gathers the NPM description (or, ideally, one task that determines sources based on ecosystem); this task is run in the package-level flow
[x] the newly introduced Selinon task extracts only the project text, without any HTML markup, and places it on S3 using a Selinon S3 adapter derived from the already existing S3 adapter
[x] the already existing keywords_tagging task is adjusted to also extract keywords from the gathered NPM description, and the result is placed on S3 (as now)
[x] code is tested using unit tests
[x] code runs in production and is functional
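
The "one task that determines sources based on ecosystem" variant from the first criterion could be sketched as a plain lookup table. This is only an illustration: `DESCRIPTION_SOURCES` and `description_url` are hypothetical names, the sketch does not use any Selinon API, and the only URL pattern assumed is the npmjs.com package page.

```python
from urllib.parse import quote

# Hypothetical registry: ecosystem name -> URL template of a free-text page.
# Only NPM is listed; other ecosystems would be added here.
DESCRIPTION_SOURCES = {
    "npm": "https://www.npmjs.com/package/{name}",
}


def description_url(ecosystem, name):
    """Return the description page URL, or None if no source is known."""
    template = DESCRIPTION_SOURCES.get(ecosystem.lower())
    if template is None:
        return None
    # Keep "@" and "/" unescaped so scoped names such as "@scope/pkg" survive.
    return template.format(name=quote(name, safe="@/"))
```

A task built this way would skip ecosystems that return `None`, so supporting another ecosystem means adding a table entry rather than a new task.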

krishnapaparaju commented 7 years ago

What are the current data sources being used for gathering NPM-based tags?

fridex commented 7 years ago

What are the current data sources being used for gathering NPM-based tags?

Generic tags gathered from StackOverflow and GitHub can be used. The plan is to implement a collector for the NPM tags that are present at npmjs.com (e.g. see "keywords" at https://www.npmjs.com/package/express). The work is described in https://github.com/openshiftio/openshift.io/issues/712

tuxdna commented 7 years ago

AFAIK, the keywords for NPM packages are already present in S3.

Here is an example:

https://s3.amazonaws.com/bayesian-bayesian-core-data/npm/argparse/1.0.9/metadata.json

      "keywords": [
        "cli",
        "parser",
        "argparse",
        "option",
        "args"
      ],

We can still process free text (README, description, about, etc. fields) to gather more keywords.

fridex commented 7 years ago

AFAIK, the keywords for NPM packages are already present in S3.

Yes, that's true. We extract them using mercator for a single package. The main purpose of this work item is to collect all keywords present in the NPM ecosystem to date, so we can tag packages that do not have explicitly associated tags.

We can still process free text (README, description, about, etc. fields) to gather more keywords.

This task is about collecting the description present on npmjs.com, which we do not do right now. We already collect README files and descriptions (if available).
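
To illustrate how keywords collected across the ecosystem could be used to tag packages without explicit tags, here is a rough sketch. It is not the algorithm implemented in tagger: `build_vocabulary` and `suggest_tags` are hypothetical helpers, and the plain token matching is an assumption made for the example.

```python
import re
from collections import Counter


def build_vocabulary(declared_keywords, min_count=1):
    """Build a keyword set from a {package: [keywords]} mapping.

    Keywords are lowercased; only those declared at least min_count
    times across all packages are kept.
    """
    counts = Counter(
        kw.lower() for keywords in declared_keywords.values() for kw in keywords
    )
    return {kw for kw, n in counts.items() if n >= min_count}


def suggest_tags(text, vocabulary):
    """Return vocabulary keywords that occur as whole tokens in free text."""
    # Tokens are alphanumeric runs, optionally joined by ".", "+", "#" or "-".
    tokens = set(re.findall(r"[a-z0-9]+(?:[.+#-][a-z0-9]+)*", text.lower()))
    return sorted(tokens & vocabulary)
```

For example, with a vocabulary built from the argparse keywords shown earlier, `suggest_tags("A tiny CLI option parser.", vocabulary)` would yield `cli`, `option` and `parser`. Raising `min_count` drops keywords that only a single package declares.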

fridex commented 7 years ago

The implementation sits in f8a_worker/workers/repository_description.py (some bits are shared with https://github.com/openshiftio/openshift.io/issues/727); tests are ready to be merged in https://github.com/fabric8-analytics/fabric8-analytics-worker/pull/333. The code already runs in production.

fridex commented 7 years ago

Closing, as done.

openshiftio / openshift.io

[3] Gather free text present at NPM project description page and collect keywords present #728
