openshiftio / openshift.io

Red Hat OpenShift.io is an end-to-end development environment for planning, building and deploying modern applications.
https://openshift.io

[3] Gather free text present at NPM project description page and collect keywords present #728

Closed: fridex closed this issue 7 years ago

fridex commented 7 years ago

Description

README files are not always present on GitHub for every project (and we do not have a GitHub URL for every project, as it is optional), so it would be good to gather as many free-text sources as possible and run keyword extraction on them using the existing data mining algorithm implemented in tagger.

One candidate for the NPM ecosystem is NPM's project description page (see serve-static as an example). We should aggregate only the text, without any HTML markup, so that we change the code only once if the page changes over time and do not have to maintain different text "parsers".
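To illustrate the "text only, no HTML markup" idea, here is a minimal sketch using only Python's standard `html.parser` module. The class and function names (`TextOnlyParser`, `html_to_text`) are hypothetical and are not taken from the worker code; the sketch also assumes the page HTML has already been downloaded.

```python
from html.parser import HTMLParser


class TextOnlyParser(HTMLParser):
    """Collect visible text and drop all markup (hypothetical helper)."""

    # Content of these elements is never shown to a reader, so skip it.
    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the chunks and collapse the whitespace left by removed tags.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Return the free text of an HTML document without any markup."""
    parser = TextOnlyParser()
    parser.feed(html)
    parser.close()
    return parser.text()


page = (
    "<div><h1>serve-static</h1><script>var x = 1;</script>"
    "<p>Serve &amp; cache <b>static</b> files</p></div>"
)
print(html_to_text(page))
```

Because the extractor does not depend on specific tags or CSS classes, a layout change on the page would not by itself require a code change, which is the motivation given above.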

Acceptance criteria

krishnapaparaju commented 7 years ago

What are the current data sources being used for gathering NPM-based tags?

fridex commented 7 years ago

> What are the current data sources being used for gathering NPM-based tags?

Generic tags gathered from StackOverflow and GitHub can be used. The plan is to implement a collector for the NPM tags present on npmjs.com (e.g. see "keywords" at https://www.npmjs.com/package/express). The work is described in https://github.com/openshiftio/openshift.io/issues/712

tuxdna commented 7 years ago

AFAIK, the keywords for NPM packages are already present in S3.

Here is an example:

https://s3.amazonaws.com/bayesian-bayesian-core-data/npm/argparse/1.0.9/metadata.json

      "keywords": [
        "cli",
        "parser",
        "argparse",
        "option",
        "args"
      ],

We can still process free-text fields (READMEs, description, about, etc.) to gather more keywords.
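As a rough sketch of how the "keywords" list shown in the excerpt could be read from such a metadata document: the full layout of metadata.json is not shown in this thread, so the helper below (a hypothetical `find_keywords`) simply searches for a "keywords" key at any depth. The `details` wrapper in the sample is invented for illustration.

```python
import json


def find_keywords(node):
    """Return every string listed under a "keywords" key, at any depth."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "keywords" and isinstance(value, list):
                found.extend(item for item in value if isinstance(item, str))
            else:
                found.extend(find_keywords(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_keywords(item))
    return found


# Hypothetical document: only the "keywords" list mirrors the excerpt above;
# the surrounding "details" wrapper is an assumption.
sample = json.loads("""
{"details": [{"keywords": ["cli", "parser", "argparse", "option", "args"]}]}
""")
print(find_keywords(sample))
```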

fridex commented 7 years ago

> AFAIK, the keywords for NPM packages are already present in S3.

Yes, that's true. We extract them using mercator for a single package. The main purpose of this work item is to collect all keywords present in the NPM ecosystem to date, so that we can tag packages that do not have explicitly associated tags.
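The idea of reusing ecosystem-wide keywords to tag packages without explicit tags can be sketched as follows. This is a naive vocabulary lookup written for illustration, not the algorithm implemented in tagger; the function names and the `min_count` threshold are assumptions.

```python
import re
from collections import Counter


def build_vocabulary(keyword_lists, min_count=1):
    """Keep keywords declared by at least `min_count` packages."""
    counts = Counter()
    for keywords in keyword_lists:
        # Normalise case and count each keyword once per package.
        counts.update({kw.strip().lower() for kw in keywords if kw.strip()})
    return {kw for kw, n in counts.items() if n >= min_count}


def tag_text(text, vocabulary):
    """Return vocabulary keywords that occur as whole tokens in the text.

    Only single-token keywords can match; multi-word keywords are ignored.
    """
    tokens = set(re.findall(r"[a-z0-9]+(?:[-.+#][a-z0-9]+)*", text.lower()))
    return sorted(tokens & vocabulary)


# Keyword lists as they might be gathered from packages that declare them.
known = [["cli", "parser", "argparse"], ["CLI", "http"], ["parser"]]
vocabulary = build_vocabulary(known, min_count=2)

# Free text of a package that declares no keywords of its own.
print(tag_text("A command-line parser for CLI tools.", vocabulary))
```

Raising `min_count` filters out one-off keywords, so that only terms shared across several packages are used as tags.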

> We can still process free-text fields (READMEs, description, about, etc.) to gather more keywords.

This task is about collecting the description present on npmjs.com, which we do not do right now. We already collect README files and descriptions (if available).

fridex commented 7 years ago

The implementation sits in f8a_worker/workers/repository_description.py (some bits are shared with https://github.com/openshiftio/openshift.io/issues/727); tests are ready to be merged in https://github.com/fabric8-analytics/fabric8-analytics-worker/pull/333. The code already runs in production.

fridex commented 7 years ago

Closing, as done.