openshiftio / openshift.io

Red Hat OpenShift.io is an end-to-end development environment for planning, building and deploying modern applications.
https://openshift.io

[f8a] It's possible to ingest the same package multiple times, under different names #3339

Closed msrb closed 6 years ago

msrb commented 6 years ago

Planner ID: # 2099

Description

Package names in PyPI are case insensitive, i.e. PyJWT and pyjwt are the same package in the PyPI world. We normalize Python package names when an analysis request comes in, but later we seem to work with the original package name that was given to us by the requester. This means that we can analyze the same package multiple times and probably end up with multiple entries in the graph database.
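
For reference, PEP 503 defines the canonical form of a PyPI package name; a minimal sketch of that rule (not necessarily the exact code the worker uses) looks like this:

import re

def canonicalize_pypi_name(name):
    """Return the PEP 503 canonical form of a PyPI package name.

    Runs of '-', '_' and '.' collapse to a single '-' and the result is
    lowercased, so 'PyJWT' and 'pyjwt' both map to 'pyjwt'.
    """
    return re.sub(r"[-_.]+", "-", name).lower()

assert canonicalize_pypi_name("PyJWT") == canonicalize_pypi_name("pyjwt") == "pyjwt"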

It's possible that this issue also affects other ecosystems, not just PyPI. We need to check and either fix the issue for the other ecosystems as well (if easy), or create separate issues so we can tackle them later.

msrb commented 6 years ago

This is where we try to normalize package names now: https://github.com/fabric8-analytics/fabric8-analytics-worker/blob/6b9dd98004b1fcc63ae3257cc182dc2b146b05d2/f8a_worker/workers/init_analysis_flow.py#L32

But it clearly doesn't work as expected.

tuxdna commented 6 years ago

PyPI package distribution is case insensitive - https://stackoverflow.com/questions/26503509/is-pypi-case-sensitive

However, it is possible that in subsequent steps the package name that is used comes from metadata generated by Mercator. I will investigate where that might happen.

tuxdna commented 6 years ago

Encountered this error:

FatalTaskError: ("No content was found at '%s' for PyPI package '%s'", 'pyjwt')
  File "selinon/task_envelope.py", line 114, in run
    result = task.run(node_args)
  File "f8a_worker/base.py", line 54, in run
    result = self.execute(node_args)
  File "f8a_worker/workers/repository_description.py", line 86, in execute
    return collector(self, arguments['name'])
  File "f8a_worker/workers/repository_description.py", line 53, in collect_pypi
    raise FatalTaskError("No content was found at '%s' for PyPI package '%s'", name)

Looks like the HTML structure of pypi.org has changed, so the parsing fails to find the repository description.

Here: https://github.com/fabric8-analytics/fabric8-analytics-worker/blob/6b9dd98004b1fcc63ae3257cc182dc2b146b05d2/f8a_worker/workers/repository_description.py#L50
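
As a side note, one way to avoid depending on the pypi.org HTML structure would be to read the metadata from the PyPI JSON API instead; a rough sketch (fetch_pypi_project_urls is a hypothetical helper, not the current worker code):

import requests

def fetch_pypi_project_urls(name):
    """Fetch package metadata from the PyPI JSON API and return its project URLs.

    The JSON API (https://pypi.org/pypi/<name>/json) is more stable than the
    HTML pages that the current collector parses.
    """
    resp = requests.get("https://pypi.org/pypi/{}/json".format(name), timeout=10)
    resp.raise_for_status()
    return resp.json()["info"].get("project_urls") or {}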

tuxdna commented 6 years ago

I analyzed both pyjwt/1.6.1 and PyJWT/1.6.1.

curl -XPOST http://localhost:32000/api/v1/component-analyses/pypi/PyJWT/1.6.1
{}
curl -XPOST http://localhost:32000/api/v1/component-analyses/pypi/pyjwt/1.6.1
{}

After a while, when both ingestions had completed, I see only one package and one version entry in the graph. I haven't been able to reproduce the issue with these steps.

Let me try something more.

msrb commented 6 years ago

I was testing this via the jobs service.

tuxdna commented 6 years ago

Local setup is giving the following errors with the latest Docker images:

coreapi-jobs            | + f8a-jobs.py initjobs
coreapi-jobs            | Traceback (most recent call last):
coreapi-jobs            |   File "/usr/bin/f8a-jobs.py", line 11, in <module>
coreapi-jobs            |     from f8a_jobs.scheduler import Scheduler
coreapi-jobs            |   File "/usr/lib/python3.4/site-packages/f8a_jobs/scheduler.py", line 20, in <module>
coreapi-jobs            |     import f8a_jobs.handlers as handlers
coreapi-jobs            |   File "/usr/lib/python3.4/site-packages/f8a_jobs/handlers/__init__.py", line 17, in <module>
coreapi-jobs            |     from .nuget_popular_analyses import NugetPopularAnalyses
coreapi-jobs            |   File "/usr/lib/python3.4/site-packages/f8a_jobs/handlers/nuget_popular_analyses.py", line 8, in <module>
coreapi-jobs            |     from f8a_worker.solver import NugetReleasesFetcher
coreapi-jobs            |   File "/usr/lib/python3.4/site-packages/f8a_worker/solver.py", line 10, in <module>
coreapi-jobs            |     from pip._internal.req.req_file import parse_requirements
coreapi-jobs            | ImportError: No module named 'pip._internal'

Due to the above error I am not able to run ingestion on my system.

Apparently, many others have been facing the same issue with the latest pip3 recently.
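
If this is the pip 10 reorganization (parse_requirements moved under pip._internal), a version-tolerant import along these lines is a possible workaround; this is a sketch, not the fix that was actually applied:

# Hedged workaround: try the pip >= 10 location first, fall back to older pip.
try:
    from pip._internal.req.req_file import parse_requirements  # pip >= 10
except ImportError:
    from pip.req import parse_requirements  # pip < 10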

tuxdna commented 6 years ago

This time I used the Jobs API to schedule the analyses for pyjwt and PyJWT, and the issue has been reproduced locally.

I can see the entries in Postgres, Minio and Graph.

tuxdna commented 6 years ago

The issue is that the Jobs API sends parameters directly to the flow engine without performing case normalization.

Here is how to inspect this issue.

Step 0

Run a component analysis for some package to ensure the S3 buckets for package and version data are initialized (this is only required if you are starting afresh):

curl -XPOST "http://localhost:32000/api/v1/component-analyses/pypi/urllib3/1.22"
{}

Wait for analyses to complete.

Step 1

Now invoke component analysis for pyjwt:

curl -XPOST "http://localhost:32000/api/v1/component-analyses/pypi/pyjwt/1.6.1"
{}

Wait for analyses to complete.

Go to the Jobs UI and check the bookkeeping data for this package.

You will see all the workers that were executed for pyjwt.

Step 2

Finally run component analysis for PyJWT:

curl -XPOST "http://localhost:32000/api/v1/component-analyses/pypi/PyJWT/1.6.1"
{}

Wait for analyses to complete.

Go to the Jobs UI and check the bookkeeping data for this package.

You will see that no workers were executed for PyJWT:

{
  "error": "No result found."
}

Step 3

Now run the analyses using the Jobs UI: http://localhost:34000/api/v1/jobs/selective-flow-scheduling?state=running

{
  "flow_arguments": [
    {
      "ecosystem": "pypi",
      "force": true,
      "force_graph_sync": true,
      "name": "pyjwt",
      "recursive_limit": 0,
      "version": "1.6.1"
    },
    {
      "ecosystem": "pypi",
      "force": true,
      "force_graph_sync": true,
      "name": "PyJWT",
      "recursive_limit": 0,
      "version": "1.6.1"
    }
  ],
  "flow_name": "bayesianFlow",
  "run_subsequent": false,
  "task_names": []
}

Wait for analyses to complete.

This time, if you check the worker data for PyJWT as in Step 2 above, you will see that there is an entry for this package, but no workers were run for PyJWT:

{
  "summary": {
    "analysed_versions": [
      "1.6.1"
    ],
    "ecosystem": "pypi",
    "package": "PyJWT",
    "package_level_workers": [],
    "package_version_count": 1
  }
}

However, if you check the worker data for pyjwt as in Step 1 above, you will see that workers were run for this package.

This clearly means that case normalization is happening properly at the worker level; the duplicate PyJWT entry comes from the unnormalized name submitted by the Jobs API, not from the workers.

Summary

Solution

Perform case normalization before submitting jobs via the Jobs API, or perform case normalization at the topmost node of the flow.
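
For illustration, the scheduling side could normalize each flow argument before handing it to the flow engine; a minimal sketch with a hypothetical normalize_package_name helper (the actual fix may differ):

import re

def normalize_package_name(ecosystem, name):
    """Hypothetical helper: canonicalize a package name per ecosystem.

    For pypi this is the PEP 503 rule; other ecosystems are passed through
    unchanged until their casing rules are confirmed.
    """
    if ecosystem == "pypi":
        return re.sub(r"[-_.]+", "-", name).lower()
    return name

def normalize_flow_arguments(flow_arguments):
    """Normalize the 'name' in every argument dict before scheduling a flow."""
    for args in flow_arguments:
        args["name"] = normalize_package_name(args["ecosystem"], args["name"])
    return flow_arguments

With that in place, both entries in the Step 3 payload would be scheduled as pyjwt, so only one package entry could end up in the graph.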

sivaavkd commented 6 years ago

@humaton any update on this bug? We have been carrying this Sev2 issue across sprints. cc @msrb

humaton commented 6 years ago

@sivaavkd there is a fix for it here: fabric8-analytics/fabric8-analytics-jobs/pull/287

But this issue will still be present in any future code that schedules flows directly without going through the jobs or server API.