Closed: msrb closed this issue 6 years ago
This is where we try to normalize package names now: https://github.com/fabric8-analytics/fabric8-analytics-worker/blob/6b9dd98004b1fcc63ae3257cc182dc2b146b05d2/f8a_worker/workers/init_analysis_flow.py#L32
But it clearly doesn't work as expected.
PyPI package distribution is case insensitive - https://stackoverflow.com/questions/26503509/is-pypi-case-sensitive
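For reference, PEP 503 defines the canonical name form: lowercase, with runs of '-', '_' and '.' collapsed into a single '-'. A minimal sketch of that rule (not the exact code in init_analysis_flow.py, just the normalization it is expected to perform):

import re

def normalize_pypi_name(name):
    """PEP 503 normalization: lowercase and collapse runs of '-', '_', '.'
    into a single '-'; 'PyJWT' and 'pyjwt' both map to 'pyjwt'."""
    return re.sub(r"[-_.]+", "-", name).lower()

# The same rule is exposed by packaging.utils.canonicalize_name.
assert normalize_pypi_name("PyJWT") == normalize_pypi_name("pyjwt") == "pyjwt"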
However, it is possible that in subsequent steps the package name that is used comes from metadata generated by Mercator. I will investigate where that might happen.
Encountered this error
FatalTaskError: ("No content was found at '%s' for PyPI package '%s'", 'pyjwt')
File "selinon/task_envelope.py", line 114, in run
result = task.run(node_args)
File "f8a_worker/base.py", line 54, in run
result = self.execute(node_args)
File "f8a_worker/workers/repository_description.py", line 86, in execute
return collector(self, arguments['name'])
File "f8a_worker/workers/repository_description.py", line 53, in collect_pypi
raise FatalTaskError("No content was found at '%s' for PyPI package '%s'", name)
Looks like the HTML structure of pypi.org has changed, so the parsing fails to find the repository description.
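Rather than scraping pypi.org HTML (which can change under us at any time), the repository links could be read from the PyPI JSON API. A hedged sketch assuming the JSON API's documented shape, not the current collect_pypi implementation:

import requests

def fetch_pypi_repo_links(name):
    """Fetch project metadata from the PyPI JSON API instead of parsing HTML.
    Returns the project_urls mapping (may contain 'Source', 'Homepage', ...)."""
    resp = requests.get("https://pypi.org/pypi/{}/json".format(name), timeout=10)
    resp.raise_for_status()
    info = resp.json()["info"]
    # project_urls can be null for older releases; fall back to home_page.
    return info.get("project_urls") or {"Homepage": info.get("home_page")}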
I analyzed both pyjwt/1.6.1 and PyJWT/1.6.1.
curl -XPOST http://localhost:32000/api/v1/component-analyses/pypi/PyJWT/1.6.1
{}
curl -XPOST http://localhost:32000/api/v1/component-analyses/pypi/pyjwt/1.6.1
{}
After a while, when both ingestions had completed, I saw only one package and one version entry in the graph. I haven't been able to reproduce the issue with these steps.
Let me try something more.
I was testing this via the jobs service.
My local setup is giving the following errors with the latest Docker images:
coreapi-jobs | + f8a-jobs.py initjobs
coreapi-jobs | Traceback (most recent call last):
coreapi-jobs | File "/usr/bin/f8a-jobs.py", line 11, in <module>
coreapi-jobs | from f8a_jobs.scheduler import Scheduler
coreapi-jobs | File "/usr/lib/python3.4/site-packages/f8a_jobs/scheduler.py", line 20, in <module>
coreapi-jobs | import f8a_jobs.handlers as handlers
coreapi-jobs | File "/usr/lib/python3.4/site-packages/f8a_jobs/handlers/__init__.py", line 17, in <module>
coreapi-jobs | from .nuget_popular_analyses import NugetPopularAnalyses
coreapi-jobs | File "/usr/lib/python3.4/site-packages/f8a_jobs/handlers/nuget_popular_analyses.py", line 8, in <module>
coreapi-jobs | from f8a_worker.solver import NugetReleasesFetcher
coreapi-jobs | File "/usr/lib/python3.4/site-packages/f8a_worker/solver.py", line 10, in <module>
coreapi-jobs | from pip._internal.req.req_file import parse_requirements
coreapi-jobs | ImportError: No module named 'pip._internal'
Due to the above error I am not able to run ingestion on my system.
Apparently, many others have been facing the same issue with the latest pip3 recently.
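pip 10 moved parse_requirements under pip._internal, and pip's internals are not a stable API. A hedged compatibility sketch (the fallback import reflects the pre-10 layout and is an assumption, not the actual fix applied in f8a_worker/solver.py):

# Guard the import so both pip layouts work.
try:
    # pip >= 10
    from pip._internal.req.req_file import parse_requirements
except ImportError:
    # pip < 10 exposed the same helper as pip.req.parse_requirements
    from pip.req import parse_requirements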
This time I used the Jobs API to schedule the analyses for pyjwt and PyJWT, and the issue has been reproduced locally.
I can see the entries in Postgres, Minio and Graph.
The issue is that the Jobs API sends parameters directly to the flow engine without performing case normalization.
Here is how to inspect this issue.
Run component analysis for some package to ensure initialization of the S3 buckets for package and version data (this is only required if you are starting afresh):
curl -XPOST "http://localhost:32000/api/v1/component-analyses/pypi/urllib3/1.22"
{}
Wait for analyses to complete.
Now invoke component analysis for pyjwt:
curl -XPOST "http://localhost:32000/api/v1/component-analyses/pypi/pyjwt/1.6.1"
{}
Wait for analyses to complete.
Go to the Jobs UI and check the bookkeeping data for this package. You will see all the workers that were executed for pyjwt.
Finally, run component analysis for PyJWT:
curl -XPOST "http://localhost:32000/api/v1/component-analyses/pypi/PyJWT/1.6.1"
{}
Wait for analyses to complete.
Go to the Jobs UI and check the bookkeeping data for this package. You will see no workers that were executed for PyJWT:
{
"error": "No result found."
}
Now run the analyses using the Jobs UI at http://localhost:34000/api/v1/jobs/selective-flow-scheduling?state=running with the following payload:
{
"flow_arguments": [
{
"ecosystem": "pypi",
"force": true,
"force_graph_sync": true,
"name": "pyjwt",
"recursive_limit": 0,
"version": "1.6.1"
},
{
"ecosystem": "pypi",
"force": true,
"force_graph_sync": true,
"name": "PyJWT",
"recursive_limit": 0,
"version": "1.6.1"
}
],
"flow_name": "bayesianFlow",
"run_subsequent": false,
"task_names": []
}
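For reference, the same payload can also be submitted from a small script instead of the Jobs UI; a hedged example against the local endpoint above (no auth token is assumed for the local setup):

import requests

payload = {
    "flow_arguments": [
        {"ecosystem": "pypi", "force": True, "force_graph_sync": True,
         "name": "pyjwt", "recursive_limit": 0, "version": "1.6.1"},
        {"ecosystem": "pypi", "force": True, "force_graph_sync": True,
         "name": "PyJWT", "recursive_limit": 0, "version": "1.6.1"},
    ],
    "flow_name": "bayesianFlow",
    "run_subsequent": False,
    "task_names": [],
}

# Schedule both flows, mirroring the Jobs UI call used above.
resp = requests.post(
    "http://localhost:34000/api/v1/jobs/selective-flow-scheduling",
    params={"state": "running"},
    json=payload,
    timeout=30,
)
print(resp.status_code, resp.json())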
Wait for analyses to complete.
This time, if you check the worker data for PyJWT as in Step 2 above, you will see that there is an entry for this package, but no workers were run with the name PyJWT:
{
"summary": {
"analysed_versions": [
"1.6.1"
],
"ecosystem": "pypi",
"package": "PyJWT",
"package_level_workers": [],
"package_version_count": 1
}
}
However, if you check the worker data for pyjwt as in Step 1 above, you will see that workers were run for this package.
This clearly means that case normalization is happening properly at the worker level.
When invoking analyses using the Component Analysis API we do not see this bug, because case normalization is handled there.
Invoking analyses using the Jobs API -- i.e. submitting jobs directly to the scheduler without first performing case normalization -- causes this issue.
Either perform case normalization before submitting jobs via the Jobs API, or perform case normalization at the topmost node of the flow.
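A minimal sketch of the first option, assuming the flow arguments have the shape shown in the payload above (the helper name is made up here; this is not the actual fix in fabric8-analytics-jobs):

from packaging.utils import canonicalize_name  # PEP 503 rule: 'PyJWT' -> 'pyjwt'

def normalize_flow_arguments(flow_arguments):
    """Normalize package names before the arguments are handed to the flow
    scheduler, mirroring what the Component Analysis API already does."""
    for args in flow_arguments:
        if args.get("ecosystem") == "pypi" and args.get("name"):
            args["name"] = canonicalize_name(args["name"])
    return flow_arguments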
@humaton any update on this bug? We have been carrying this Sev2 issue over for some sprints now. cc @msrb
@sivaavkd there is a fix for it here: fabric8-analytics/fabric8-analytics-jobs/pull/287
But this issue will be present in any future code that schedules flows directly without going through the Jobs or Server API.
Planner ID: # 2099
Description
Package names in PyPI are case insensitive, i.e. PyJWT and pyjwt are the same package in the PyPI world. We normalize Python package names when an analysis request comes in, but later we seem to work with the original package name that was given to us by the requester. This means that we can analyze the same package multiple times and thus probably end up with multiple entries in the graph database.

It's possible that this issue also affects other ecosystems, not just PyPI. We need to check and either fix the issue for the other ecosystems as well (if easy), or create separate issue(s) so we can tackle them later.