openshiftio / openshift.io

Red Hat OpenShift.io is an end-to-end development environment for planning, building and deploying modern applications.
https://openshift.io

Sync Maven data to graph #2048

Closed msrb closed 6 years ago

msrb commented 6 years ago

Description

There were times when the data ingestion pipeline was broken or certain parts of it were disabled. During those times, we analyzed plenty of packages and stored the results in S3, but never ingested the data into the graph database. With https://github.com/openshiftio/openshift.io/issues/1085 implemented, we want to sync all missing data from S3 to the graph.

Acceptance criteria

abs51295 commented 6 years ago

Fixed a typo @msrb :)

msrb commented 6 years ago

@sivaavkd description updated, is it better now?

tuxdna commented 6 years ago

Carrying forward from - https://github.com/openshiftio/openshift.io/issues/1085#issuecomment-361237441

Some intermittent failures were encountered in the graph layer. These seem to happen only rarely:


g.V().has('ecosystem','maven').has('name','org.apache.tomcat:tomcat-servlet-api').properties('tokens','libio_usedby').drop().iterate();pkg = g.V().has('ecosystem','maven').has('name', 'org.apache.tomcat:tomcat-servlet-api').tryNext().orElseGet{graph.addVertex('ecosystem', 'maven', 'name', 'org.apache.tomcat:tomcat-servlet-api', 'vertex_label', 'Package')};pkg.property('last_updated', 1517286212.43);pkg.property('tokens', 'org'); pkg.property('tokens', 'apache'); pkg.property('tokens', 'tomcat'); pkg.property('tokens', 'tomcat'); pkg.property('tokens', 'servlet'); pkg.property('tokens', 'api');pkg.property('latest_version', '9.0.0.M17');pkg.property('libio_latest_release', '1500854400.0');pkg.property('libio_usedby', 'keycloak/keycloak:1359');pkg.property('libio_usedby', 'SungardAS/enhanced-snapshots:35');pkg.property('libio_usedby', 'cf-unik/unik:1239');pkg.property('libio_usedby', 'entando/entando-components:15');pkg.property('libio_usedby', 'indeedeng/proctor:198');pkg.property('libio_usedby', 'magro/memcached-session-manager:552');pkg.property('libio_usedby', 'nysenate/OpenLegislation:163');pkg.property('libio_usedby', 'Red5/red5-server:1254');pkg.property('libio_usedby', 'aspose-words/Aspose.Words-for-Java:52');pkg.property('libio_usedby', 'google/identity-toolkit-java-client:32');pkg.property('libio_dependents_projects', '65');pkg.property('libio_dependents_repos', '2.05K');pkg.property('libio_total_releases', '124');pkg.property('libio_latest_version', '8.5.19');g.V().has('pecosystem','maven').has('pname','org.apache.tomcat:tomcat-servlet-api').has('version','8.5.19').property('gh_release_date', 
1500854400.0);g.V().has('pecosystem','maven').has('pname','org.apache.tomcat:tomcat-servlet-api').has('version','9.0.0.M25').property('gh_release_date',1500854400.0);g.V().has('pecosystem','maven').has('pname','org.apache.tomcat:tomcat-servlet-api').has('version','9.0.0.M19').property('gh_release_date',1490572800.0);g.V().has('pecosystem','maven').has('pname','org.apache.tomcat:tomcat-servlet-api').has('version','8.
5.16').property('gh_release_date',1498003200.0);g.V().has('pecosystem','maven').has('pname','org.apache.tomcat:tomcat-servlet-api').has('version','8.5.14').property('gh_release_date',1492041600.0);g.V().has('pecosystem','maven').has('pname','org.apache.tomcat:tomcat-servlet-api').has('version','8.5.15').property('gh_release_date',1493942400.0);g.V().has('pecosystem','maven').has('pname','org.apache.tomcat:tomcat-servlet-api').has('version','9.0.0.M21').property('gh_release_date',1493856000.0);g.V().has('pecosystem','maven').has('pname','org.apache.tomcat:tomcat-servlet-api').has('version','9.0.0.M20').property('gh_release_date',1491955200.0);g.V().has('pecosystem','maven').has('pname','org.apache.tomcat:tomcat-servlet-api').has('version','9.0.0.M22').property('gh_release_date',1498003200.0);g.V().has('pecosystem','maven').has('pname','org.apache.tomcat:tomcat-servlet-api').has('version','8.5.13').property('gh_release_date',1490572800.0);g.V().has('pecosystem','maven').has('pname','org.apache.tomcat:tomcat-servlet-api').has('version','9.0.0.M1').properties('licenses','cve_ids','declared_licenses').drop().iterate();ver = g.V().has('pecosystem', 'maven').has('pname', 'org.apache.tomcat:tomcat-servlet-api').has('version', '9.0.0.M1').tryNext().orElseGet{graph.addVertex('pecosystem','maven', 'pname','org.apache.tomcat:tomcat-servlet-api', 'version', '9.0.0.M1', 'vertex_label', 'Version')};ver.property('last_updated',1517286212.43);ver.property('description','javax.servlet package');ver.property('cm_num_files',112);ver.property('cm_avg_cyclomatic_complexity', 1.23);ver.property('cm_loc',42622);ver.property('licenses', 'ASL 2.0'); ver.property('licenses', 'CDDL');ver.property('cve_ids', 'CVE-2017-6056:5.0'); ver.property('cve_ids', 'CVE-2016-8735:7.5'); ver.property('cve_ids', 'CVE-2016-6816:6.8'); ver.property('cve_ids', 'CVE-2016-6325:7.2'); ver.property('cve_ids', 'CVE-2016-5425:7.2'); ver.property('cve_ids', 'CVE-2016-3092:7.8'); ver.property('cve_ids', 
'CVE-2016-0763:6.5'); ver.property('cve_ids', 'CVE-2016-0714:6.
5'); ver.property('cve_ids', 'CVE-2016-0706:4.0'); ver.property('cve_ids', 'CVE-2015-5351:6.8'); ver.property('cve_ids', 'CVE-2015-5346:6.8'); ver.property('cve_ids', 'CVE-2015-5345:5.0');ver.property('declared_licenses', 'Apache License'); ver.property('declared_licenses', ' Version 2.0 and
        Common Development And Distribution License (CDDL) Version 1.0');lic = g.V().has('lname', 'Apache License').tryNext().orElseGet{graph.addVertex('vertex_label', 'License', 'lname', 'Apache License', 'last_updated',1517286212.43)}; g.V(ver).out('has_declared_license').has('lname', 'Apache License').tryNext().orElseGet{ver.addEdge('has_declared_license', lic)};lic = g.V().has('lname', ' Version 2.0 and
        Common Development And Distribution License (CDDL) Version 1.0').tryNext().orElseGet{graph.addVertex('vertex_label', 'License', 'lname', ' Version 2.0 and
        Common Development And Distribution License (CDDL) Version 1.0', 'last_updated',1517286212.43)}; g.V(ver).out('has_declared_license').has('lname', ' Version 2.0 and
        Common Development And Distribution License (CDDL) Version 1.0').tryNext().orElseGet{ver.addEdge('has_declared_license', lic)};edge_c = g.V().has('pecosystem','maven').has('pname','org.apache.tomcat:tomcat-servlet-api').has('version','9.0.0.M1').in('has_version').tryNext().orElseGet{pkg.addEdge('has_version', ver)};
ERROR:data_importer:The import failed: 'status'
ERROR:data_importer:Traceback for latest failure in import call: Traceback (most recent call last):
  File "/src/data_importer.py", line 101, in _import_keys_from_s3_http
    if resp['status']['code'] == 200:
KeyError: 'status'

g.V().has('ecosystem','maven').has('name','org.apache.tomcat:tomcat-servlet-api').properties('tokens','libio_usedby').drop().iterate();pkg = g.V().has('ecosystem','maven').has('name', 'org.apache.tomcat:tomcat-servlet-api').tryNext().orElseGet{graph.addVertex('ecosystem', 'maven', 'name', 'org.apache.tomcat:tomcat-servlet-api', 'vertex_label', 'Package')};pkg.property('last_updated', 1517286213.63);pkg.property('tokens', 'org'); pkg.property('tokens', 'apache'); pkg.property('tokens', 'tomcat'); pkg.property('tokens', 'tomcat'); pkg.property('tokens', 'servlet'); pkg.property('tokens', 'api');pkg.property('latest_version', '9.0.0.M17');pkg.property('libio_latest_release', '1500854400.0');pkg.property('libio_usedby', 'keycloak/keycloak:1359');pkg.property('libio_usedby', 'SungardAS/enhanced-snapshots:35');pkg.property('libio_usedby', 'cf-unik/unik:1239');pkg.property('libio_usedby', 'entando/entando-components:15');pkg.property('libio_usedby', 'indeedeng/proctor:198');pkg.property('libio_usedby', 'magro/memcached-session-manager:552');pkg.property('libio_usedby', 'nysenate/OpenLegislation:163');pkg.property('libio_usedby', 'Red5/red5-server:1254');pkg.property('libio_usedby', 'aspose-words/Aspose.Words-for-Java:52');pkg.property('libio_usedby', 'google/identity-toolkit-java-client:32');pkg.property('libio_dependents_projects', '65');pkg.property('libio_dependents_repos', '2.05K');pkg.property('libio_total_releases', '124');pkg.property('libio_latest_version', '8.5.19');g.V().has('pecosystem','maven').has('pname','org.apache.tomcat:tomcat-servlet-api').has('version','8.5.19').property('gh_release_date', 
1500854400.0);g.V().has('pecosystem','maven').has('pname','org.apache.tomcat:tomcat-servlet-api').has('version','9.0.0.M25').property('gh_release_date',1500854400.0);g.V().has('pecosystem','maven').has('pname','org.apache.tomcat:tomcat-servlet-api').has('version','9.0.0.M19').property('gh_release_date',1490572800.0);g.V().has('pecosystem','maven').has('pname','org.apache.tomcat:tomcat-servlet-api').has('version','8.
5.16').property('gh_release_date',1498003200.0);g.V().has('pecosystem','maven').has('pname','org.apache.tomcat:tomcat-servlet-api').has('version','8.5.14').property('gh_release_date',1492041600.0);g.V().has('pecosystem','maven').has('pname','org.apache.tomcat:tomcat-servlet-api').has('version','8.5.15').property('gh_release_date',1493942400.0);g.V().has('pecosystem','maven').has('pname','org.apache.tomcat:tomcat-servlet-api').has('version','9.0.0.M21').property('gh_release_date',1493856000.0);g.V().has('pecosystem','maven').has('pname','org.apache.tomcat:tomcat-servlet-api').has('version','9.0.0.M20').property('gh_release_date',1491955200.0);g.V().has('pecosystem','maven').has('pname','org.apache.tomcat:tomcat-servlet-api').has('version','9.0.0.M22').property('gh_release_date',1498003200.0);g.V().has('pecosystem','maven').has('pname','org.apache.tomcat:tomcat-servlet-api').has('version','8.5.13').property('gh_release_date',1490572800.0);g.V().has('pecosystem','maven').has('pname','org.apache.tomcat:tomcat-servlet-api').has('version','9.0.0.M11').properties('licenses','cve_ids','declared_licenses').drop().iterate();ver = g.V().has('pecosystem', 'maven').has('pname', 'org.apache.tomcat:tomcat-servlet-api').has('version', '9.0.0.M11').tryNext().orElseGet{graph.addVertex('pecosystem','maven', 'pname','org.apache.tomcat:tomcat-servlet-api', 'version', '9.0.0.M11', 'vertex_label', 'Version')};ver.property('last_updated',1517286213.63);ver.property('description','javax.servlet package');ver.property('cm_num_files',114);ver.property('cm_avg_cyclomatic_complexity', 1.23);ver.property('cm_loc',42800);ver.property('licenses', 'ASL 2.0'); ver.property('licenses', 'CDDL');ver.property('cve_ids', 'CVE-2017-6056:5.0'); ver.property('cve_ids', 'CVE-2016-8747:5.0'); ver.property('cve_ids', 'CVE-2016-8735:7.5'); ver.property('cve_ids', 'CVE-2016-6816:6.8'); ver.property('cve_ids', 'CVE-2016-6325:7.2'); ver.property('cve_ids', 'CVE-2016-5425:7.2');ver.property('declared_licenses', 
'Apache License'); ver.property('declared_licenses'
, ' Version 2.0 and
        Common Development And Distribution License (CDDL) Version 1.0');lic = g.V().has('lname', 'Apache License').tryNext().orElseGet{graph.addVertex('vertex_label', 'License', 'lname', 'Apache License', 'last_updated',1517286213.63)}; g.V(ver).out('has_declared_license').has('lname', 'Apache License').tryNext().orElseGet{ver.addEdge('has_declared_license', lic)};lic = g.V().has('lname', ' Version 2.0 and
        Common Development And Distribution License (CDDL) Version 1.0').tryNext().orElseGet{graph.addVertex('vertex_label', 'License', 'lname', ' Version 2.0 and
        Common Development And Distribution License (CDDL) Version 1.0', 'last_updated',1517286213.63)}; g.V(ver).out('has_declared_license').has('lname', ' Version 2.0 and
        Common Development And Distribution License (CDDL) Version 1.0').tryNext().orElseGet{ver.addEdge('has_declared_license', lic)};edge_c = g.V().has('pecosystem','maven').has('pname','org.apache.tomcat:tomcat-servlet-api').has('version','9.0.0.M11').in('has_version').tryNext().orElseGet{pkg.addEdge('has_version', ver)};
ERROR:data_importer:The import failed: HTTPConnectionPool(host='172.30.80.86', port=8182): Read timed out. (read timeout=30)
ERROR:data_importer:Traceback for latest failure in import call: Traceback (most recent call last):
  File "/src/data_importer.py", line 98, in _import_keys_from_s3_http
    data=json.dumps(payload), timeout=30)
  File "/usr/lib/python2.7/site-packages/requests/api.py", line 112, in post
    return request('post', url, data=data, json=json, **kwargs)
  File "/usr/lib/python2.7/site-packages/requests/api.py", line 58, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 508, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 618, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python2.7/site-packages/requests/adapters.py", line 521, in send
    raise ReadTimeout(e, request=request)
ReadTimeout: HTTPConnectionPool(host='172.30.80.86', port=8182): Read timed out. (read timeout=30)
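
The first traceback above fails while reading `resp['status']['code']`, which suggests the server returned a JSON body without a `status` key. A minimal sketch of reading the status defensively follows, written in Python 3 rather than the Python 2.7 of the traceback. `extract_status_code` is a hypothetical helper, and the shape of the error body is an assumption inferred from the `KeyError`, not taken from the importer's code.

```python
def extract_status_code(resp):
    """Return resp['status']['code'] when present, otherwise None.

    Avoids the KeyError seen in the traceback when the server answers
    with a body that has no 'status' entry.
    """
    status = resp.get("status") if isinstance(resp, dict) else None
    if not isinstance(status, dict):
        return None
    code = status.get("code")
    return code if isinstance(code, int) else None


# A successful response body, shaped like the one the importer expects.
ok_body = {"status": {"code": 200, "message": ""}, "result": {"data": []}}

# Assumed shape of an error body: no 'status' key at all.
error_body = {"message": "startup failed"}

for body in (ok_body, error_body):
    code = extract_status_code(body)
    if code == 200:
        print("import succeeded")
    else:
        print("import failed, body keys:", sorted(body))
```

The read timeout in the second traceback is a different case: the exception is raised before any response body exists, so it would need its own `except` branch (or a retry) around the HTTP call.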

tuxdna commented 6 years ago

Figured out the cause of the above error from the Gremlin server logs:

39989555 [gremlin-server-worker-1] WARN  org.apache.tinkerpop.gremlin.server.handler.HttpGremlinEndpointHandler  - Invalid request - responding with 500 Internal Server Error and startup failed:
Script3023.groovy: 1: expecting ''', found '\n' @ line 1, column 4116.
   d_licenses', ' Version 2.0 and
                                 ^

1 error

org.codehaus.groovy.control.MultipleCompilationErrorsException: startup failed:
Script3023.groovy: 1: expecting ''', found '\n' @ line 1, column 4116.
   d_licenses', ' Version 2.0 and
                                 ^

1 error

    at org.codehaus.groovy.control.ErrorCollector.failIfErrors(ErrorCollector.java:310)
    at org.codehaus.groovy.control.ErrorCollector.addFatalError(ErrorCollector.java:150)
    at org.codehaus.groovy.control.ErrorCollector.addError(ErrorCollector.java:120)
    at org.codehaus.groovy.control.ErrorCollector.addError(ErrorCollector.java:132)
    at org.codehaus.groovy.control.SourceUnit.addError(SourceUnit.java:360)
    at org.codehaus.groovy.antlr.AntlrParserPlugin.transformCSTIntoAST(AntlrParserPlugin.java:140)
    at org.codehaus.groovy.antlr.AntlrParserPlugin.parseCST(AntlrParserPlugin.java:111)
    at org.codehaus.groovy.control.SourceUnit.parse(SourceUnit.java:237)
    at org.codehaus.groovy.control.CompilationUnit$1.call(CompilationUnit.java:167)
    at org.codehaus.groovy.control.CompilationUnit.applyToSourceUnits(CompilationUnit.java:931)
    at org.codehaus.groovy.control.CompilationUnit.doPhaseOperation(CompilationUnit.java:593)
    at org.codehaus.groovy.control.CompilationUnit.processPhaseOperations(CompilationUnit.java:569)
    at org.codehaus.groovy.control.CompilationUnit.compile(CompilationUnit.java:546)
    at groovy.lang.GroovyClassLoader.doParseClass(GroovyClassLoader.java:298)
    at groovy.lang.GroovyClassLoader.parseClass(GroovyClassLoader.java:268)
    at groovy.lang.GroovyClassLoader.parseClass(GroovyClassLoader.java:254)
    at groovy.lang.GroovyClassLoader.parseClass(GroovyClassLoader.java:211)
    at org.apache.tinkerpop.gremlin.groovy.jsr223.GremlinGroovyScriptEngine.getScriptClass(GremlinGroovyScriptEngine.java:527)
    at org.apache.tinkerpop.gremlin.groovy.jsr223.GremlinGroovyScriptEngine.eval(GremlinGroovyScriptEngine.java:446)
    at javax.script.AbstractScriptEngine.eval(AbstractScriptEngine.java:233)
    at org.apache.tinkerpop.gremlin.groovy.engine.ScriptEngines.eval(ScriptEngines.java:119)
    at org.apache.tinkerpop.gremlin.groovy.engine.GremlinExecutor.lambda$eval$2(GremlinExecutor.java:287)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Essentially, the queries are not formed correctly. The existing code here should factor in special characters like newlines:

https://github.com/fabric8-analytics/fabric8-analytics-data-model/blame/51ae4e1ecbd03e352a1a8e63a003bf2d7264a1a1/src/graph_populator.py#L112-L124

                drop_props.append('declared_licenses')

                prp_version += " ".join(["ver.property('declared_licenses', '{}');".format
                                         (dl) for dl in declared_licenses])
                # Create License Node and edge from EPV
                for lic in declared_licenses:
                    prp_version += "lic = g.V().has('lname', '{lic}').tryNext().orElseGet{{" \
                                   "graph.addVertex('vertex_label', 'License', 'lname', '{lic}', " \
                                   "'last_updated',{last_updated})}}; g.V(ver).out(" \
                                   "'has_declared_license').has('lname', '{lic}').tryNext()." \
                                   "orElseGet{{ver.addEdge('has_declared_license', lic)}};".format(
                                       lic=lic, last_updated=str(time.time())
                                   )
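
One way to keep such values from breaking the generated script is to escape them before they are formatted into a single-quoted Groovy literal. The sketch below is an illustration under that assumption and is not the fix that was applied in the repository; `escape_groovy_single_quoted` is a hypothetical helper name.

```python
def escape_groovy_single_quoted(value):
    """Escape a string for use inside a single-quoted Groovy literal.

    Backslashes are handled first so that the escapes added afterwards
    are not escaped a second time.
    """
    return (
        value.replace("\\", "\\\\")
        .replace("'", "\\'")
        .replace("\r", "\\r")
        .replace("\n", "\\n")
    )


# The value that broke the script: it contains a real newline.
raw_license = (
    " Version 2.0 and\n"
    "        Common Development And Distribution License (CDDL) Version 1.0"
)

# Same template as in the quoted populator code, with the value escaped.
statement = "ver.property('declared_licenses', '{}');".format(
    escape_groovy_single_quoted(raw_license)
)
print(statement)
```

The generated statement now stays on one line, so the Groovy parser no longer meets a raw newline inside the quoted string. Gremlin Server's HTTP endpoint also accepts a `bindings` map next to the `gremlin` script, which would avoid embedding values in the script text altogether; whether that suits the populator is not something this thread establishes.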

This happens for the following package:

has('pecosystem','maven').has('pname','org.apache.tomcat:tomcat-servlet-api').has('version','8.5.14')

tuxdna commented 6 years ago

The metadata.json for the above EPV contains:

"declared_license": "Apache License, Version 2.0 and\n        Common Development And Distribution License (CDDL) Version 1.0",

Notice the newline.
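
The following sketch shows how that JSON escape turns into a real newline once parsed, and how collapsing whitespace removes it. The split on commas is an assumption inferred from the two `declared_licenses` values in the logged query; the thread does not show the splitting code itself.

```python
import json

# Raw string so that \n stays a JSON escape until json.loads decodes it.
metadata_fragment = (
    r'{"declared_license": "Apache License, Version 2.0 and\n'
    r'        Common Development And Distribution License (CDDL) Version 1.0"}'
)

declared = json.loads(metadata_fragment)["declared_license"]

# Assumed splitting step: reproduces the two fragments seen in the query log.
parts = declared.split(",")

# Collapse every run of whitespace (including the newline) to one space.
cleaned = [" ".join(part.split()) for part in parts]

for before, after in zip(parts, cleaned):
    print(repr(before), "->", repr(after))
```

Normalizing whitespace this way changes the stored license name, so it is a design choice rather than a neutral fix; escaping the newline instead would preserve the original text.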

tuxdna commented 6 years ago

Fixed issue -

Other issues encountered

tuxdna commented 6 years ago

More issues encountered

tuxdna commented 6 years ago

The last time I checked the progress of the Maven graph sync, it had reached the package named org.ops4j.pax.exam:pax-exam-spi, which is ranked 104495 out of a total of 131460 Maven packages (in lexicographic order by name). That means about 79% of all Maven packages had been synced.

msrb commented 6 years ago

@miteshvp can we query the graph for the exact number of Maven packages/components, please?

tuxdna commented 6 years ago

In the first pass, fewer than 11402 packages remain to be processed by the Maven graph sync. That means at least

100 * (1 - 11402 / 131460) = 91.33 %

of the packages have been processed. Some packages may have been skipped due to Gateway Timeouts. We can sync those once the first pass is complete.
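
The progress figure can be reproduced with a few lines of Python 3. `synced_percent` is a hypothetical helper used only for illustration; the counts are the ones quoted in this thread.

```python
def synced_percent(remaining, total):
    """Percentage of packages already processed, given how many remain."""
    # True division is required here. Under Python 2.7, 11402 / 131460
    # with plain integers evaluates to 0 and the result would be 100.
    return 100.0 * (1 - remaining / total)


TOTAL_MAVEN_PACKAGES = 131460  # total quoted earlier in the thread
REMAINING_AFTER_FIRST_PASS = 11402  # upper bound quoted above

progress = synced_percent(REMAINING_AFTER_FIRST_PASS, TOTAL_MAVEN_PACKAGES)
print("{:.2f} %".format(progress))
```

Because 11402 is an upper bound on the remaining packages, the printed value is a lower bound on the share already processed.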

tuxdna commented 6 years ago

The first pass is complete. Many of the Maven packages failed to sync to the graph, for multiple reasons:

I have now scheduled the sync of those (pending) packages again.

msrb commented 6 years ago

Thanks Saleem 👍

I've created https://github.com/openshiftio/openshift.io/issues/2256 for improving test coverage in data-importer.

miteshvp commented 6 years ago

Late reply, but here it is: there are a total of 137389 Maven packages in our graph.

tuxdna commented 6 years ago

@miteshvp How did you figure this out?

msrb commented 6 years ago

Maven is in the graph, closing. Thanks @tuxdna :wink: