openshiftio / openshift.io

Red Hat OpenShift.io is an end-to-end development environment for planning, building and deploying modern applications.
https://openshift.io

The 'latest_version' property has empty value in the stage graph database #2012

Closed tisnik closed 6 years ago

tisnik commented 6 years ago

The latest_version values for packages seem to be inconsistent, at least in the Stage graph database. Sometimes the value is simply set to an empty string.

Reproducer:

import requests

URL = "http://STAGE_DATABASE"

def gremlin_search_package_in_ecosystem(ecosystem, package):
    """Search packages from the selected ecosystem."""
    query = 'g.V().has("ecosystem", "{ecosystem}").has("name", "{package}")'.\
        format(ecosystem=ecosystem, package=package)
    print(query)
    data = post_query(query)
    try:
        print("*** " + package + " ***")
        assert data["result"]["data"] is not None
        properties = data["result"]["data"][0]["properties"]
        if "latest_version" in properties:
            latest_version = properties["latest_version"][0]["value"]
            if latest_version == "":
                print("latest_version: EMPTY!")
                print()
                return 1
            else:
                print("latest_version: {v}".format(v=latest_version))
                print()
                return 0
        else:
            print("latest_version attribute does not exist!!!")
            print()
            return 1
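    except Exception as e:
        # Count a failed lookup as an error so the caller never adds None.
        print("lookup failed: {e}".format(e=e))
        return 1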

def post_query(query):
    """Post the query to the Gremlin."""
    data = {"gremlin": str(query)}
    response = requests.post(URL, json=data)
    # print(response.status_code)
    data = response.json()
    return data

packages = [
    "sequence",
    "array-differ",
    "array-flatten",
    "array-reduce",
    "array-slice",
    "array-union",
    "array-uniq",
    "array-unique",
    "lodash",
    "lodash.assign",
    "lodash.assignin",
    "lodash._baseuniq",
    "lodash.bind",
    "lodash.camelcase",
    "lodash.clonedeep",
    "lodash.create",
    "lodash._createset",
    "lodash.debounce",
    "lodash.defaults",
    "lodash.filter",
    "lodash.findindex",
    "lodash.flatten",
    "lodash.foreach",
    "lodash.isplainobject",
    "lodash.mapvalues",
    "lodash.memoize",
    "lodash.mergewith",
    "lodash.once",
    "lodash.pick",
    "lodash._reescape",
    "lodash._reevaluate",
    "lodash._reinterpolate",
    "lodash.reject",
    "lodash._root",
    "lodash.some",
    "lodash.tail",
    "lodash.template",
    "lodash.union",
    "lodash.without",
    "npm",
    "underscore"
]

errors = 0
for package in packages:
    errors += gremlin_search_package_in_ecosystem("npm", package)

print("Found {n} errors in {p} packages".format(n=errors, p=len(packages)))

Actual output:

Found 25 errors in 41 packages

Expected output:

Found 0 errors in 41 packages

tisnik commented 6 years ago

@msrb FYI

msrb commented 6 years ago

@miteshvp can we run the reproducer also against prod graph DB?

miteshvp commented 6 years ago

can we run the reproducer also against prod graph DB?

Yes, definitely, but if I may ask, what is the purpose of this exercise? FWIW, I've got the first 50 packages in prod that have the same issue, with latest_version set to ''. We need to cross-check each of them against the metajson in the S3 buckets for the latest version. I checked one, msv:relaxngDatatype version 20030807, in the file 20030807.json, and latest_version is null there. So this looks more like an expected scenario. Anyway, here are the first 50 packages:

{
    "data": ["stax:stax-ri", "msv:relaxngDatatype", "jaxme:jaxme-api", "de.wayofquality.blended:blended.util", "msv:xsdlib", "dom4j:dom4j:jar", "com.buschmais.jqassistant.core:jqassistant.core.analysis", "edu.ucar:cdm:test-jar", "org.kaazing:gateway.transport", "antlr:antlr:jar", "xerces:xerces-impl", "org.apache.oltu.oauth2:org.apache.oltu.oauth2.authzserver", "net.databinder:unfiltered_2.11", "de.schlichtherle.truezip:truezip-file:test-jar", "de.schlichtherle.truezip:truezip-swing:test-jar", "de.flapdoodle.embed:de.flapdoodle.embed.mongo:jar", "io.dropwizard.metrics:metrics-ganglia:jar", "org.apache.geronimo.specs:geronimo-jpa_2.0_spec:sources", "org.apache.geronimo.specs:geronimo-jms_1.1_spec:sources", "de.schlichtherle.truezip:truezip-path:test-jar", "com.google.javascript:closure-compiler-unshaded", "io.dropwizard.metrics:metrics-healthchecks:jar", "groovy:groovy-all:jar", "io.atlassian:kadai-logging-core_2.11", "org.jflux:org.jflux.api.data", "net.osgiliath.framework:net.osgiliath.helpers.manifest.transformer", "org.appdapter:ext.bundle.xml.dom4j_161", "org.apache.karaf:apache-karaf-minimal:tar.gz", "org.apache.geronimo.specs:geronimo-jaxws_2.2_spec", "org.apache.geronimo.specs:geronimo-atinject_1.0_spec", "org.apache.karaf.system:org.apache.karaf.system.core", "org.apache.airavata:airavata-common-utils", "org.appdapter:ext.bundle.semweb4j.jena", "com.lihaoyi:pprint_2.10", "io.dropwizard.metrics:metrics-jvm:jar", "io.fabric8:kubernetes-client:test-jar", "org.jboss.arquillian.container:arquillian-weld-ee-embedded-1.1", "org.tinygroup:org.tinygroup.exception", "com.h2database:h2:jar", "org.ow2.frascati:frascati-component-factory-juliac-tinfi-oo", "com.hazelcast:hazelcast::tests", "io.fabric8:zjsonpatch", "org.apache.karaf.itests:itests:test-jar", "com.typesafe.play:play-streams_2.11", "org.apache.aries.blueprint:org.apache.aries.blueprint.core", "org.apache.felix:org.apache.felix.scr.annotations", "org.eclipse.neoscada.base:org.eclipse.scada.sec.utils", 
"com.hazelcast:hazelcast-hibernate5:jar", "org.cogchar:ext.bundle.ontoware", "com.wordnik:swagger-annotations_2.10"]
}

tisnik commented 6 years ago

@msrb @miteshvp we need to figure out where this issue originates (data crawlers? Postgres->S3? S3->Graph?). Most packages checked by the script have >= 2 versions stored in the graph DB, so the info about latest_version is available.

msrb commented 6 years ago

@miteshvp not having latest version seems like a bug since all packages in our DB should have at least one release.

Since @tisnik found this discrepancy in the staging DB, we want to check whether we have the same problem in prod as well. This looks like a data ingestion bug to me.

miteshvp commented 6 years ago

FWIW, there is no data for stax:stax-ri in the prod S3 buckets, but it is still present in the graph. What could be the reason? Second, as I mentioned, I checked package msv:relaxngDatatype, version 20030807, and its latest_version is null. This package was ingested on 2017-07-26 17:12:59.549999. To me, this explains why latest_version has no value, based on https://github.com/fabric8-analytics/fabric8-analytics-data-model/blob/e40952bbe33f43c3a871162f40ed6785481f0387/src/graph_populator.py#L94. We do have a check now, though: https://github.com/fabric8-analytics/fabric8-analytics-data-model/blob/master/src/graph_populator.py#L153-L155. @msrb - a question for you: is it a good idea to replace latest_version with the version of the EPV being synced if latest_version is null or not known? EDITED: I've pasted the link to the bucket separately.

msrb commented 6 years ago

is it a good idea to replace latest_version with the version of EPV being synced if latest_version is null or not known?

I don't think so. We should always have the latest version. Even if a package has only one release, that release is also the latest version. I suspect this is a bug somewhere in our pipeline.

msrb commented 6 years ago

FWIW, there is no data for stax:stax-ri in prod S3 buckets. But they are still there in graph.

This is super weird. Do we have timestamps in graph? Was this package added to graph last year, pre-summit?

miteshvp commented 6 years ago

Do we have timestamps in graph?

2017-03-28 14:29:37.369750

tisnik commented 6 years ago

@msrb last_updated property? https://github.com/fabric8-analytics/fabric8-analytics-data-model/blob/e40952bbe33f43c3a871162f40ed6785481f0387/src/graph_populator.py#L100

(btw we really need to use UTC everywhere)
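A minimal sketch of what "UTC everywhere" could look like on the Python side. The helper names are hypothetical, and treating the naive timestamp quoted earlier in this thread as UTC is purely an assumption; the thread does not say which timezone it was recorded in.

```python
from datetime import datetime, timezone


def utc_now_iso():
    """Return the current time as a timezone-aware ISO 8601 string in UTC."""
    return datetime.now(timezone.utc).isoformat()


def parse_naive_as_utc(text):
    """Parse 'YYYY-MM-DD HH:MM:SS.ffffff' and label it as UTC.

    Assumption: the stored value was recorded in UTC. If it was written in
    local time, attaching UTC here would silently shift its meaning.
    """
    naive = datetime.strptime(text, "%Y-%m-%d %H:%M:%S.%f")
    return naive.replace(tzinfo=timezone.utc)
```

Storing timezone-aware values makes timestamps from different services directly comparable, which is what a "when was this vertex ingested?" question needs.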

msrb commented 6 years ago

Increasing severity. We need to find out where the problem is and fix it.

sivaavkd commented 6 years ago

@msrb @miteshvp - since we bumped this up to Sev2, we need to find where the problem is and fix it.

sivaavkd commented 6 years ago

@msrb @miteshvp - any update on this, please? Are we focusing on this in the current sprint? cc @krishnapaparaju

miteshvp commented 6 years ago

@sivaavkd - I will wait for @msrb updates here.

msrb commented 6 years ago

We talked about multiple things here. One by one:

1. Regarding stax:stax-ri: no idea, there is no such artifact in Maven Central. Also, the timestamp pointing to March 2017 as the day of ingestion is fishy, as I don't think we had a production deployment back then. My guess is that when we migrated to AWS, people were developing against the "devel" graph instance (the only one we had back then), and that instance was later turned into the staging and subsequently the production one. @miteshvp am I correct that the stax:stax-ri vertex doesn't actually contain any data in the graph? I don't think this is a bug in the pipeline that is currently running in production, as we never call data-importer for non-existent packages. I think we should extend the data-importer API so we can easily clean up oddities like this one. Since @tuxdna recently synced all Maven data from S3 to the graph, we could automatically delete all vertices that were not updated in the past 2 months or so.

2. The latest_version property can be missing for many packages that were ingested around summit/summer last year and have not been updated since. AFAIK, the latest_version key was previously provided by Anitya (meta.json on S3), but we dropped Anitya completely as it was not reliable. Vertices should now have a libio_latest_version property; latest_version is deprecated. Note that packages that were not updated (meaning there was no new release) in the past ~6 months can still have an empty latest_version and no libio_latest_version. If we could get the list of such packages (without the libio_latest_version property) from the graph, we could "fix" them by forcing a new analysis. @miteshvp Is this anything you could help with?
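The list of packages without libio_latest_version mentioned above might be fetched along these lines. This is only a sketch: it assumes package vertices carry the "ecosystem" and "name" properties used in the reproducer, that the server supports the Gremlin hasNot() step, and that results come back under result.data. The function names are hypothetical, and the standard library is used instead of requests just to keep the example dependency-free.

```python
import json
import urllib.request


def build_missing_libio_query(ecosystem):
    """Build a Gremlin query listing package names without libio_latest_version."""
    # Assumption: same "ecosystem"/"name" vertex properties as the reproducer.
    return ('g.V().has("ecosystem", "{e}").hasNot("libio_latest_version")'
            '.values("name")').format(e=ecosystem)


def extract_names(response_json):
    """Return the list of names from a Gremlin HTTP response body."""
    # Assumption: results are stored under result.data, as in the reproducer.
    result = response_json.get("result") or {}
    return list(result.get("data") or [])


def packages_missing_libio(url, ecosystem):
    """Post the query to the given Gremlin endpoint (performs a network call)."""
    body = json.dumps({"gremlin": build_missing_libio_query(ecosystem)})
    request = urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return extract_names(json.loads(response.read().decode("utf-8")))
```

The returned names could then be fed to whatever mechanism forces a new analysis. An unbounded g.V() scan may be slow on a large graph, so paging the traversal is probably worth considering.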

To summarize, I don't see any bug that we could fix here. latest_version property should be ignored as it is deprecated and Anitya is no longer part of our deployment.

@tisnik does this answer all your questions?

msrb commented 6 years ago

TL;DR:

latest_version property is deprecated. Please look for libio_latest_version instead.
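For a checker like the reproducer at the top of this issue, reading the new property could be sketched as follows. The vertex layout (properties -> key -> list of {"value": ...}) is taken from the reproducer; the function name is hypothetical, and treating an empty string as "missing" is an assumption made for illustration.

```python
def libio_latest_version(vertex):
    """Return the vertex's libio_latest_version, or None if absent or empty.

    The deprecated latest_version property is deliberately not consulted.
    """
    properties = vertex.get("properties") or {}
    entries = properties.get("libio_latest_version") or []
    if not entries:
        return None
    value = entries[0].get("value")
    # Treat an empty string the same way as a missing property.
    return value if value not in (None, "") else None
```

In the reproducer, a check like this would replace the properties["latest_version"][0]["value"] lookup.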

Feel free to reopen if I missed something here. Thanks for reporting these anomalies :wink: