Closed tisnik closed 6 years ago
@msrb FYI
@miteshvp can we run the reproducer also against prod graph DB?
> can we run the reproducer also against prod graph DB?
Yes, definitely, but if I may ask, what is the use of this exercise? FWIW, I've got the first 50 packages in prod that have the same issue, with latest_version set to ''. We need to cross-check each of them against the meta JSON in the S3 buckets for the latest version. I checked one, package msv:relaxngDatatype and version 20030807, in the file 20030807.json, and latest_version is null there. So that is more of an expected scenario.
Anyway, here are the first 50 packages:
{
"data": ["stax:stax-ri", "msv:relaxngDatatype", "jaxme:jaxme-api", "de.wayofquality.blended:blended.util", "msv:xsdlib", "dom4j:dom4j:jar", "com.buschmais.jqassistant.core:jqassistant.core.analysis", "edu.ucar:cdm:test-jar", "org.kaazing:gateway.transport", "antlr:antlr:jar", "xerces:xerces-impl", "org.apache.oltu.oauth2:org.apache.oltu.oauth2.authzserver", "net.databinder:unfiltered_2.11", "de.schlichtherle.truezip:truezip-file:test-jar", "de.schlichtherle.truezip:truezip-swing:test-jar", "de.flapdoodle.embed:de.flapdoodle.embed.mongo:jar", "io.dropwizard.metrics:metrics-ganglia:jar", "org.apache.geronimo.specs:geronimo-jpa_2.0_spec:sources", "org.apache.geronimo.specs:geronimo-jms_1.1_spec:sources", "de.schlichtherle.truezip:truezip-path:test-jar", "com.google.javascript:closure-compiler-unshaded", "io.dropwizard.metrics:metrics-healthchecks:jar", "groovy:groovy-all:jar", "io.atlassian:kadai-logging-core_2.11", "org.jflux:org.jflux.api.data", "net.osgiliath.framework:net.osgiliath.helpers.manifest.transformer", "org.appdapter:ext.bundle.xml.dom4j_161", "org.apache.karaf:apache-karaf-minimal:tar.gz", "org.apache.geronimo.specs:geronimo-jaxws_2.2_spec", "org.apache.geronimo.specs:geronimo-atinject_1.0_spec", "org.apache.karaf.system:org.apache.karaf.system.core", "org.apache.airavata:airavata-common-utils", "org.appdapter:ext.bundle.semweb4j.jena", "com.lihaoyi:pprint_2.10", "io.dropwizard.metrics:metrics-jvm:jar", "io.fabric8:kubernetes-client:test-jar", "org.jboss.arquillian.container:arquillian-weld-ee-embedded-1.1", "org.tinygroup:org.tinygroup.exception", "com.h2database:h2:jar", "org.ow2.frascati:frascati-component-factory-juliac-tinfi-oo", "com.hazelcast:hazelcast::tests", "io.fabric8:zjsonpatch", "org.apache.karaf.itests:itests:test-jar", "com.typesafe.play:play-streams_2.11", "org.apache.aries.blueprint:org.apache.aries.blueprint.core", "org.apache.felix:org.apache.felix.scr.annotations", "org.eclipse.neoscada.base:org.eclipse.scada.sec.utils", 
"com.hazelcast:hazelcast-hibernate5:jar", "org.cogchar:ext.bundle.ontoware", "com.wordnik:swagger-annotations_2.10"]
}
@msrb @miteshvp we need to figure out where this issue originates (data crawlers? Postgres->S3? S3->Graph?). Most packages checked by the script have >= 2 versions stored in the graph DB, so the information about latest_version is available.
@miteshvp not having a latest version seems like a bug, since all packages in our DB should have at least one release.
Since @tisnik found this discrepancy in the staging DB, we want to check whether we have the same problem in prod as well. This looks like a data ingestion bug to me.
FWIW, there is no data for stax:stax-ri in prod S3 buckets, but it is still there in graph. What could be the reason? Second, as I mentioned, I checked package msv:relaxngDatatype, version 20030807, and latest_version is null there. This package was ingested on 2017-07-26 17:12:59.549999. That explains to me why latest_version has no value, based on https://github.com/fabric8-analytics/fabric8-analytics-data-model/blob/e40952bbe33f43c3a871162f40ed6785481f0387/src/graph_populator.py#L94. Although we have a check now: https://github.com/fabric8-analytics/fabric8-analytics-data-model/blob/master/src/graph_populator.py#L153-L155.
@msrb - question to you, is it a good idea to replace latest_version with the version of EPV being synced if latest_version is null or not known?
EDITED: I've pasted the link of bucket separately
> is it a good idea to replace latest_version with the version of EPV being synced if latest_version is null or not known?
I don't think so. We should always have the latest version. Even if package has only one release, then that release is also the latest version. I suspect this is a bug in our pipeline somewhere.
> FWIW, there is no data for stax:stax-ri in prod S3 buckets. But they are still there in graph.
This is super weird. Do we have timestamps in graph? Was this package added to graph last year, pre-summit?
> Do we have timestamps in graph?
2017-03-28 14:29:37.369750
@msrb last_updated property? https://github.com/fabric8-analytics/fabric8-analytics-data-model/blob/e40952bbe33f43c3a871162f40ed6785481f0387/src/graph_populator.py#L100
(btw we really need to use UTC everywhere)
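A minimal sketch of the UTC point: the timestamps quoted in this thread (for example 2017-03-28 14:29:37.369750) carry no timezone, so it is unclear which zone they refer to. Assuming Python on the writer side, timezone-aware values could be produced and parsed roughly like this. The helper names are hypothetical and are not part of graph_populator.py.

```python
from datetime import datetime, timezone

# Format of the naive timestamps quoted in this thread.
TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def utc_now_string():
    """Return the current time as an ISO 8601 string with an explicit UTC offset."""
    return datetime.now(timezone.utc).isoformat()


def parse_naive_as_utc(text):
    """Parse a naive timestamp and label it as UTC.

    Assumption: the stored value was written in UTC. If it was written in
    local time, it has to be converted, not just relabelled.
    """
    return datetime.strptime(text, TS_FORMAT).replace(tzinfo=timezone.utc)
```

Storing the offset explicitly means a reader of the graph does not have to guess which machine's local time produced the value.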
Increasing severity. We need to find out where the problem is and fix it.
@msrb @miteshvp - as we bumped this up to Sev2, we need to find where the problem is and fix it.
@msrb @miteshvp - any update on this, please? Are we focusing on this in the current sprint? cc @krishnapaparaju
@sivaavkd - I will wait for @msrb updates here.
We talked about multiple things here. One by one:
stax:stax-ri in prod graph, but not in S3
No idea, there is no such artifact in Maven Central. Also, the timestamp pointing to March 2017 as the day of ingestion is fishy, as I don't think we had a production deployment back then. My guess is that when we migrated to AWS, people were developing against the "devel" graph instance (the only one we had back then), and that instance was later turned into staging and subsequently into production. @miteshvp am I correct that the stax:stax-ri vertex doesn't actually contain any data in graph? I don't think this is a bug in the pipeline that is currently running in production, as we never call data-importer for non-existent packages. I think we should extend the data-importer API so we can easily clean up oddities like this one. Since @tuxdna recently synced all Maven data from S3 to graph, we could automatically delete all vertices that were not updated in the past 2 months or so.
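The cleanup idea above (drop vertices that were not touched by the recent sync) could be sketched as a selection step like the following. This is only an illustration under assumptions: each vertex is taken to be already fetched as a dict with a `name` and a `last_updated` value in epoch seconds, and the 60-day cutoff stands in for "2 months or so". The actual storage format of `last_updated` in graph is not confirmed in this thread, and the function only selects candidates, it deletes nothing.

```python
from datetime import datetime, timedelta, timezone


def find_stale_vertices(vertices, now, max_age=timedelta(days=60)):
    """Return names of vertices whose last_updated is older than max_age.

    Assumptions (not confirmed by the thread): vertices are dicts, and
    last_updated holds epoch seconds. Vertices without a timestamp are
    reported too, so the caller can review them separately.
    """
    cutoff = now - max_age
    stale = []
    for vertex in vertices:
        raw = vertex.get("last_updated")
        if raw is None:
            stale.append(vertex["name"])
            continue
        updated = datetime.fromtimestamp(float(raw), tz=timezone.utc)
        if updated < cutoff:
            stale.append(vertex["name"])
    return stale
```

Keeping selection separate from deletion makes it possible to review the candidate list before anything is removed from the graph.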
latest_version property
The latest_version property can be missing for many packages that were ingested around summit/summer last year and have not been updated since. AFAIK, the latest_version key was previously provided by Anitya (meta.json on S3), but we dropped Anitya completely as it was not reliable. Vertices should now have a libio_latest_version property; latest_version is deprecated.
Note that packages that were not updated (meaning there was no new release) in the past ~6 months can still have an empty latest_version and no libio_latest_version. If we could get the list of such packages (without the libio_latest_version property) from graph, we can "fix" them by forcing a new analysis. @miteshvp Is this anything you could help with?
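The list of packages without libio_latest_version could probably be obtained with a Gremlin traversal. The sketch below only builds a request body in the JSON shape accepted by Gremlin Server's HTTP endpoint (`gremlin` plus `bindings`) and sends nothing. The property names `vertex_label`, `ecosystem` and `name` are assumptions about the graph schema and should be checked against the data model before use.

```python
import json


def missing_libio_query(ecosystem):
    """Build a Gremlin HTTP request body listing packages without libio_latest_version.

    Assumed schema (hypothetical): package vertices carry vertex_label='Package',
    plus 'ecosystem' and 'name' properties.
    """
    query = (
        "g.V().has('vertex_label', 'Package')"
        ".has('ecosystem', eco)"
        ".hasNot('libio_latest_version')"
        ".values('name')"
    )
    # The ecosystem is passed as a binding rather than spliced into the query text.
    return {"gremlin": query, "bindings": {"eco": ecosystem}}


payload = json.dumps(missing_libio_query("maven"))
```

Note that `hasNot` only matches vertices where the property is absent. Vertices where it is present but set to an empty string would need an extra check.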
To summarize, I don't see any bug that we could fix here. The latest_version property should be ignored, as it is deprecated and Anitya is no longer part of our deployment.
@tisnik does this answer all your questions?
TL;DR:
The latest_version property is deprecated. Please look for libio_latest_version instead.
Feel free to reopen if I missed something here. Thanks for reporting these anomalies :wink:
The latest_version values for packages seem not to be consistent, at least on the Stage Graph database. Sometimes the value is simply set to an empty string.
Reproducer:
Actual output:
Expected output: