openshiftio / openshift.io

Red Hat OpenShift.io is an end-to-end development environment for planning, building and deploying modern applications.
https://openshift.io

Perform integration tests for data ingestion related components of OSIO analytics #1329

Open krishnapaparaju opened 7 years ago

krishnapaparaju commented 7 years ago

User story

As a fabric8-analytics developer, I want to run all available integration/E2E tests for the data ingestion part of the pipeline on every merge to the master branch. This will help me catch bugs early so I can fix them before they get promoted to production.

Description

Currently, integration tests are either incomplete or not enabled for the components involved in the data gathering and ingestion parts of the OSIO analytics architecture. The starting point of these integration tests would be ingestion of data from public sources, and the end point would be landing the processed data in S3, in the graph, or in both destinations.

Note: tests for the data ingestion part of the pipeline already exist, but they are not enabled in CI because CI is missing the credentials needed to run them.
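A rough sketch of what one such end-to-end test could look like; the helper module, its functions, and the 15-minute timeout are placeholders for illustration, not existing code:

import time

import pytest

# hypothetical helpers that would wrap the ingestion API, S3, and the gremlin endpoint
from e2e_helpers import trigger_ingestion, s3_has_processed_data, graph_has_version


@pytest.mark.integration
def test_maven_package_ingestion():
    submitted_at = time.time()
    trigger_ingestion(ecosystem="maven", package="io.vertx:vertx-core", version="3.4.2")

    # the pipeline is asynchronous, so poll both destinations before giving up
    deadline = time.time() + 15 * 60
    while time.time() < deadline:
        if (s3_has_processed_data("maven", "io.vertx:vertx-core", "3.4.2")
                and graph_has_version("maven", "io.vertx:vertx-core", "3.4.2",
                                      newer_than=submitted_at)):
            return
        time.sleep(30)
    pytest.fail("processed data did not land in S3 and the graph within 15 minutes")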

Acceptance criteria

Tasks

Work in progress


Done

msrb commented 7 years ago

Current status:

Note: I suspect that testing whether data landed in the graph might be tricky, as the tests run against the staging environment, where the graph is already populated. cc @tisnik

miteshvp commented 7 years ago

Note: I suspect that testing whether data landed in the graph might be tricky

We can use the last_updated key for Package and Version. It is the time elapsed in seconds since the epoch. We can check whether last_updated > our expected time (the submission time). EDITED: last_updated is maintained on both the Package and the Version node.
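A minimal sketch of that check, assuming the test has already fetched one vertex property map from a gremlin response (like the ones pasted further down in this thread); the function name is made up:

def assert_vertex_refreshed(vertex, submitted_at):
    # vertex: one entry from result["data"] of a gremlin response
    # submitted_at: Unix timestamp (time.time()) taken right before ingestion was triggered
    last_updated = float(vertex["last_updated"][0])
    assert last_updated > submitted_at, \
        "last_updated %r is not newer than submission time %r" % (last_updated, submitted_at)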

tisnik commented 7 years ago

Note: I suspect that testing whether data landed in the graph might be tricky

That brings up another question: how well is this part covered by unit tests? Could we rely on the existing internal API?

tisnik commented 6 years ago

@miteshvp in order to finish this task I'd need to know the structure of data stored in our graph DB. Could you please point me to (any) documentation about this topic? Thank you in advance!

miteshvp commented 6 years ago

@tisnik - we do not have any documentation around the structure. I am pasting responses below for your use. Let me know if you need more information.

Package - io.vertx:vertx-core

{
    "requestId": "ea0d940c-bb7a-45b1-8cd7-aad88b0976e0",
    "status": {
        "message": "",
        "code": 200,
        "attributes": {}
    },
    "result": {
        "data": [{
            "gh_issues_last_month_opened": [-1],
            "gh_prs_last_year_closed": [-1],
            "libio_usedby": ["TechEmpower/FrameworkBenchmarks:2891", "apiman/apiman:345", "boonproject/boon:473", "hawkular/hawkular-apm:132", "isaiah/jubilee:342", "jbosstm/narayana:76", "jhalterman/failsafe:1795", "vert-x3/vertx-stack:78", "wildfly-swarm/wildfly-swarm:190", "wisdom-framework/wisdom:72"],
            "ecosystem": ["maven"],
            "gh_subscribers_count": [570],
            "gh_contributors_count": [30],
            "vertex_label": ["Package"],
            "libio_dependents_repos": ["4.75K"],
            "last_updated_sentiment_score": ["2017-10-09"],
            "sentiment_magnitude": ["0"],
            "gh_issues_last_year_opened": [-1],
            "gh_issues_last_month_closed": [-1],
            "gh_open_issues_count": [184],
            "libio_dependents_projects": ["128"],
            "latest_version": ["3.4.1"],
            "tokens": ["core", "io", "vertx"],
            "package_relative_used": ["not used"],
            "gh_stargazers": [6946],
            "gh_forks": [1274],
            "package_dependents_count": [-1],
            "gh_prs_last_month_opened": [-1],
            "gh_issues_last_year_closed": [-1],
            "sentiment_score": ["0"],
            "last_updated": [1.51178887579E9],
            "gh_prs_last_month_closed": [-1],
            "libio_total_releases": ["48"],
            "gh_prs_last_year_opened": [-1],
            "name": ["io.vertx:vertx-core"],
            "libio_latest_version": ["3.5.0.Beta1"],
            "libio_latest_release": [1.5020442E9]
        }],
        "meta": {}
    }
}

Version 3.4.2

{
    "requestId": "50aa304a-d7ae-4aed-a847-65600cc6e3f3",
    "status": {
        "message": "",
        "code": 200,
        "attributes": {}
    },
    "result": {
        "data": [{
            "last_updated": [1.50823601887E9],
            "shipped_as_downstream": [false],
            "pname": ["io.vertx:vertx-core"],
            "vertex_label": ["Version"],
            "description": ["Sonatype helps open source projects to set up Maven repositories on https://oss.sonatype.org/"],
            "version": ["3.4.2"],
            "dependents_count": [11],
            "licenses": ["Apache 2.0", "EPL 1.0", "MIT License"]
            "declared_licenses": ["Eclipse Public License - v 1.0", "The Apache Software License, Version 2.0"],
            "pecosystem": ["maven"],
            "osio_usage_count": [6]
        }],
        "meta": {}
    }
}

tisnik commented 6 years ago

Thanks @miteshvp. If I understand correctly, these are the results generated by the following queries:

g.V().has("name", "io.vertx:vertx-core").has("ecosystem", "maven")

and

g.V().has("pname": "io.vertx:vertx-core").has("version", "3.4.2").has("pecosystem", "maven")
tisnik commented 6 years ago

Btw, is last_updated really supposed to be a float value? I'm pretty sure it should be int64 or uint64.

miteshvp commented 6 years ago

@tisnik - Your queries are right; that's what I used to generate the responses. last_updated is a double value. If you look at it closely, it has E9 at the end.

tisnik commented 6 years ago

Thanks a lot @miteshvp for the clarification.

re double value: yeah, I know it's a double, but I was wondering why it's serialized this way, with loss of precision. For storing the last_updated attribute, str(time.time()) is used, and this call returns a proper Unix time, with or without decimal digits (it's system dependent):

>>> import time
>>> str(time.time())
'1511895953.1909728'

It would be interesting to know where the precision (at least four decimal digits) is lost - during the store operation, in the JSON serialization or somewhere in the middle?
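
For what it's worth, one plausible place to lose exactly those digits: Python 2's str() (and any %.12g-style formatting) keeps only 12 significant digits of a double, while Python 3's str() round-trips the full value:

t = 1511895953.1909728

"%.12g" % t  # -> '1511895953.19'  (12 significant digits, like 1.51178887579E9 above)
str(t)       # -> '1511895953.1909728' on Python 3, '1511895953.19' on Python 2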

(FYI: I'd need to look closely at the schema, as some attributes have strange types :)

miteshvp commented 6 years ago

@tisnik - is this card still blocked? Please let me know if you have more questions; otherwise I suggest removing the label accordingly. Thanks.

tisnik commented 6 years ago

@miteshvp no, it is no longer blocked. TY

tisnik commented 6 years ago

To be able to successfully develop, debug, and run these tests, the following issue needs to be resolved: [f8a] data-importer: The import failed: 'status' #1526

miteshvp commented 6 years ago

@tisnik - https://github.com/fabric8-analytics/fabric8-analytics-data-model/pull/64

miteshvp commented 6 years ago

@tisnik - are you still blocked?

tisnik commented 6 years ago

@miteshvp your changes have been deployed to stage today and everything works. TY, I'm clearing the status now :)

msrb commented 6 years ago

Marking as blocked as we are waiting for AWS creds for CI.

msrb commented 6 years ago

Still waiting for credentials to land in CI.

kbsingh commented 6 years ago

Who is this blocked on?

tisnik commented 6 years ago

@kbsingh we need the S3 credentials to be used on CI (i.e. we need to know the hashes of the 'real' credentials).

msrb commented 6 years ago

@tisnik are we still waiting for the credentials to be available in CI?

sivaavkd commented 6 years ago

@tisnik - are we still blocked? Can you bring it up in today's standup?
@msrb - please let me know if you need help to unblock @tisnik.

msrb commented 6 years ago

@tisnik are we still blocked here?

msrb commented 6 years ago

I am going to move this issue to the backlog. I will talk to @tisnik and we will come up with a new way of running these kinds of tests.