italia / developers-italia-api

API for the developers.italia.it public software collection
https://api.developers.italia.it
GNU Affero General Public License v3.0

API for retrieving metrics on software catalog #174

Open biancini opened 5 years ago

biancini commented 5 years ago

To support the ongoing work on metrics for Developers Italia, it would be great to have a JSON API that exposes the following data:

If the crawler has the data, it would also be great for this JSON API to expose how these numbers have evolved over time (since the beginning of Developers Italia). The output could take this form:

{
  "2017-07-21T00:00:00Z": {
    "num_software_pa": 30,
    "num_software_thirdparty": 4,
    "num_administrations": 5,
    "mean_vitality": 0.67
  },
  ...
  "2018-08-22T00:00:00Z": {
    "num_software_pa": 30,
    "num_software_thirdparty": 4,
    "num_administrations": 5,
    "mean_vitality": 0.67
  }
}

libremente commented 4 years ago

Let's have the crawler query the ES and output the results in a JSON file. Such a file will be public in a directory served by nginx. See https://github.com/italia/developers.italia.it/issues/406 as a reference to such a public dir.
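
A minimal sketch of that approach, assuming the crawler has already computed the counts (the Elasticsearch query is left out). The `append_snapshot` helper, the `metrics.json` file name, and the sample numbers are hypothetical; the timestamp keys follow the format proposed above.

```python
import json
import os
import tempfile
from datetime import datetime, timezone


def append_snapshot(path, metrics, now=None):
    """Add one timestamped metrics entry to the JSON file at `path`."""
    now = now or datetime.now(timezone.utc)
    key = now.strftime("%Y-%m-%dT00:00:00Z")

    # Load the existing history, or start a new one
    try:
        with open(path, encoding="utf-8") as f:
            history = json.load(f)
    except FileNotFoundError:
        history = {}

    history[key] = metrics

    # Write to a sibling file, then rename it over the target, so the
    # web server never serves a half-written file
    tmp_path = f"{path}.tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(history, f, indent=2, sort_keys=True)
    os.replace(tmp_path, path)
    return history


# Stand-in for the public directory served by nginx
public_dir = tempfile.mkdtemp()
target = os.path.join(public_dir, "metrics.json")

# Made-up sample values, only to show the resulting file
append_snapshot(target, {"num_software_pa": 30, "num_administrations": 5})
with open(target, encoding="utf-8") as f:
    print(f.read())
```

Running this from a scheduled job would add one entry per day, overwriting the entry if it runs twice on the same day.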

sebbalex commented 4 years ago

@biancini we could close this issue, since the point was addressed by https://github.com/italia/developers.italia.it/issues/406. What do you think?

libremente commented 4 years ago

@sebbalex I believe the solution that is in use right now is not an API so I would leave this open for future improvements. I still believe it could be nice to have an actual and proper API for this.

bfabio commented 1 year ago

Moving the issue to developers-italia-api

bfabio commented 5 months ago

This should be doable now with something like:

import json
from collections import defaultdict

import requests
import yaml

API_BASE_URL = "https://api.developers.italia.it/v1"

def get_paginated(resource: str):
    """Fetch every item of a resource, following the 'next' pagination links."""
    items = []
    page_after = ""

    while True:
        res = requests.get(f"{API_BASE_URL}/{resource}?all=true&{page_after}")
        res.raise_for_status()

        body = res.json()
        items += body["data"]

        next_link = body["links"]["next"]
        if not next_link:
            return items

        # Remove the leading '?' so the link can be appended after '&'
        page_after = next_link[1:]

software = get_paginated("software")
publishers = get_paginated("publishers")

by_date = defaultdict(
    lambda: {
        "num_software_pa": 0,
        "num_software_thirdparty": 0,
        "num_administrations": 0,
    }
)

for s in software:
    date = s["createdAt"][:10]
    try:
        publiccode = yaml.safe_load(s["publiccodeYml"])
        if publiccode.get("it", {}).get("riuso", {}).get("codiceIPA"):
            by_date[date]["num_software_pa"] += 1
        else:
            by_date[date]["num_software_thirdparty"] += 1
    except Exception:
        # Skip software whose publiccode.yml is missing or cannot be parsed
        pass

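# num_administrations is a running total, so walk publishers in creation order
administrations = set()
for publisher in sorted(publishers, key=lambda p: p["createdAt"]):
    if publisher.get("alternativeId"):
        administrations.add(publisher["id"])
        date = publisher["createdAt"][:10]
        by_date[date]["num_administrations"] = len(administrations)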

# Emit one entry per date, in chronological order
print(json.dumps([{date: counts} for date, counts in sorted(by_date.items())], indent=4))

but I wouldn't turn this into an API endpoint, as the data is easily available without hardcoding the metrics.
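
Note that the software counts in the script are per creation date, while the original request asked for totals over time. A minimal sketch of turning them into running totals, assuming `by_date` maps `YYYY-MM-DD` strings to per-day counts as in the script; `cumulative_by_date` is a hypothetical helper and the sample values are made up:

```python
def cumulative_by_date(by_date, keys=("num_software_pa", "num_software_thirdparty")):
    """Return {date: counts} where the given keys are running totals.

    Hypothetical helper: `by_date` is assumed to map ISO dates
    (YYYY-MM-DD) to dicts of per-day counts. Keys not listed in
    `keys` (e.g. num_administrations, already a running total in
    the script) are copied through unchanged.
    """
    totals = dict.fromkeys(keys, 0)
    result = {}
    # ISO dates sort chronologically when sorted as strings
    for date in sorted(by_date):
        for key in keys:
            totals[key] += by_date[date].get(key, 0)
        result[date] = {**by_date[date], **totals}
    return result


# Made-up sample values, deliberately out of order
sample = {
    "2018-08-22": {"num_software_pa": 2, "num_software_thirdparty": 1},
    "2017-07-21": {"num_software_pa": 3, "num_software_thirdparty": 0},
}
print(cumulative_by_date(sample))
```

Dates with no new software or publishers do not appear in the result, so a consumer that needs one point per day would have to carry the last known totals forward.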