opendataConcordiaU / documentation

Documentation and examples on the use of Concordia University open data API

Courses API returning duplicate courses #8

Closed stefanrusso closed 1 year ago

stefanrusso commented 4 years ago

When calling the courses API, most courses are being returned multiple times. I've tested it with COMP and SOEN courses, but both result in duplicate courses.

Call: https://opendata.concordia.ca/API/v1/course/catalog/filter/COMP/*/UGRD

Resulting JSON: https://pastebin.com/6LGKqnKB

opendataConcordiaU commented 4 years ago

I'll look into this, thank you for reporting the issue. I'll ask the data owners once things get back to normal. I'm leaving the issue open for the time being.

SpencerMartel commented 2 years ago

Wondering if this ever got resolved; I'm dealing with the same issue currently.

volovikariel commented 2 years ago

Still doesn't appear to be resolved. See: https://opendata.concordia.ca/API/v1/course/catalog/filter/COMP/201/UGRD

SpencerMartel commented 2 years ago

Nope. Here's some Python code to clean it if you'd like:

    def clean_duplicate_data(working_data):
        print('Entering clean_duplicate_data function')
        clean_data = []
        for obj in working_data:
            if clean_data.__contains__(obj):
                del obj
            else:
                obj_copy = obj.copy()
                clean_data.append(obj_copy)
        print('Data is cleaned (removed of duplicates)')
        return clean_data

volovikariel commented 2 years ago

Thanks! Just some questions: why are you deleting objects from working_data if you're only going to be using clean_data anyway? Same goes for obj_copy: couldn't we just append obj to the clean_data list directly, like this?

    def clean_duplicate_data(working_data):
        print('Entering clean_duplicate_data function')
        clean_data = []
        for obj in working_data:
            if obj not in clean_data:
                clean_data.append(obj)
        print('Data is cleaned (removed of duplicates)')
        return clean_data

Maybe we can just check whether the unique ID is already present in a seen set and avoid the dict comparisons entirely; I feel like this would be faster! The problem is that this won't work for dicts whose only unique key is the ID (or that have no unique key at all).

    def clean_duplicate_data(working_data):
        print('Entering clean_duplicate_data function')
        clean_data = []
        seen_ids = set()
        for obj in working_data:
            if obj.get("ID") not in seen_ids:
                clean_data.append(obj)
                seen_ids.add(obj.get("ID"))
        print('Data is cleaned (removed of duplicates)')
        return clean_data

Note: a friend noticed that an issue arises when a record we get has no ID: it gets added to the clean_data list anyway, so add an `'ID' in obj` check if you wish.
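
Folding that check in, the loop might look something like this (a sketch only; the function name and the assumption that course records are keyed on an `"ID"` field are illustrative, not taken from the API docs):

```python
def clean_duplicate_data(working_data):
    """Deduplicate a list of course dicts by their "ID" field.

    Records without an "ID" key are kept as-is, since there is
    nothing reliable to dedupe them on.
    """
    clean_data = []
    seen_ids = set()
    for obj in working_data:
        if "ID" not in obj:
            # No unique key: keep the record rather than risk dropping it.
            clean_data.append(obj)
        elif obj["ID"] not in seen_ids:
            clean_data.append(obj)
            seen_ids.add(obj["ID"])
    return clean_data
```

This keeps the O(1) set lookups from the version above while sidestepping the missing-ID edge case.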

opendataConcordiaU commented 1 year ago

The issue was fixed at the endpoint level. The response is now deduplicated. Closing the issue.