mitodl / ocw-data-parser

A parsing script for MIT OpenCourseWare course data

parse Archived versions #141

Closed pdpinch closed 3 years ago

pdpinch commented 3 years ago

part of #53

When a course is archived, some additional metadata is added to the course and also to its successor. This metadata allows us to build the "Archived OCW Versions" list on a course home page (CHP).

For the archived course, a dspace_handle is added. For example, in s3://ocw-content-storage/PROD/18/18.01/Fall_2003/18-01-single-variable-calculus-fall-2003/0/1.json:

    "dspace_handle": "hdl://1721.1/34901", 

For the successor course, a feature is added to features_tracking. For example, in s3://ocw-content-storage/PROD/18/18.01/Fall_2005/18-01-single-variable-calculus-fall-2005/0/1.json:

    "features_tracking": [
        {
            "ocw_feature": "Previous version", 
            "ocw_subfeature": "", 
            "ocw_feature_url": "http://hdl.handle.net/1721.1/34901", 
            "ocw_feature_notes": "Fall 2003 version of the course", 
            "ocw_speciality": ""
        }, 

An is_update_of field is also added (not sure why this is a list):

    "is_update_of": [
        "bf3718587cf35b02dc6b287533a391d9"
    ], 

The combination of the features_tracking (with the handle.net URL) and the is_update_of should be sufficient to construct the necessary link URL and link text for the successor course CHP.
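As a rough illustration of the features_tracking half of that idea, here is a minimal sketch that pulls link URLs and link text out of the "Previous version" entries. The function name previous_version_links is hypothetical, and the sketch assumes that ocw_feature_notes is acceptable as link text, which the issue does not confirm.

```python
def previous_version_links(course_json):
    """Return (url, text) pairs for "Previous version" features.

    Hypothetical helper: assumes ocw_feature_notes can serve as link text.
    """
    links = []
    # features_tracking may be missing or null in the raw JSON (assumption)
    for feature in course_json.get("features_tracking") or []:
        if feature.get("ocw_feature") != "Previous version":
            continue
        url = feature.get("ocw_feature_url")
        if not url:
            continue  # skip entries without a usable URL
        links.append((url, feature.get("ocw_feature_notes", "")))
    return links


# Sample data copied from the 18.01 Fall 2005 excerpt above
course = {
    "features_tracking": [
        {
            "ocw_feature": "Previous version",
            "ocw_subfeature": "",
            "ocw_feature_url": "http://hdl.handle.net/1721.1/34901",
            "ocw_feature_notes": "Fall 2003 version of the course",
            "ocw_speciality": "",
        }
    ],
    "is_update_of": ["bf3718587cf35b02dc6b287533a391d9"],
}

print(previous_version_links(course))
```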

pdpinch commented 3 years ago

I find the data structure in the raw JSON / Plone almost impenetrable. If you want to refactor it into something easier to follow, it would make me happy.

noisecapella commented 3 years ago

I don't think we can do much in ocw-data-parser to refactor this since we treat courses independently. But we can do that in ocw-to-hugo when we produce the other versions text

pdpinch commented 3 years ago

Does the necessary data show up in the processed JSON so that we can use it in ocw-to-hugo?

noisecapella commented 3 years ago

Not yet but I'm planning on a PR for that. It's just going to pass through dspace_handle, features_tracking, and the first item of is_update_of. I'll make another PR to handle it in ocw-to-hugo
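A pass-through along those lines could look roughly like the sketch below. This is not the code from the PR; the function name archive_fields and the choice of None for missing values are assumptions made for illustration.

```python
def archive_fields(raw_course):
    """Copy the archive-related fields from raw course JSON.

    Hypothetical sketch: passes through dspace_handle and features_tracking
    unchanged, and keeps only the first item of is_update_of.
    """
    is_update_of = raw_course.get("is_update_of") or []
    return {
        "dspace_handle": raw_course.get("dspace_handle"),
        "features_tracking": raw_course.get("features_tracking") or [],
        # None when the course has no predecessor listed (assumed convention)
        "is_update_of": is_update_of[0] if is_update_of else None,
    }


raw = {
    "dspace_handle": "hdl://1721.1/34901",
    "is_update_of": ["bf3718587cf35b02dc6b287533a391d9"],
}
print(archive_fields(raw))
```

Keeping only the first item of is_update_of loses information for any course that lists more than one predecessor.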

noisecapella commented 3 years ago

@pdpinch Do we know how this should show up in the UI yet?

noisecapella commented 3 years ago

Is dspace_handle just an indicator that a course is archived or is there a value that should be parsed from either of the urls?

pdpinch commented 3 years ago

> Do we know how this should show up in the UI yet?

You can see it on https://projects.invisionapp.com/share/QFZ8KA9SH2P#/screens/435953154_OCW_Course_Home_Page_Color_10-27-2020_Course_Info_Collapsed_V2

I hadn't made an issue for ocw-hugo yet because it was waiting for this one to be closed.

noisecapella commented 3 years ago

[Image: Screenshot_20210604-154835]

The 18.01sc on "Other courses" has "scholar" at the end of the name but not for archived courses. Should "scholar" appear on both? And in general should the text be identical if the course is the same?

pdpinch commented 3 years ago

The example in InVision isn't based on real data.

> The 18.01sc on "Other courses" has "scholar" at the end of the name but not for archived courses. Should "scholar" appear on both?

In practice, no scholar course has ever been archived. I wouldn't worry about it.

> And in general should the text be identical if the course is the same?

I'm not sure which text you mean, but if a course has been archived, it should only appear in the "Archived OCW Versions" section.

Put another way, the links under "Archived OCW Versions" should only go to hdl.handle.net URLs

pdpinch commented 3 years ago

> Is dspace_handle just an indicator that a course is archived or is there a value that should be parsed from either of the urls?

The value for dspace_handle is a unique "handle" identifier. It can be parsed to generate the more useful handle.net URL. I would suggest preserving both in the parsed JSON unless you're confident that they are both always present. (Since these rely on data entry in the legacy CMS, I don't know if they have been input consistently)
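For reference, the conversion from a dspace_handle to a handle.net URL can be sketched as below. The only format seen in this issue is "hdl://1721.1/34901"; treating a value without the "hdl://" prefix as a bare handle is an assumption, as is the helper name handle_to_url.

```python
HANDLE_RESOLVER = "http://hdl.handle.net/"


def handle_to_url(dspace_handle):
    """Turn a value like "hdl://1721.1/34901" into a handle.net URL.

    Returns None for missing or empty values, since the field relies on
    manual data entry and may not always be present.
    """
    if not dspace_handle:
        return None
    handle = dspace_handle.strip()
    if handle.startswith("hdl://"):
        handle = handle[len("hdl://"):]
    # Assumption: anything left over is a bare handle such as "1721.1/34901"
    handle = handle.strip("/")
    return HANDLE_RESOLVER + handle if handle else None


print(handle_to_url("hdl://1721.1/34901"))
```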

The ocw_feature_url should be used as the HREF for links in the "archived courses" section, same as on https://ocw.mit.edu/courses/mathematics/18-01-single-variable-calculus-fall-2005/

noisecapella commented 3 years ago

I'm going to close this since https://github.com/mitodl/ocw-to-hugo/issues/274 should handle all remaining work

noisecapella commented 3 years ago

@pdpinch I did some research yesterday and this morning and I don't think features_tracking is reliable to use with is_update_of to create dspace links with course titles. There are some courses which have multiple previous versions in features_tracking, and some which have multiple items in is_update_of. I am not sure we can say for sure that what is described in is_update_of and features_tracking matches up exactly even if there is only 1 previous version and 1 uid. Instead I think we should only use is_update_of and dspace_handle and just leave out the link if the two don't match up. What do you think?
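A minimal sketch of that proposal, for discussion: look up each uid from is_update_of, and emit a link only when the referenced course is known and has a dspace_handle. The courses_by_uid index, the title key, and the function name are all hypothetical; the sample title is illustrative.

```python
def archived_version_links(course, courses_by_uid):
    """Build archived-version links from is_update_of and dspace_handle only.

    courses_by_uid is a hypothetical index mapping a course uid to its JSON.
    Predecessors that are unknown or lack a dspace_handle are left out.
    """
    links = []
    for uid in course.get("is_update_of") or []:
        predecessor = courses_by_uid.get(uid)
        if predecessor is None:
            continue  # uid does not match any course we know about
        handle = predecessor.get("dspace_handle")
        if not handle:
            continue  # predecessor was never archived, so no link
        url = "http://hdl.handle.net/" + handle.replace("hdl://", "", 1)
        # "title" is an assumed key for the course title
        links.append({"url": url, "title": predecessor.get("title", "")})
    return links


courses_by_uid = {
    "bf3718587cf35b02dc6b287533a391d9": {
        "title": "Single Variable Calculus",  # illustrative value
        "dspace_handle": "hdl://1721.1/34901",
    }
}
successor = {"is_update_of": ["bf3718587cf35b02dc6b287533a391d9", "unknown-uid"]}
print(archived_version_links(successor, courses_by_uid))
```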

There are only a small number of dspace links in features_tracking which don't appear in dspace_handle somewhere:

New hdl: 1721.1/120335 21a-120-american-dream-using-storytelling-to-explore-social-class-in-the-united-states-spring-2018
New hdl: 1721.1/121500 6-057-introduction-to-matlab-january-iap-2019
New hdl: 1721.1/121170 6-436j-fundamentals-of-probability-fall-2018
New hdl: 1721.1/75824 6-005-elements-of-software-construction-fall-2011
New hdl: 1721.1/121185 1-258j-public-transportation-systems-spring-2017
New hdl: 1721.1/120336 5-61-physical-chemistry-fall-2017
New hdl: 1721.1/121583 14-381-statistical-method-in-economics-fall-2006
New hdl: 1721.1/121583 14-381-statistical-method-in-economics-fall-2018
New hdl: 1721.1/120951 21g-103-chinese-iii-regular-fall-2018
New hdl: 1721.1/120952 21g-103-chinese-iii-regular-fall-2018
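The script that produced this list is not shown here. A comparable check could be sketched as follows, assuming every course's raw JSON is already loaded into a list and that each course has a short_url key holding its slug (both are assumptions).

```python
HANDLE_URL_PREFIX = "http://hdl.handle.net/"


def unmatched_handles(courses):
    """List (handle, short_url) pairs whose handle appears in some
    features_tracking URL but in no course's dspace_handle."""
    known = set()
    for course in courses:
        handle = course.get("dspace_handle")
        if handle:
            known.add(handle.replace("hdl://", "", 1))

    missing = []
    for course in courses:
        for feature in course.get("features_tracking") or []:
            url = feature.get("ocw_feature_url") or ""
            if not url.startswith(HANDLE_URL_PREFIX):
                continue  # not a handle.net link
            handle = url[len(HANDLE_URL_PREFIX):]
            if handle not in known:
                missing.append((handle, course.get("short_url")))
    return missing


# Tiny made-up data set, for illustration only
courses = [
    {"short_url": "course-a", "dspace_handle": "hdl://1721.1/1"},
    {
        "short_url": "course-b",
        "features_tracking": [
            {"ocw_feature_url": "http://hdl.handle.net/1721.1/1"},
            {"ocw_feature_url": "http://hdl.handle.net/1721.1/2"},
        ],
    },
]
for handle, short_url in unmatched_handles(courses):
    print("New hdl:", handle, short_url)
```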

A fair number of course references in is_update_of don't have a dspace_handle: about 142 according to my script.