mitodl / ocw-data-parser

A parsing script for MIT OpenCourseWare course data

ans7870 files #164

Open pdpinch opened 2 years ago

pdpinch commented 2 years ago

In legacy OCW, some static resource files are stored separately from the Plone data, in an S3 bucket (link). The files are recognizable by links that start with the path /ans7870/ followed by department number, course number and term/year abbreviation.
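As a rough illustration of the link shape described above, here is a minimal sketch that splits an ans7870 URL into its parts. The function name `parse_ans7870_link`, the `Ans7870Link` tuple, and the regular expression are hypothetical and not taken from ocw-data-parser; the pattern simply assumes the first three path segments after /ans7870/ are department, course and term.

```python
import re
from typing import NamedTuple, Optional
from urllib.parse import urlparse

# Assumed layout: /ans7870/<department>/<course>/<term>/<rest of path>
ANS7870_PATTERN = re.compile(
    r"^/ans7870/(?P<department>[^/]+)/(?P<course>[^/]+)/(?P<term>[^/]+)/(?P<rest>.+)$"
)


class Ans7870Link(NamedTuple):
    """Hypothetical container for the parts of an ans7870 link."""
    department: str
    course: str
    term: str
    rest: str


def parse_ans7870_link(url: str) -> Optional[Ans7870Link]:
    """Return the parsed parts of an ans7870 link, or None if it is not one."""
    # urlparse handles both absolute URLs and bare paths.
    path = urlparse(url).path
    match = ANS7870_PATTERN.match(path)
    if match is None:
        return None
    return Ans7870Link(**match.groupdict())
```

For a link such as /ans7870/CMS/CMS.611/f14/games/heatwave/index.html, this would yield department "CMS", course "CMS.611", term "f14", and the remaining path "games/heatwave/index.html".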

Currently in ocw-parser we attempt to copy the files that are linked into studio, but this strategy is a) not working and b) won't work for some file types.

ans7870 storage was used in the legacy site for (at least) two different reasons, each of which requires a different strategy.

1) There was a policy for static files over 10MB that they should be stored in ans7870 because the Plone CMS couldn't handle large binary files.

2) From time to time, course authors published a multi-file web structure -- a "mini-site" -- typically in HTML, but sometimes Java or Flash (swf). There was no support for this in Plone, so collections of files would be dropped on ans7870.

To address these cases, we should do the following:

1) One-off static files, in particular PDFs, should be treated the same as any other site resource. They should be copied to the correct location for resource files, and we should create enough minimal metadata so that they get a resource page. Links should be changed to the appropriate resource links.

2a) In many cases, the courses that link out to complex "mini-sites" in HTML should be reauthored, but that is out of scope of this issue.

2b) For "mini-sites" that can't be reauthored, we can leave the links as is and use a Fastly redirect (similar to one that already exists on ocw-rc.odl.mit.edu) to route requests for those files to S3. The URLs do not need to be rewritten in these cases.

3) The hard part may be telling the difference between these two cases. The best option right now is to base the decision on file extension.
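A minimal sketch of the extension-based decision might look like the following. The function name `classify_ans7870_link` and both extension sets are assumptions chosen for illustration (PDFs for one-off resources; HTML, Java and Flash files for mini-sites, as mentioned above); the real lists would need to be worked out from the actual files in the bucket.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Assumed extension lists, for illustration only.
STATIC_RESOURCE_EXTENSIONS = {".pdf", ".zip", ".mp3", ".mp4"}
MINI_SITE_EXTENSIONS = {".html", ".htm", ".swf", ".jar", ".class"}


def classify_ans7870_link(url: str) -> str:
    """Classify an ans7870 link as 'resource', 'mini_site', or 'unknown'."""
    # Compare on the lowercased extension of the URL path only,
    # ignoring any query string or fragment.
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    if suffix in STATIC_RESOURCE_EXTENSIONS:
        # Case 1: copy the file and give it a resource page.
        return "resource"
    if suffix in MINI_SITE_EXTENSIONS:
        # Case 2: leave the link as is and rely on the redirect.
        return "mini_site"
    # Anything else needs a human decision.
    return "unknown"
```

Returning "unknown" instead of guessing keeps unexpected file types visible, so they can be reviewed or left to reauthoring.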

pdpinch commented 2 years ago

A simpler option that should be considered is to just stop attempting to recover these ans7870 files via ocw-parser and leave them all to reauthoring efforts.

pdpinch commented 2 years ago

Here's an example of case 2)

legacy: https://ocw.mit.edu/courses/comparative-media-studies-writing/cms-611j-creating-video-games-fall-2014/projects/heat-wave/

working page & link in next gen: https://ocwnext.odl.mit.edu/courses/cms-611j-creating-video-games-fall-2014/pages/projects/heat-wave/

apparently working mini site in next gen: https://ocwnext.odl.mit.edu/ans7870/CMS/CMS.611/f14/games/heatwave/index.html

Note that having ocw-data-parser copy index.html to s3://open-learning-course-data-production/cms-611j-creating-video-games-fall-2014/ serves no purpose.