mitodl / open-discussions

BSD 3-Clause "New" or "Revised" License

Import and index OCW course files #2551

Closed mbertrand closed 4 years ago

mbertrand commented 4 years ago

As a user, when searching, I'd like to find matches for OCW courses based on their file content.

Acceptance Criteria:

mbertrand commented 4 years ago

@pdpinch @Ferdi do the acceptance criteria seem reasonable to you?

mbertrand commented 4 years ago

FYI counts of some OCW file types likely to contain text:

PDF: 64685
DOC: 12
DOCX: 10
HTML: 0
XLS: 286
XLSX: 74
TXT: 4
RTF: 4

mbertrand commented 4 years ago

@pdpinch @Ferdi The course master_json includes a "course_pages" section with snippets of HTML. These do NOT exist separately on S3 as HTML files. Should they be indexed as course files anyway?

Example:

{
        "uid": "8bf424c425aad9ad262d7dc822c7247d",
        "url": "/courses/architecture/4-105-geometric-disciplines-and-architecture-skills-reciprocal-methodologies-fall-2012/syllabus",
        "text": "<h2 class=\"subhead\">Course Meeting Times</h2> <p>Lectures: 2 sessions /
 week, 2 hours / session</p> <h2 class=\"subhead\">Description</h2> <p>This course is an 
intensive introduction to the disciplines that impact the reciprocity between drawing and making in 
architectural design process, learned primarily through a series of weekly or bi-weekly exercises. 
The pedagogical aim of the course is two-fold.</p> <p>First, the exercises, lectures and 
workshops are designed to impart specific skills associated with the generation and representation
 of designed objects. These skills range from techniques of hand drafting, to generation of 3D 
computer models, physical model building, sketching, diagramming, and computing. The 
conceptual basis of each exercise is in the interrogation of the geometric principles that lie at the 
core of each technique, thus 'generalizing' the specific technique in order to display its wider 
generative possibilities. This process will also serve to exhibit the biases inherent in all drawing 
techniques as well as their relationships to their methods of making. These exercises establish a 
relationship with studio, and anticipate the instruments necessary to approach studio design 
problems.</p> <p>Second, the weekly lectures and pin-ups address the conventions associated 
with various modes of architectural representation, and their capacity to convey ideas. Instances of 
representation throughout the history of architecture will illustrate the relationship between specific
 techniques and the kind of architecture they engender. Pin-ups will address the entire range of 
issues associated with presenting architecture through drawings, including conceptual clarity, 
presentation manner, legibility, and the like.</p> <p>Exercises will require 3&ndash;8 hours of work
 (outside class meeting times) each week to adequately complete. This course is conceived to 
serve design studio rather than interferes with it; therefore students should not exceed 8 hours of
 work per week on the exercises. In general, lectures and assignment presentations will occur on
 Thursdays, with pin-ups and workshops on Tuesdays. Attendance at all class meetings is 
mandatory. Most importantly, be respectful and engaged during fellow students' pin-ups.</p> 
<h2 class=\"subhead\">Prerequisites</h2> <p>Because this course is offered to first-year Master
 of Architecture students, all participants are expected to have a fundamental understanding of 
design principles as well as a substantial skill set in both computer and physical modeling.</p> 
<h2 class=\"subhead\">Required Resources</h2> <p>Students will be asked to employ various 
techniques in modeling and representation. A variety of tools and software are available for the 
design process, and it is up to the student to elect which is most applicable.</p> <p>As a general 
rule, however, students will find themselves using 
<a href=\"http://www.rhino3d.com/\">Rhinoceros<sup>&reg;</sup></a> for computer-modeling,
and <a href=\"http://chaosgroup.com/en/2/vrayforrhino.html\">V-Ray</a> for renderings.</p>",
        "type": "CourseSection",
        "title": "Syllabus",
        "short_url": "syllabus",
        "parent_uid": "0007de9b4a0cd7c298d822b4123c2eaf",
        "description": "This syllabus section provides the course description and information on meeting times, prerequisites, and required resources."
    }

Ferdi commented 4 years ago

@mbertrand they should be indexed as well, the same as PDF files.
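For illustration only, here is a minimal sketch of how a "course_pages" entry like the one above might be turned into plain text before indexing. The field names (uid, title, url, text) come from the example; the function name page_to_document and the output "content" key are hypothetical and not taken from the open-discussions code.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML snippet and drop the tags."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &ndash; for us.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def page_to_document(page):
    """Build an indexable dict from one course_pages entry (hypothetical shape)."""
    parser = _TextExtractor()
    parser.feed(page.get("text") or "")
    parser.close()
    # Join text nodes with spaces, then collapse runs of whitespace.
    content = " ".join(" ".join(parser.chunks).split())
    return {
        "uid": page["uid"],
        "title": page.get("title"),
        "url": page.get("url"),
        "content": content,
    }
```

Joining text nodes with a space keeps words from adjacent tags from running together, at the cost of an occasional extra space around inline markup, which is usually acceptable for full-text search.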

mbertrand commented 4 years ago

Design for future resource search functionality to keep in mind when determining which fields to index: https://projects.invisionapp.com/share/5UV53L7CS4H#/screens/396517295_OCW_Course_Search-Resource_Search

mbertrand commented 4 years ago

@Ferdi @gumaerc I'm not sure we can calculate the OCW link for course files from the master_json saved to S3. That master_json differs from the original: the "file_location" field for each file is modified to be the S3 URL.

So instead of something like /courses/mathematics/18-01sc-single-variable-calculus-fall-2010/final-exam/MIT18_01SCF10_final.pdf it is /18-01sc-single-variable-calculus-fall-2010/<file_uid>_MIT18_01SCF10_final.pdf

Carey, do you think there is a reliable way to determine the original URL from the modified one?

This doesn't matter for course search results now. In the future, when we enable search for course files, the links will point to S3 instead of OCW.

gumaerc commented 4 years ago

@mbertrand as far as I know, the S3 links are injected by ocw_data_parser when you run setup_s3_uploading followed by upload_all_media_to_s3. When running upload_all_media_to_s3, this function is executed: update_file_location(self.master_json, bucket_base_url + filename, uid), which updates the file_location property to be the S3 link. If you wanted a master.json without the S3 links, I believe you could simply run ocw_data_parser against the source JSON without executing the S3 functions.

Ferdi commented 4 years ago

@mbertrand we should be able to calculate the right link in the future, so it should be fine for now.

mbertrand commented 4 years ago

@gumaerc I'm trying to get all the info needed from the modified master_json saved to S3, and avoid having to read the original master_json from the source bucket as well.

gumaerc commented 4 years ago

Well, just looking at the url structure:

  1. /courses/mathematics/18-01sc-single-variable-calculus-fall-2010 could be pulled from the url property at the root of master.json
  2. final-exam is the course page that is the parent of this file. This could be pulled by looking at the parent_uid property on the course_files object relating to this file, then finding that page by the uid property in course_pages and getting the short_url property
  3. file name can be pulled out of the file_location by stripping out the base url you've constructed so far along with the uid and the underscore
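
The three steps above could be sketched roughly as follows. This is an untested illustration, not the code that was eventually merged: the function name original_ocw_url is hypothetical, and it assumes each file's parent is a direct entry in course_pages and that the S3 file name has the form <file_uid>_<original name>.

```python
def original_ocw_url(master_json, course_file):
    """Rebuild the OCW path for a course file from the S3-modified master_json."""
    # Step 1: the course base path comes from the root "url" property.
    parts = [master_json["url"].rstrip("/")]

    # Step 2: find the parent page by uid and use its short_url.
    pages_by_uid = {p["uid"]: p for p in master_json.get("course_pages", [])}
    parent = pages_by_uid.get(course_file.get("parent_uid"))
    if parent and parent.get("short_url"):
        parts.append(parent["short_url"])

    # Step 3: take the last path segment of file_location and strip
    # the "<uid>_" prefix that was added for the S3 copy.
    basename = course_file["file_location"].rsplit("/", 1)[-1]
    prefix = course_file["uid"] + "_"
    if basename.startswith(prefix):
        basename = basename[len(prefix):]
    parts.append(basename)

    return "/".join(parts)
```

If the parent page cannot be found, the sketch falls back to placing the file directly under the course URL; whether that matches the real OCW layout would need to be checked against actual data.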

mbertrand commented 4 years ago

thanks, I'll give that a try

pdpinch commented 4 years ago

@mbertrand I think we can close this, now that #2584 is merged?