mitodl / edx2bigquery

Tool to convert & load data from edX platform into BigQuery
GNU General Public License v2.0
29 stars 29 forks source link

Supporting the new course ID format? #52

Open natea opened 8 years ago

natea commented 8 years ago

Does edx2bigquery support the new way of formatting course IDs?

course-v1:MITx+24.00x+2013_SOND

or does it only support the old style way of representing course IDs?

MITx/24.00x/2013_SOND

When I try to load a course using the new format, I get an error:

Error!  Invalid course_id:
  BAD --> course-v1:MITx+24.00x+2013_SOND

Note: I'm not actually using that ID but a different ID, but it has the same format as this one.

natea commented 8 years ago

It looks like these two spots in the code are expecting a "/" delimiter in the course ID, rather than a '+' character as the delimiter.

https://github.com/mitodl/edx2bigquery/blob/master/edx2bigquery/main.py#L28 https://github.com/mitodl/edx2bigquery/blob/master/edx2bigquery/load_course_sql.py#L55

natea commented 8 years ago

Supporting the new course ID format was pretty easy with this two changes:

diff --git a/edx2bigquery/load_course_sql.py b/edx2bigquery/load_course_sql.py
index a2df6f8..680e0f4 100644
--- a/edx2bigquery/load_course_sql.py
+++ b/edx2bigquery/load_course_sql.py
@@ -52,7 +52,7 @@ def find_course_sql_dir(course_id, basedir, datedir=None, use_dataset_latest=Fal
         if not os.path.exists(lfp):
             # maybe course directory doesn't have the initial ORG- prefix (Harvard's local convention)
             olfp.append(lfp)
-            lfp = (basedir or '.') / course_id.split('/',1)[1].replace('/','-')
+            lfp = (basedir or '.') / course_id.split('+',1)[1].replace('+','-')
             if not os.path.exists(lfp):
                 msg = "Error!  Cannot find course SQL directory %s or %s" % (olfp , lfp)
                 print msg
diff --git a/edx2bigquery/main.py b/edx2bigquery/main.py
index c016484..8dc2262 100644
--- a/edx2bigquery/main.py
+++ b/edx2bigquery/main.py
@@ -25,7 +25,7 @@ else:
     print "WARNING: edx2bigquery needs a configuration file, ./edx2bigquery_config.py, to operate properly"

 def is_valid_course_id(course_id):
-    if not course_id.count('/')==2:
+    if not course_id.count('+')==2:
         return False
     return True

But now the load_course_sql script is looking for a users.csv file. I downloaded the one from the /sysadmin dashboard, but this doesn't seem to be the right format, because i'm now getting an error which seems to indicate that it's looking for an ID column of type integer.

Traceback (most recent call last):
  File "/Users/nateaune/Dropbox/code/edx2bigquery/edx2bigquery/main.py", line 136, in run_parallel_or_serial
    ret = function(param, course_id, optargs)
  File "/Users/nateaune/Dropbox/code/edx2bigquery/edx2bigquery/main.py", line 276, in setup_sql_single
    setup_sql(param, {}, "setup_sql", course_id)
  File "/Users/nateaune/Dropbox/code/edx2bigquery/edx2bigquery/main.py", line 236, in setup_sql
    use_dataset_latest=param.use_dataset_latest,
  File "/Users/nateaune/Dropbox/code/edx2bigquery/edx2bigquery/make_user_info_combo.py", line 140, in process_file
    uid = int(line['id'])
KeyError: 'id'

Is there a sample data dir that shows all the files that the scripts expect to be there, and the format they should be in?

ichuang commented 8 years ago

Hi Nate,

Try by starting with the "waldofy" command. This normalizes the SQL from edX, including renaming files to match HarvardX's standard.

edx2bigquery and XAnalytics both stick to using the "slash separated course_id" format. The funky "opaque keys" v1 + version is unsupported, because, to this point, it's been unnecessary. All the opaque format course_id's are converted to slash separated keys, during ingestion. See, for example,

https://github.com/mitodl/edx2bigquery/blob/master/edx2bigquery/split_and_rephrase.py#L107

in addition to the lines in load_course_sql.py. Note that XAnalytics also depends on the course_id being in slash separated format.

A welcome contribution would be to do this conversion uniformly, e.g. using the opaque keys library.

natea commented 7 years ago

Thanks Ike. I did run the waldofy command, but still having problems. I think what would be really helpful would be to provide a sample directory structure of the directory names that need to exist (below db_backups), and a matching edx2bigquery.conf file.

In other words, could you go into a working implementation of edx2bigquery, and run the "tree" command to show what directories and files need to be in the tree in order for the cmd to execute properly.

I'm getting hung up on the naming, so this would be really useful from a documentation perspective, to avoid all the trial and error that I'm having to do.