pulibrary / dspace-python

Python Scripts for DSpace Administration

Integrate pandas into thesis import workflow #16

Closed kmcelwee closed 3 years ago

kmcelwee commented 4 years ago

I would propose the use of Jupyter Notebooks for any kind of infrequent data manipulation that members of PUL IT need to perform. More specifically, I think it should largely replace many of the scripts in dspace-python, a group of python scripts used to clean and merge spreadsheets in preparation for thesis migration.

Jupyter

Pros

Inline debugging: Jupyter lets you run one section of code at a time, which makes debugging faster, especially because this code is visited once a year and there are plenty of opportunities for human error in data entry, export, etc.

Inline Markdown: Visiting this code once a year will inevitably test our rusty memories. Jupyter notebooks allow markdown between sections of code to provide reminders, documentation, and answers to frequent questions and problems. Ideally, we'd have a step-by-step pipeline that can coach a developer through this process each year.

Status displays: We can remove the logging features from dspace-python and simply print any status updates immediately after each section of code is run.
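
For instance, a status cell might look something like this (the file and column names here are placeholders, not the real export):

import pandas as pd

df = pd.read_csv('thesis.csv')  # hypothetical export name

# In place of logging calls, print a status line right after the cell runs.
print(f"Loaded {df.shape[0]} rows and {df.shape[1]} columns")
print(f"Rows missing an ID: {df['ID'].isna().sum()}")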

Cons

PRs, version control: Notebooks are stored in a JSON-like structure and do not always render well in GitHub. PRs can also sometimes be unclear.

Code reusability: If we wanted a module of reusable scripts, it would have to live in a separate Python file. Jupyter notebooks are meant for linear coding: functions and classes can of course be defined, but they can only be used within the notebook itself. Usually this isn't much of a problem, because pandas covers almost all of the functions created in dspace-python, and ideally the whole pipeline would happen in one notebook. Notebooks lend themselves less to elegant object-oriented design and more to brute-force scripting. (One way to share helpers anyway is sketched after this list.)

Local setup: To run a Jupyter notebook, a user will need conda installed locally. Google Colab is an option, but I'm not familiar enough with the platform to know what obstacles we might encounter. However, installing conda and running jupyter notebook takes 5 minutes. We'd have to weigh that decision down the road.
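
On the reusability point above, one workable pattern (sketched here with a hypothetical thesis_utils.py module kept next to the notebooks) is to put shared helpers in a plain Python file and import them into each notebook:

# thesis_utils.py (hypothetical shared module)
import pandas as pd

def assert_unique_ids(df: pd.DataFrame, id_col: str = 'ID') -> None:
    """Raise if any value in id_col appears more than once."""
    dupes = df[df[id_col].duplicated(keep=False)]
    if not dupes.empty:
        raise ValueError(f"Duplicate IDs: {sorted(dupes[id_col].unique())}")

Then, in any notebook:

from thesis_utils import assert_unique_ids
assert_unique_ids(df)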


Pandas

I have trouble thinking of any cons to using pandas, given what we have right now. Excel spreadsheets are great for humans to interact with, but CSVs are far more appropriate for data ingestion.
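
For what it's worth, pandas reads both formats, so the Excel exports wouldn't need a separate manual conversion step. A minimal sketch (the file name is a placeholder, and read_excel needs an engine such as openpyxl installed):

import pandas as pd

df = pd.read_excel('thesis.xlsx')     # read the Excel export directly
df.to_csv('thesis.csv', index=False)  # write a CSV for downstream steps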

Pandas is a Python package for efficiently manipulating dataframes. I've reconfigured vireo.py into a Jupyter notebook with pandas. Pandas already contains the functionality and error handling for the following functions written in vireo.py.

For example, _derive_id_hash:

def _derive_id_hash(self):
    id_rows = {}
    for row in self._sheet.iter_rows(min_row=2):
        if not row[self.id_col].value:
            logging.warning("%s: Row has no ID value: %s" % (self.file_name, str.join(',',(str(cell.value).strip() for cell in row))))
        else:
            id = int(row[self.id_col].value)
            if id not in id_rows:
                id_rows[id] = [row]
            elif self.unique_ids:
                raise Exception("%s has duplicate id %s" % (self.file_name, str(id)))
            else:
                id_rows[id].append(row)
    return id_rows

id_rows = df._derive_id_hash()
row = id_rows[SEARCH_ID]

…becomes…

row = df[df['ID'] == SEARCH_ID]
# and if you wanted to double check that all IDs are unique...
assert df['ID'].unique().shape[0] == df['ID'].shape[0]
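
And if that assertion ever fails, the offending rows are easy to pull up; this follow-up assumes the same df and 'ID' column:

# Show every row whose ID occurs more than once.
print(df[df['ID'].duplicated(keep=False)])

# Or, more tersely, pandas has a built-in uniqueness check:
assert df['ID'].is_unique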

Pandas is already tailored to searching, selecting, and filtering, so there's rarely a need to do any heavy lifting. If it sounds like something that should exist, it probably already exists in the package. So when loading a spreadsheet…

def readRestrictions(self, restriction_file, check_id=True):
    restrictions = VireoSheet.createFromExcelFile(restriction_file, unique_ids=False)
    # check that necessary columns are present
    restrictions.col_index_of(VireoSheet.R_STUDENT_NAME, required=True)
    restrictions.col_index_of(VireoSheet.R_TITLE, required=True)
    restrictions.col_index_of(VireoSheet.R_WALK_IN, required=True)
    restrictions.col_index_of(VireoSheet.R_EMBARGO, required=True)
    # check whether restriction IDs make sense,
    # i.e. whether the certs file info matches the thesis sheet info
    if check_id:
        for restr_id in restrictions.id_rows:
            if restr_id not in self.id_rows:
                raise Exception("%s, row with ID %d: there is no such row in %s" % (restrictions.file_name, restr_id, self.file_name))
    return restrictions

…it becomes…

import pandas as pd

df_r = pd.read_csv('restrictions.csv')
df_t = pd.read_csv('thesis.csv')

# make sure that all IDs in restrictions are in the thesis table
assert all([r['ID'] in df_t['ID'].tolist() for (i, r) in df_r.iterrows()])
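
The same membership check can also be written without the Python-level loop, using pandas' vectorized isin, which scales better on large sheets:

# Vectorized equivalent of the loop above.
missing = df_r[~df_r['ID'].isin(df_t['ID'])]
assert missing.empty, f"Restriction IDs with no thesis row: {missing['ID'].tolist()}"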

In tandem with Jupyter notebooks, pandas becomes simple and powerful. CSVs are imported and written in one line, and map, reduce, and filter operations are straightforward and intuitive. Jupyter is built to display pandas tables, making logging and debugging clearer. If there's tabular data that needs to be processed, and it can be processed locally, chances are Jupyter and pandas are your best bet.
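
A few concrete one-liners in that vein (the 'Status' and 'Title' columns are placeholders, not columns from the actual export):

import pandas as pd

df = pd.read_csv('thesis.csv')                # import in one line
df['Title'] = df['Title'].str.strip()         # map: normalize a column
approved = df[df['Status'] == 'Approved']     # filter: select rows
approved.to_csv('approved.csv', index=False)  # write in one line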

Lastly, neither Jupyter nor pandas is going anywhere. Both are industry standards, and if we run into any bugs or issues, plenty of other people will have hit the exact same ones.

jrgriffiniii commented 4 years ago

I completely support the changes proposed here; thank you very much for this pioneering work @kmcelwee!

I would also be curious if, after implementation, something like https://github.com/nteract/papermill could be leveraged to build a larger ETL pipelining system for DSpace package imports.
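
For reference, papermill's core call parameterizes and executes a notebook, so a yearly run could look roughly like this (the notebook paths and parameter names here are hypothetical):

import papermill as pm

pm.execute_notebook(
    'thesis_import.ipynb',            # templated input notebook
    'runs/thesis_import_2020.ipynb',  # executed copy with outputs
    parameters=dict(
        thesis_csv='thesis.csv',
        restrictions_csv='restrictions.csv',
    ),
)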

kmcelwee commented 4 years ago

For now, I'm focusing solely on integrating pandas and creating quality documentation. If down the road we think it would be worthwhile to combine the two in a Jupyter notebook, we can invest the time.

kmcelwee commented 3 years ago

We've decided to focus on Ruby instead.