Closed: kmcelwee closed this issue 3 years ago
I completely support the proposed changes here; thank you very much for this pioneering work, @kmcelwee!
I would also be curious if, after implementation, something like https://github.com/nteract/papermill could be leveraged to build a larger ETL pipelining system for DSpace package imports.
For now, we're focusing solely on integrating pandas and creating quality documentation. If down the road we think it would be worthwhile to combine the two in a Jupyter notebook, we can invest the time.
We've decided to focus on Ruby instead.
I would propose the use of Jupyter Notebooks for any kind of infrequent data manipulation that members of PUL IT need to perform. More specifically, I think they should largely replace many of the scripts in `dspace-python`, a group of Python scripts used to clean and merge spreadsheets in preparation for thesis migration.

Jupyter
Pros
Inline debugging: Jupyter lets you run one section of code at a time, making debugging faster. That matters here because this code is visited once a year, and there are plenty of opportunities for human error in data entry, export, etc.
Inline Markdown: Visiting this code once a year will inevitably test our rusty memories. Jupyter notebooks allow Markdown between sections of code to provide reminders, documentation, and answers to frequent questions and problems. Ideally, we'd have a step-by-step pipeline that coaches a developer through this process each year.
Status displays: We can remove the logging features from `dspace-python` and simply print any status updates immediately after each section of code is run.
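As a minimal sketch of what that looks like in practice (the filename and columns below are hypothetical):

```python
import pandas as pd

# Hypothetical input file standing in for the thesis spreadsheet export
df = pd.read_csv("theses.csv")

# Instead of a logging call, print a status summary right after the step
print(f"Loaded {len(df)} rows and {len(df.columns)} columns")
df.head()  # in Jupyter, the cell's last expression renders as a table
```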
Cons

PRs, Version Control: Notebooks are stored in a JSON-like structure and are not always rendered well on GitHub. PRs can also sometimes be unclear.
Code Reusability: If we wanted to make a module of reusable scripts, they would have to be placed in a separate Python file. Jupyter notebooks are meant for linear coding. Functions and classes can of course be defined, but they can only be used in the notebooks themselves. Usually this isn't much of a problem, because Pandas covers almost all the functions created in `dspace-python`, and ideally the whole pipeline would occur in one notebook. Notebooks do not lend themselves to elegant object-oriented design, but rather to brute-force scripting.

Local setup: To run a Jupyter notebook, a user will need to have conda installed locally. Google Colab is an option, but I'm not familiar enough with the platform to know what obstacles we might encounter. However, downloading conda and running `jupyter notebook` takes 5 minutes. We'd have to weigh that decision down the road.

Pandas
I have trouble thinking of any cons with using Pandas, given what we have right now. Excel spreadsheets are great for humans to interact with, but CSVs are far more appropriate for data ingestion.
Pandas is a Python package that efficiently manipulates dataframes. I've reconfigured `vireo.py` into a Jupyter notebook with pandas. Pandas already contains the functionality and error handling for the following functions written in `vireo.py` (a sketch of pandas equivalents follows the list):
log_info
row_values
iter_rows
matchingRows
col_index_of
col_names
createFromExcelFile
save
_derive_id_hash
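For reference, here is a rough sketch of the pandas idioms that could stand in for these helpers. The original functions' exact semantics are assumptions based on their names, and the filename and column names are hypothetical:

```python
import pandas as pd

df = pd.read_excel("theses.xlsx")            # replaces createFromExcelFile
df.to_excel("theses_out.xlsx", index=False)  # replaces save

cols = list(df.columns)                      # replaces col_names
idx = df.columns.get_loc("ID")               # replaces col_index_of

# replaces matchingRows: boolean indexing selects rows by value
approved = df[df["Status"] == "Approved"]

# replaces iter_rows / row_values, when iteration is truly needed
for _, row in df.iterrows():
    print(row["ID"], row["Title"])
```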
For example, `_derive_id_hash`…

…becomes…
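The original snippet isn't reproduced here, but as an illustration, a hand-rolled helper that builds a lookup from ID to row might collapse into a pandas one-liner. The function body below is a guess from the name `_derive_id_hash`, not the actual `vireo.py` code:

```python
import pandas as pd

# Hypothetical hand-rolled helper, guessed from the name _derive_id_hash:
# walk every row and build a dict mapping thesis ID to row position
def derive_id_hash(rows, id_index):
    id_hash = {}
    for position, row in enumerate(rows):
        id_hash[row[id_index]] = position
    return id_hash

# Rough pandas equivalent: the ID column is already a Series, so the
# lookup can be built in one line (column name "ID" is an assumption)
df = pd.read_excel("theses.xlsx")
id_hash = {value: position for position, value in enumerate(df["ID"])}
```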
Pandas is already tailored to searching, selecting, and filtering, so there's rarely a need to do any heavy lifting. If it sounds like something that should exist, it probably is already in the package. So when loading a spreadsheet…

…it becomes…
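The original before-and-after code isn't preserved above, but the contrast is typically something like the following; the filename is hypothetical, and the openpyxl version is an illustrative stand-in for the hand-rolled loader:

```python
# Before: walk the workbook cell by cell with openpyxl
from openpyxl import load_workbook

workbook = load_workbook("theses.xlsx")
sheet = workbook.active
rows = [[cell.value for cell in row] for row in sheet.iter_rows()]

# After: one line with pandas
import pandas as pd

df = pd.read_excel("theses.xlsx")
```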
In tandem with Jupyter notebooks, Pandas becomes simple and powerful. CSVs are imported and written in one line, and map, reduce, and filter operations are straightforward and intuitive. Jupyter is built to display tables with pandas, making logging and debugging clearer. If there's tabular data that needs to be processed, and it can be done locally, chances are Jupyter and pandas are your best bet.
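A quick illustration of those one-liners and of map, filter, and reduce in pandas (all filenames and column names are hypothetical):

```python
import pandas as pd

df = pd.read_csv("theses.csv")                  # one-line import

# map: recode a column's values
df["Embargoed"] = df["Embargo"].map({"Y": True, "N": False})

# filter: boolean indexing selects matching rows
embargoed = df[df["Embargoed"]]

# reduce: aggregate a column down to a single value
total_pages = df["Page Count"].sum()

embargoed.to_csv("embargoed.csv", index=False)  # one-line export
```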
Lastly, neither Jupyter nor Pandas is going anywhere. They are industry standards, and if we hit any bugs or issues, plenty of other people will have had exactly the same ones.