[SPIKE] Investigate the Intake library for improving harvesting

aaron-collier commented 3 years ago

https://intake.readthedocs.io/en/latest/index.html

jermnelson commented 3 years ago

Intake Analysis

Intake is a Python library for accessing heterogeneous data in a uniform fashion. While Intake has strong support for CSV files, other data formats and sources, such as SQL databases, are supported through separate data loader driver plug-ins. Intake is closely integrated with Dask for parallel and scalable processing of large datasets that exceed the available memory on typical laptops or servers.

The main output for Intake data sources is either a Pandas DataFrame, a NumPy array, or a list of Python objects (with Python dict being the suggested choice). Both Pandas DataFrames and NumPy arrays are commonly used as inputs into machine learning workflows but may not be the best choice for DLME.

Pros

Actively developed
Abstracts data sources from YAML configurations that are bundled into data Catalogs
Widely used for machine learning data applications
Catalogs can be constructed in hierarchy

Cons

Not a true ETL solution but can be used as part of ETL pipelines
Would likely require development of Intake plug-ins or data Transforms for OAI-PMH and MARC XML sources

Questions

If each DLME source is converted to a Pandas dataframe, would that be sufficient for the end user needs?

aaron-collier commented 3 years ago

@jermnelson Thank you!

First glance question: Can you provide a quick explanation of what Pandas DataFrames are and why they may help?

jcoyne commented 3 years ago

@aaron-collier Pandas: Python data structures/analysis library. It's used a lot in data science: https://pandas.pydata.org/docs/index.html pandas.DataFrame: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html

jermnelson commented 3 years ago

Building on @jcoyne's comment, a Pandas DataFrame is a two-dimensional data structure, similar to a spreadsheet or a SQL table, with rows and columns. DataFrames have relatively easy I/O operations, with methods for ingesting CSV, JSON, XML, HTML, SQL, and other data sources (even the user's clipboard!) and similar serialization methods for many of these formats.

DataFrames have rich descriptive statistical methods as well as support for a number of built-in visualizations of the underlying data.
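As a rough illustration of the points above, here is a minimal sketch of building a DataFrame, summarizing it, and serializing it. The records and field names (id, title, pages) are made up for the example and are not actual DLME fields.

```python
import io

import pandas as pd

# Hypothetical harvested records; the field names are illustrative only.
records = [
    {"id": "rec-1", "title": "Manuscript A", "pages": 12},
    {"id": "rec-2", "title": "Manuscript B", "pages": 30},
    {"id": "rec-3", "title": "Map C", "pages": 1},
]

# One row per record, one column per field.
df = pd.DataFrame(records)
n_rows, n_cols = df.shape

# Descriptive statistics (count, mean, min, max, ...) for a numeric column.
page_stats = df["pages"].describe()

# Serialize to CSV text, then read the CSV back into a new DataFrame.
csv_text = df.to_csv(index=False)
round_tripped = pd.read_csv(io.StringIO(csv_text))

# Convert back to a list of plain Python dicts, one per row.
as_dicts = df.to_dict(orient="records")
```

The same DataFrame could also be written with to_json or to_excel; which format suits DLME is a separate question.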

jacobthill commented 3 years ago

@jermnelson I think it depends on what we want to accomplish with writing to a Pandas DataFrame. Currently harvest scripts harvest data in many formats and then the traject configs map those formats to our intermediate representation (IR), a json object. Converting all incoming data to a DF would give us a relatively homogeneous data set to:

test changes since the last harvest, drop duplicate records, and write other QA tests against.
simplify our traject configs somewhat since we could export the DF as a csv file and use only the traject csv writer, which is considerably faster than others as far as I can tell.

Or is the intention to scrap traject at some point and use intake to convert the DF into the json object we load into DLME?

jacobthill commented 3 years ago

Steps on the ETL workflow that would be simplified with all incoming data being converted to a Pandas DataFrame:

Remove duplicate records with drop_duplicates
Compare the last DataFrame harvested to the new harvest with shape, equals, and Series.compare, making it easy to pinpoint data changes since the last harvest.
Get all fields from raw data with columns, to ensure none are missed during mapping and check to see if any new fields have been added since last harvest (will indicate whether the mapping file needs to be manually updated).
Easy pre-transform QA reports that can be compared with post-transform QA reports, ensuring all values were successfully transformed.
Reduces the number of macros and tests needed in traject, since we would only use the traject csv writer
Speeds up transformation since the csv writer is faster
Reduces the complexity of mapping (e.g. no xml namespaces, etc.) making it easier for others to learn

At the cost of:

Revising/completely rewriting all configs (or we could harvest data and write it in whatever format it comes and only use the DataFrame for analysis)
Possibly introducing errors while writing the source data to the DF; we would need more tests to ensure this doesn't happen.

Checking against the automation process steps, it seems like Intake would meet the minimum requirements.
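The DataFrame checks listed above (drop_duplicates, shape, equals, columns, and Series.compare) could look roughly like the sketch below. The two harvests and their field names are invented for illustration, and aligning on an id column before calling Series.compare is an assumption about how records would be matched, since compare requires identically labeled Series.

```python
import pandas as pd

# Hypothetical previous and current harvests; field names are illustrative.
previous = pd.DataFrame(
    [
        {"id": "rec-1", "title": "Manuscript A"},
        {"id": "rec-2", "title": "Manuscript B"},
    ]
)
current = pd.DataFrame(
    [
        {"id": "rec-1", "title": "Manuscript A", "language": "ar"},
        {"id": "rec-2", "title": "Manuscript B (revised)", "language": "fa"},
        {"id": "rec-2", "title": "Manuscript B (revised)", "language": "fa"},
    ]
)

# Remove exact duplicate rows from the new harvest.
deduped = current.drop_duplicates().reset_index(drop=True)

# Coarse checks: did the dimensions or any values change at all?
same_shape = previous.shape == deduped.shape
unchanged = previous.equals(deduped)

# Fields that appear in the new harvest but not in the previous one;
# a non-empty result suggests the mapping may need a manual update.
new_fields = sorted(set(deduped.columns) - set(previous.columns))

# Series.compare needs identically labeled Series, so index both by id
# (assumed here to be a stable record identifier) before comparing.
prev_titles = previous.set_index("id")["title"]
curr_titles = deduped.set_index("id")["title"]
title_changes = prev_titles.compare(curr_titles)
```

In this sketch title_changes holds one row per changed record, with the old value under "self" and the new value under "other". Records added or removed between harvests would need to be handled before the comparison, for example by comparing only the ids present in both.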

justinlittman commented 3 years ago

In my experience, Pandas Dataframes are their own thing and are not idiomatic python. That is, you need to understand Dataframes; they are not intuitive if you just know python. I have found them rather tricky to work with, especially when only used occasionally. In general, I wouldn't recommend them (over plain-old python data structures) unless you have large datasets.
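For comparison, here is a rough sketch of deduplication and field discovery using only built-in Python data structures. The record shapes and the helper names (drop_duplicate_records, all_fields) are hypothetical and not part of any existing harvest script.

```python
import json

# Hypothetical harvested records as plain dicts; field names are illustrative.
records = [
    {"id": "rec-1", "title": "Manuscript A"},
    {"id": "rec-2", "title": "Manuscript B", "language": ["fa"]},
    {"id": "rec-2", "title": "Manuscript B", "language": ["fa"]},
]


def drop_duplicate_records(records):
    """Return records with exact duplicates removed, keeping first-seen order."""
    seen = set()
    unique = []
    for record in records:
        # A sorted-key JSON string is a hashable fingerprint of the record,
        # and works even when values are lists or nested dicts.
        key = json.dumps(record, sort_keys=True)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def all_fields(records):
    """Return the sorted set of field names used by any record."""
    fields = set()
    for record in records:
        fields.update(record)
    return sorted(fields)


unique_records = drop_duplicate_records(records)
field_names = all_fields(unique_records)
```

This covers the simple cases without a new dependency, but comparisons across harvests and summary statistics would have to be written by hand.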

jacobthill commented 3 years ago

@justinlittman that is a good point. I am comfortable with Pandas but if someone else needs to learn my role it may pose a challenge. It is worth noting that the features we need are relatively straightforward since we are not using Pandas to transform the data. Essentially I envision us using the methods noted above which mainly follow the df.method() syntax. I don't anticipate it being that complex but we should definitely consider this and make sure we have a clear understanding of what we plan to do with it and consider alternative strategies.

justinlittman commented 3 years ago

Yeah, it raises the bar a couple of notches on what python expertise someone in your role must have.

jermnelson commented 3 years ago

I'm not sure I agree @justinlittman. With Pandas becoming ubiquitous for data scientists and engineers in big data and machine learning contexts, we may find people are more familiar with using Pandas with python.

justinlittman commented 3 years ago

Yes, if you expect that the user is a data scientist or engineer with experience in big data and machine learning then I agree.

aaron-collier commented 3 years ago

Moving notes and discussion to analysis doc and closing.

sul-dlss / dlme-harvest