sul-dlss / dlme-harvest

DLME Scripts for harvesting data from providers

[SPIKE] Investigate the Intake library for improving harvesting #87

Closed aaron-collier closed 3 years ago

aaron-collier commented 3 years ago

https://intake.readthedocs.io/en/latest/index.html

jermnelson commented 3 years ago

Intake Analysis

Intake is a Python library for accessing heterogeneous data in a uniform fashion. Intake has strong built-in support for CSV files, while other data formats and sources, such as SQL databases, are supported through separate data-loader driver plug-ins. Intake is closely integrated with Dask for parallel, scalable processing of datasets that exceed the memory available on a typical laptop or server.

The main output of an Intake data source is a Pandas DataFrame, a NumPy array, or a list of Python objects (with Python dicts being the suggested choice). Both Pandas DataFrames and NumPy arrays are commonly used as inputs to machine learning workflows, but they may not be the best choice for DLME.

Pros

Cons

Questions

  1. If each DLME source is converted to a Pandas DataFrame, would that be sufficient for the end users' needs?

aaron-collier commented 3 years ago

@jermnelson Thank you!

First-glance question: Can you provide a quick explanation of what Pandas DataFrames are and why they may help?

jcoyne commented 3 years ago

@aaron-collier Pandas: Python data structures/analysis library. It's used a lot in data science: https://pandas.pydata.org/docs/index.html

pandas.DataFrame: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html

jermnelson commented 3 years ago

Building on @jcoyne's comment, a Pandas DataFrame is a two-dimensional data structure with rows and columns, similar to a spreadsheet or a SQL table. DataFrames have relatively easy I/O: there are methods for ingesting CSV, JSON, XML, HTML, SQL, and other data sources (even the user's clipboard!), with matching serialization methods for many of these formats.
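To make the I/O point concrete, here is a minimal sketch of a CSV-to-JSON round trip. The sample records and column names are made up for illustration and are not taken from any DLME provider; only `pandas.read_csv`, `DataFrame.to_json`, and `pandas.read_json` are used.

```python
import io

import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
csv_text = "id,title,language\n1,First record,ara\n2,Second record,fas\n"

# Ingest: read_csv accepts a path, a URL, or a file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# Serialize: the same table as a JSON array with one object per row.
json_text = df.to_json(orient="records")

# Round trip: read the JSON back into a DataFrame.
df_again = pd.read_json(io.StringIO(json_text))

print(df_again)
```

The in-memory `io.StringIO` buffers stand in for files here; in practice the same calls take file paths.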

DataFrames have rich descriptive statistical methods as well as support for a number of built-in visualizations of the underlying data.
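As a rough illustration of the descriptive statistics mentioned above, the sketch below summarizes a small, invented table. The `language` and `page_count` columns are assumptions chosen for the example, not fields from the DLME data.

```python
import pandas as pd

# Hypothetical harvested metadata; values are invented for the example.
df = pd.DataFrame(
    {
        "language": ["ara", "ara", "fas", "ota"],
        "page_count": [120, 80, 300, 100],
    }
)

# describe() reports count, mean, std, min, quartiles, and max for a numeric column.
summary = df["page_count"].describe()

# value_counts() tallies how often each distinct value occurs.
per_language = df["language"].value_counts()

print(summary)
print(per_language)
```

The built-in plotting methods (for example `DataFrame.plot`) additionally require matplotlib to be installed, so they are left out of this sketch.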

jacobthill commented 3 years ago

@jermnelson I think it depends on what we want to accomplish by writing to a Pandas DataFrame. Currently, the harvest scripts collect data in many formats, and the traject configs then map those formats to our intermediate representation (IR), a JSON object. Converting all incoming data to a DF would give us a relatively homogeneous data set to:

Or is the intention to scrap traject at some point and use Intake to convert the DF into the JSON object we load into DLME?
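For the second option, a minimal sketch of turning DataFrame rows into JSON objects might look like the following. This does not reproduce the traject mappings or the actual DLME IR schema: the field names and the `row_to_record` helper are hypothetical, and only plain pandas calls are used rather than anything Intake-specific.

```python
import json

import pandas as pd

# Hypothetical harvested rows; real provider columns will differ.
df = pd.DataFrame(
    {
        "identifier": ["rec-1", "rec-2"],
        "title": ["First record", "Second record"],
        "creator": ["Unknown", None],
    }
)


def row_to_record(row: dict) -> dict:
    """Hypothetical helper: keep only the fields that have a value."""
    return {key: value for key, value in row.items() if pd.notna(value)}


# One plain dict per row, with missing values dropped.
records = [row_to_record(row) for row in df.to_dict(orient="records")]

# Newline-delimited JSON, one object per harvested record.
ndjson = "\n".join(json.dumps(record) for record in records)

print(ndjson)
```

Any real mapping to the IR would need the field-level logic that currently lives in the traject configs; the sketch only shows the mechanical row-to-object step.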

jacobthill commented 3 years ago

Steps on the ETL workflow that would be simplified with all incoming data being converted to a Pandas DataFrame:

At the cost of:

Checking against the automation process steps, it seems like Intake would meet the minimum requirements.

justinlittman commented 3 years ago

In my experience, Pandas DataFrames are their own thing and are not idiomatic Python. That is, you need to understand DataFrames; they are not intuitive if you just know Python. I have found them rather tricky to work with, especially when only used occasionally. In general, I wouldn't recommend them (over plain-old Python data structures) unless you have large datasets.
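For comparison, a minimal sketch of the plain-Python alternative, using only the standard library, could look like this. The sample rows and column names are invented for the example.

```python
import csv
import io
from collections import Counter

# Hypothetical sample data; the column names are illustrative only.
csv_text = (
    "id,title,language\n"
    "1,First record,ara\n"
    "2,Second record,fas\n"
    "3,Third record,ara\n"
)

# Each row becomes an ordinary dict keyed by the header names.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Counting and filtering use regular Python idioms.
language_counts = Counter(row["language"] for row in rows)
arabic_titles = [row["title"] for row in rows if row["language"] == "ara"]

print(language_counts)
print(arabic_titles)
```

Note that `csv.DictReader` leaves every value as a string, so any type conversion has to be done explicitly.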

jacobthill commented 3 years ago

@justinlittman that is a good point. I am comfortable with Pandas, but if someone else needs to learn my role, it may pose a challenge. It is worth noting that the features we need are relatively straightforward, since we are not using Pandas to transform the data. Essentially, I envision us using the methods noted above, which mainly follow the df.method() syntax. I don't anticipate it being that complex, but we should definitely consider this, make sure we have a clear understanding of what we plan to do with it, and consider alternative strategies.

justinlittman commented 3 years ago

Yeah, it raises the bar a couple of notches on what Python expertise someone in your role must have.

jermnelson commented 3 years ago

I'm not sure I agree, @justinlittman. With Pandas becoming ubiquitous among data scientists and engineers in big data and machine learning contexts, we may find that people are more familiar with using Pandas with Python.

justinlittman commented 3 years ago

Yes, if you expect that the user is a data scientist or engineer with experience in big data and machine learning, then I agree.

aaron-collier commented 3 years ago

Moving notes and discussion to analysis doc and closing.