Intake is a Python library for accessing heterogeneous data in a uniform fashion. While Intake has strong built-in support for CSV files, other data formats and sources, such as SQL databases, are supported through additional data loader driver plug-ins. Intake is closely integrated with Dask to enable parallel, scalable processing of large datasets that exceed the available memory on typical laptops or servers.
The main output of an Intake datasource is either a Pandas DataFrame, a NumPy array, or a list of Python objects (with Python dict being the suggested choice). Both Pandas DataFrames and NumPy arrays are commonly used as inputs to machine learning workflows but may not be the best choice for DLME.
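A minimal sketch of how an Intake datasource yields these outputs, assuming the built-in CSV driver and a made-up file path:

```python
import intake

# Open a datasource with Intake's built-in CSV driver (the path is illustrative only).
source = intake.open_csv("harvests/records.csv")

df = source.read()      # materialize everything as a Pandas DataFrame
ddf = source.to_dask()  # or get a Dask DataFrame for parallel, out-of-core work

print(df.head())
print(ddf.npartitions)
```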
@jermnelson Thank you!
First-glance question: can you provide a quick explanation of what Pandas DataFrames are and why they may help?
@aaron-collier Pandas: Python data structures/analysis library. It's used a lot in data science: https://pandas.pydata.org/docs/index.html pandas.DataFrame: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html
Building on @jcoyne's comment, a Pandas DataFrame is a two-dimensional data structure, similar to a spreadsheet or a SQL table, with rows and columns. DataFrames have relatively easy I/O operations, with methods for ingesting CSV, JSON, XML, HTML, SQL, and other data sources (even the user's clipboard!), along with corresponding serialization methods for many of these formats.
DataFrames have rich descriptive statistical methods as well as support for a number of built-in visualizations of the underlying data.
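A quick, hypothetical illustration of the I/O, statistics, and plotting methods mentioned above; the file path and column name are made up, and the plot call requires matplotlib:

```python
import pandas as pd

# Ingest a harvest file (path is illustrative only).
df = pd.read_csv("harvest.csv")

df.head()                   # peek at the first rows
df.describe(include="all")  # descriptive statistics for every column
df["language"].value_counts().plot(kind="bar")  # built-in visualization

# Serialize back out in another format.
df.to_json("harvest.json", orient="records")
```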
@jermnelson I think it depends on what we want to accomplish by writing to a Pandas DataFrame. Currently, harvest scripts harvest data in many formats, and then the traject configs map those formats to our intermediate representation (IR), a JSON object. Converting all incoming data to a DataFrame would give us a relatively homogeneous data set to work from.
Or is the intention to scrap traject at some point and use Intake to convert the DataFrame into the JSON object we load into DLME?
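If that were the direction, the conversion itself is straightforward; a hypothetical sketch (the record fields are made up, not actual DLME fields):

```python
import pandas as pd

# A harvested record set already sitting in a DataFrame (fields are illustrative).
df = pd.DataFrame([{"id": "rec-1", "title": "Example record", "language": "ar"}])

records = df.to_dict(orient="records")    # list of Python dicts (one per record)
json_blob = df.to_json(orient="records")  # the same records as a JSON array string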
Steps in the ETL workflow that would be simplified with all incoming data being converted to a Pandas DataFrame (a sketch of these methods follows below):

- `drop_duplicates`, to remove duplicate records.
- `shape`, `equals`, and `Series.compare`, making it easy to pinpoint data changes since the last harvest.
- `columns`, to ensure no fields are missed during mapping and to check whether any new fields have been added since the last harvest (which will indicate whether the mapping file needs to be manually updated).

At the cost of:
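To illustrate the comparison methods listed above, here is a hypothetical sketch; the file paths and column names are assumptions, not part of the actual harvest scripts:

```python
import pandas as pd

# Hypothetical snapshots of the previous and current harvests.
previous = pd.read_json("harvests/previous.json")
current = pd.read_json("harvests/current.json")

# Drop duplicate records from the latest harvest.
current = current.drop_duplicates().reset_index(drop=True)

# Compare sizes between harvests.
print("previous:", previous.shape, "current:", current.shape)

# Spot fields that appeared or disappeared since the last harvest; new fields
# signal that the mapping file may need a manual update.
print("new fields:", set(current.columns) - set(previous.columns))
print("missing fields:", set(previous.columns) - set(current.columns))

# If the frames line up, pinpoint exactly which values changed.
if current.columns.equals(previous.columns) and current.shape == previous.shape:
    if not current.equals(previous):
        print(current["title"].compare(previous["title"]))  # Series.compare
```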
Checking the automation process steps, it seems like Intake would meet the minimum requirements.
In my experience, Pandas DataFrames are their own thing and are not idiomatic Python. That is, you need to understand DataFrames; they are not intuitive if you just know Python. I have found them rather tricky to work with, especially when only used occasionally. In general, I wouldn't recommend them (over plain old Python data structures) unless you have large datasets.
@justinlittman that is a good point. I am comfortable with Pandas, but if someone else needs to learn my role it may pose a challenge. It is worth noting that the features we need are relatively straightforward, since we are not using Pandas to transform the data. Essentially, I envision us using the methods noted above, which mainly follow the df.method() syntax. I don't anticipate it being that complex, but we should definitely consider this, make sure we have a clear understanding of what we plan to do with it, and consider alternative strategies.
Yeah, it raises the bar a couple of notches on what Python expertise someone in your role must have.
I'm not sure I agree, @justinlittman. With Pandas becoming ubiquitous for data scientists and engineers in big data and machine learning contexts, we may find people are more familiar with using Pandas with Python.
Yes, if you expect that the user is a data scientist or engineer with experience in big data and machine learning then I agree.
Moving notes and discussion to the analysis doc and closing.
https://intake.readthedocs.io/en/latest/index.html