qri-io / starlib

qri's standard library for starlark
MIT License
111 stars, 29 forks

dataframe headers don't make it into the qri dataset #113

Open chriswhong opened 2 years ago

chriswhong commented 2 years ago

Given the following code, we expect the resulting Qri dataset body to have a column named firstname. Instead, the value from the first row shows up as the column name.

# CSV Download Code Sample
# This really works! Click 'Dry Run' to try it ↗

# import dependencies
load("http.star", "http") # `http` lets us talk to the internets
load("dataframe.star", "dataframe") # `dataframe` gives us powerful dataset manipulation capabilities

# with dependencies loaded, download a CSV
# this fetches a "popular baby names" dataset from the NYC Open Data Portal
csvDownloadUrl = "https://data.cityofnewyork.us/api/views/25th-nujf/rows.csv?accessType=DOWNLOAD"
rawCSV = http.get(csvDownloadUrl).body()

# parse the CSV (string) into a qri DataFrame
theData = dataframe.parse_csv(rawCSV)

# we can do filtering of the DataFrame and assign it back to its original variable
# filter for first names that start with 'V'
theData = theData[[x.startswith('V') for x in theData["Child's First Name"]]]

# each column in the DataFrame is a Series
# make a new `Series` with only the unique values
uniqueSeries = theData["Child's First Name"].unique()

# iterate over the Series and convert each string to lowercase
for idx, val in enumerate(uniqueSeries):
    uniqueSeries[idx] = val.lower()

# sort the Series alphabetically
uniqueSeries = sorted(uniqueSeries)

# make an empty DataFrame, assign our Series to be a column named 'firstname'
# this will become the next version of our dataset's body
newBody = dataframe.DataFrame()
newBody['firstname'] = uniqueSeries

# get the previous version of this dataset
workingDataset = dataset.latest()
# set the body of the dataset to be our new body
workingDataset.body = newBody

# finally, commit the changes
# the last step of every transform is always `dataset.commit(Dataset)`
dataset.commit(workingDataset)

dustmop commented 2 years ago

Figured out the root cause of this bug. The line workingDataset.body = newBody does not correctly copy the columns from newBody to the workingDataset object, so the column names never reach the dataset body. The fix should be fairly straightforward to make.
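
To make the failure mode concrete, here is a toy model in Python of how dropping column names on assignment can produce the symptom in the report. It is only a sketch: none of these functions or data structures come from starlib or qri, the sample names are made up, and the real assignment code may differ. It assumes the body is serialized without a header line when no columns are present, and that whatever reads it back treats the first line as the header.

```python
# Toy model only: every name here is hypothetical, not starlib's or qri's API.

def make_frame(columns, rows):
    """A stand-in DataFrame: column names plus a list of rows."""
    return {"columns": list(columns), "rows": [list(r) for r in rows]}

def set_body_buggy(dataset, frame):
    """Models the reported behavior: rows are copied, column names are not."""
    dataset["body"] = {"columns": [], "rows": [list(r) for r in frame["rows"]]}

def set_body_fixed(dataset, frame):
    """Models the intended behavior: rows and column names are both copied."""
    dataset["body"] = make_frame(frame["columns"], frame["rows"])

def to_csv_lines(body):
    """Serialize a body; a header line is written only if columns exist."""
    lines = []
    if body["columns"]:
        lines.append(",".join(body["columns"]))
    for row in body["rows"]:
        lines.append(",".join(row))
    return lines

def read_header(lines):
    """A reader that assumes the first line is always the header."""
    return lines[0].split(",")

# Made-up sample data standing in for the sorted, lowercased names.
new_body = make_frame(["firstname"], [["valentina"], ["vera"]])

buggy = {}
set_body_buggy(buggy, new_body)
fixed = {}
set_body_fixed(fixed, new_body)

# With the columns dropped, the first data row is mistaken for the header.
buggy_header = read_header(to_csv_lines(buggy["body"]))
fixed_header = read_header(to_csv_lines(fixed["body"]))
```

In this model the buggy path reports the first name as the column name, while the fixed path reports firstname, which matches the expected and observed behavior described at the top of the issue.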