An ongoing data-cleaning and presentation project using Llama 3.1, AWS, and Streamlit to analyze and highlight Native Hawaiian data. Cleans and unifies data from multiple sources, utilizing AWS for storage and serving interactive visualizations through Streamlit.
Description:
Continues work started in issue listed at the end of this section. I decided to keep the basic implementation and functionality of the custom dataset object dfstore, so as to better track and consolidate the cleaning process for datasets originating from tabbed Excel files. However, the original implementation has features that I do not think are worthwhile at this point and that would only slow development. Thus I need to revisit the dfstore class, trimming it down and simplifying its functionality.
Goal:
Revisit and rework the dfstore class, narrowing its focus to making my task of cleaning the tabbed Excel data from the Native Hawaiian Data Book 2023 easier.
Acceptance Criteria:
[x] Revisit need for HashedKeyDict class and abbreviation mappings in general
[x] Originally intended data transparency functionality of dfstore is either simplified or removed altogether (i.e., no need to store the original dataset as it was; likely enough to just point to its origin in some way)
[x] ~Basic dfstore tagging to track what has and has not been cleaned~
[x] ~Choose file format to save dfstore in (i.e., keep pickled data or save to JSON or XML); will require some investigation~
[x] ~Revisit dfstore naming conventions as needed~
Decided to remove the original dfstore implementation, as pandas and polars already offer functionality that covers the basis of what I wanted.
Will still want to implement a custom class that allows for tracking when a dataset was last updated and the like, but it is easier to start from scratch than to refactor.
Also decided to remove anything related to the previous attempt at abbreviation conversion, as full names will give the LLM fuller context in future interactions with the dataset.
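A minimal sketch of what that from-scratch class might look like, assuming it only needs to hold one sheet, point back to its origin, and record when it was last touched. The name `TrackedSheet`, its fields, and the `apply` method are all hypothetical and not part of the existing code; the `frame` field is assumed to be a pandas or polars DataFrame but is left untyped so the sketch runs with the standard library alone.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Callable


def _now() -> datetime:
    """Current time in UTC, used for the last-updated stamp."""
    return datetime.now(timezone.utc)


@dataclass
class TrackedSheet:
    """Hypothetical wrapper: one sheet plus minimal provenance."""

    name: str    # full table name, kept unabbreviated for LLM context
    origin: str  # pointer to the source (e.g. workbook and tab), not a copy of it
    frame: Any   # assumed to be a pandas or polars DataFrame
    cleaned: bool = False
    last_updated: datetime = field(default_factory=_now)

    def apply(
        self, step: Callable[[Any], Any], *, mark_cleaned: bool = False
    ) -> "TrackedSheet":
        """Run one cleaning step on the frame and refresh the timestamp."""
        self.frame = step(self.frame)
        self.last_updated = _now()
        if mark_cleaned:
            self.cleaned = True
        return self


# Usage with a plain list standing in for a DataFrame; the file and tab
# names are placeholders, not real project paths.
sheet = TrackedSheet(
    name="Example table",
    origin="example.xlsx :: Sheet1",
    frame=["  Honolulu ", "Maui  "],
)
sheet.apply(lambda rows: [r.strip() for r in rows], mark_cleaned=True)
```

Keeping only a pointer to the origin, rather than a copy of the raw data, matches the simplified transparency requirement in the acceptance criteria, and the `cleaned` flag is an optional stand-in for the tagging idea that was dropped.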