Implement the raw data summary workflow step

Joe-Heffer-Shef commented 1 year ago

Is your feature request related to a problem? Please describe.

Describe the solution you'd like

Describe alternatives you've considered

https://quarto.org/
R Markdown
Excel

Additional context

G-Accad commented 1 year ago

Is your feature request related to a problem? Please describe.

With large volumes of raw data, it is hard to understand the data's characteristics and key insights quickly. Without a structured summary of the raw data, making informed decisions or performing further analysis is time-consuming and error-prone.

Describe the solution you'd like

This workflow step should involve the following components:

Data Quality Checks: Perform data quality checks to identify missing values, duplicates, outliers, and any other data anomalies. This step ensures that the data is clean and reliable for analysis.
Data Profiling: Automatically generate descriptive statistics and metrics for the selected columns in the raw data, including measures like mean, median, standard deviation, and count. This will provide a high-level overview of the data's distribution and characteristics.
Data Visualization: Create visualizations such as histograms, box plots, and scatter plots for numeric data, and bar charts for categorical data.
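
The data quality checks listed above could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the function name `quality_report`, the list-of-dicts row format, the choice to treat `None` and empty strings as missing, and the 1.5 × IQR outlier rule are all assumptions made for the example. It uses only the Python standard library.

```python
import statistics
from collections import Counter


def quality_report(rows, column):
    """Count missing values, duplicate rows and IQR outliers for one numeric column.

    `rows` is assumed to be a list of dicts that all share the same keys.
    """
    # Assumption: None and empty strings mark missing values in the raw data.
    values = [row.get(column) for row in rows]
    missing = sum(1 for v in values if v is None or v == "")
    present = [float(v) for v in values if v is not None and v != ""]

    # A row counts as a duplicate when an identical row has already been seen.
    seen = Counter(tuple(sorted(row.items())) for row in rows)
    duplicates = sum(n - 1 for n in seen.values())

    # Flag outliers with the 1.5 * IQR rule (one common convention among several).
    outliers = []
    if len(present) >= 4:
        q1, _, q3 = statistics.quantiles(present, n=4)
        spread = q3 - q1
        low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
        outliers = [v for v in present if v < low or v > high]

    return {"missing": missing, "duplicates": duplicates, "outliers": outliers}
```

Calling `quality_report(rows, "weight")` returns a small dict that a report template could render as a table. Other outlier definitions (for example z-scores) would slot into the same place.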

Describe alternatives you've considered

Manual data summary: for example, in Excel (too time-consuming)

Workflow:

graph LR
    func["Function runs"]
    input1("Type of Dataset") --> func
    input2("Columns of interest") --> func
    func --> output1("Generate Basic Descriptive Statistics")
    output1 --> output2("Visualizations: Histograms, Box Plots")
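
One way to read the workflow above as code is sketched below: a single function takes the dataset type and the columns of interest, computes basic descriptive statistics, and then prepares histogram bin counts that a plotting step could draw. The names `summarise` and `histogram`, the `"tabular"` dataset type, and the list-of-dicts input are hypothetical choices for this example only. Plotting itself is left out so the sketch stays within the standard library.

```python
import statistics


def summarise(rows, columns, dataset_type="tabular", bins=5):
    """Run both workflow stages (statistics, then histogram counts) per column."""
    if dataset_type != "tabular":
        # Only one dataset type is sketched; others would need their own handling.
        raise ValueError(f"unsupported dataset type: {dataset_type}")
    summary = {}
    for column in columns:
        # Skip missing entries (None or empty string) before converting to float.
        values = [float(r[column]) for r in rows if r.get(column) not in (None, "")]
        if not values:
            # Nothing to profile: record the empty column and move on.
            summary[column] = {"stats": {"count": 0}, "histogram": []}
            continue
        stats = {
            "count": len(values),
            "mean": statistics.fmean(values),
            "median": statistics.median(values),
            "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
        }
        summary[column] = {"stats": stats, "histogram": histogram(values, bins)}
    return summary


def histogram(values, bins):
    """Count values in equal-width bins; the counts can feed any plotting library."""
    low, high = min(values), max(values)
    width = (high - low) / bins or 1.0  # avoid a zero width when all values are equal
    counts = [0] * bins
    for v in values:
        index = min(int((v - low) / width), bins - 1)  # the maximum goes in the last bin
        counts[index] += 1
    return counts
```

The returned dict keeps the statistics and the histogram counts side by side for each column, so a Quarto or R Markdown report could render either part independently.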

G-Accad commented 1 year ago

Quarto vs R Markdown

| Aspect | Quarto | R Markdown |
| --- | --- | --- |
| Ease of Use | + User-friendly, especially for non-technical users | - Requires some familiarity with R and Markdown syntax |
| | + Simplified YAML configuration | |
| | + Built-in support for Pandoc templates | |
| Document Structure | + Flexible structure with notebooks, reports, and documents | - Standard Markdown structure with YAML header |
| | | - Less flexibility in structuring documents |
| | + Notebook-style interactivity | - Limited interactivity |
| Interactivity | + Interactive code chunks | + Supports interactive code chunks (with R) |
| | + Data visualization with JavaScript | - Limited interactivity with other languages |
| Output Formats | + Multiple output formats (HTML, PDF, Word) | + Supports various output formats |
| | + Customizable templates | - Templates can be customized |
| Extensibility and Ecosystem | + Integration with the Quarto ecosystem | + Established R Markdown ecosystem with numerous packages |
| | + Growing community support | |
| Learning Curve | + Shorter learning curve for beginners | - Steeper learning curve for non-R users |
| | + Easier for non-programmers | - More programming knowledge required |

rcgsheffield / cured

Implement the raw data summary workflow step #1