rcgsheffield / cured

CURED project working area for RIT
https://docs.google.com/document/d/1ulfAxMoY3yxu5vaMsuQPPhuovEfECmOr82bA_wHgkGI/edit?usp=sharing
MIT License

Implement the raw data summary workflow step #1

Open Joe-Heffer-Shef opened 1 year ago

Joe-Heffer-Shef commented 1 year ago

Is your feature request related to a problem? Please describe. A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]

Describe the solution you'd like A clear and concise description of what you want to happen.

Describe alternatives you've considered A clear and concise description of any alternative solutions or features you've considered.

Additional context Add any other context or screenshots about the feature request here.

G-Accad commented 1 year ago

Is your feature request related to a problem? Please describe.

With large volumes of raw data, it is hard to understand the data's characteristics and key insights quickly. Without a structured summary of the raw data, making informed decisions or carrying out further analysis is time-consuming and error-prone.

Describe the solution you'd like

This workflow step should include the following components:

  1. Data Quality Checks: Perform data quality checks to identify missing values, duplicates, outliers, and any other data anomalies. This step ensures that the data is clean and reliable for analysis.
  2. Data Profiling: Automatically generate descriptive statistics and metrics for the selected columns in the raw data, including measures like mean, median, standard deviation, and count. This will provide a high-level overview of the data's distribution and characteristics.
  3. Data Visualization: Create visualizations such as histograms, box plots, and scatter plots for numeric data, and bar charts for categorical data.
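
The quality checks in item 1 could be sketched roughly as follows. This is a minimal illustration rather than the project's implementation: it uses only the Python standard library, the sample rows and the helper names (`count_missing`, `find_duplicates`, `iqr_outliers`) are hypothetical, and Tukey's 1.5 × IQR rule is just one assumed definition of "outlier".

```python
from statistics import quantiles

# Hypothetical sample rows; None marks a missing value.
rows = [
    {"id": 1, "age": 34, "site": "A"},
    {"id": 2, "age": None, "site": "B"},
    {"id": 3, "age": 36, "site": "A"},
    {"id": 3, "age": 36, "site": "A"},   # exact duplicate of the previous row
    {"id": 4, "age": 35, "site": None},
    {"id": 5, "age": 33, "site": "B"},
    {"id": 6, "age": 120, "site": "A"},  # implausible value
]


def count_missing(rows, columns):
    """Return {column: number of rows where the value is None or ''}."""
    return {c: sum(1 for r in rows if r.get(c) in (None, "")) for c in columns}


def find_duplicates(rows):
    """Return the indices of rows that exactly repeat an earlier row."""
    seen, dupes = set(), []
    for i, r in enumerate(rows):
        key = tuple(sorted(r.items(), key=lambda kv: kv[0]))
        if key in seen:
            dupes.append(i)
        seen.add(key)
    return dupes


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


ages = [r["age"] for r in rows if r["age"] is not None]
print(count_missing(rows, ["age", "site"]))
print(find_duplicates(rows))
print(iqr_outliers(ages))
```

Each check returns plain data (counts, row indices, values) rather than modifying the input, so the results can be reported in the summary without deciding yet how anomalies should be handled.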

Describe alternatives you've considered

  1. Manual data summary: for example, in Excel (too time-consuming).

Workflow:

graph LR
    func["Function Runs"]
    input1("Type of Dataset") --> func
    input2("Columns of interest") --> func
    func --> output1("Generate Basic Descriptive Statistics")
    output1 --> output2("Visualizations: Histograms, Box Plots")
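
The flow in the graph above might translate into a single entry point along these lines. This is only a sketch under stated assumptions: `summarise`, `describe` and `text_histogram` are hypothetical names, `dataset_type` is treated as a plain label because the thread does not say how it affects processing, and a text histogram stands in for real plots to keep the example dependency-free.

```python
from collections import Counter
from statistics import mean, median, stdev


def describe(values):
    """Count, mean, median and sample standard deviation of a numeric list."""
    return {
        "count": len(values),
        "mean": mean(values),
        "median": median(values),
        "std": stdev(values) if len(values) > 1 else 0.0,
    }


def text_histogram(values, bins=5, width=30):
    """Render equal-width bins as rows of '#' characters."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / bins or 1  # avoid zero-width bins when all values match
    counts = Counter(min(int((v - lo) / step), bins - 1) for v in values)
    peak = max(counts.values())
    lines = []
    for b in range(bins):
        bar = "#" * round(width * counts[b] / peak)
        lines.append(f"{lo + b * step:8.1f} | {bar} {counts[b]}")
    return "\n".join(lines)


def summarise(dataset_type, rows, columns):
    """Inputs (dataset type, columns of interest) -> statistics -> histograms."""
    report = {"dataset_type": dataset_type, "columns": {}}
    for col in columns:
        # Keep numeric values only; missing and non-numeric entries are skipped.
        values = [r[col] for r in rows if isinstance(r.get(col), (int, float))]
        if values:
            report["columns"][col] = {
                "stats": describe(values),
                "plot": text_histogram(values),
            }
    return report


# Hypothetical input data for illustration only.
sample = [{"age": a, "site": s} for a, s in
          [(1, "A"), (2, "B"), (3, "A"), (4, "B"), (5, "A"), (None, "B")]]
result = summarise("survey", sample, ["age", "site"])
print(result["columns"]["age"]["stats"])
print(result["columns"]["age"]["plot"])
```

Columns with no numeric values (here `site`) are left out of the report; a fuller version would presumably summarise them with category counts and bar charts instead.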

G-Accad commented 1 year ago

Quarto vs R Markdown

| Aspect | Quarto | R Markdown |
| --- | --- | --- |
| Ease of Use | + User-friendly, especially for non-technical users | - Requires some familiarity with R and Markdown syntax |
| | + Simplified YAML configuration | |
| | + Built-in support for Pandoc templates | |
| Document Structure | + Flexible structure with notebooks, reports, and documents | - Standard Markdown structure with YAML header |
| | | - Less flexibility in structuring documents |
| | + Notebook-style interactivity | - Limited interactivity |
| Interactivity | + Interactive code chunks | + Supports interactive code chunks (with R) |
| | + Data visualization with JavaScript | - Limited interactivity with other languages |
| Output Formats | + Multiple output formats (HTML, PDF, Word) | + Supports various output formats |
| | + Customizable templates | - Templates can be customized |
| Extensibility and Ecosystem | + Integration with the Quarto ecosystem | + Established R Markdown ecosystem with numerous packages |
| | + Growing community support | |
| Learning Curve | + Shorter learning curve for beginners | - Steeper learning curve for non-R users |
| | + Easier for non-programmers | - More programming knowledge required |