test the pipelines data

OriHoch commented 1 year ago

We generate a lot of data, but is it any good? We need to test it.

yanirmr commented 1 year ago

Hello @OriHoch,

I hope you're doing well! I came across this issue and I'd like to contribute by helping to refine the task and make it more feasible. To better understand the scope and specific requirements of this task, I have a few questions that will help us determine the best approach:

Data Sources: What are the primary data sources being used in the pipeline? Are there any specific data formats or structures we should be aware of?
Data Quality Metrics: What criteria should we use to evaluate the quality of the data? For example, are there certain aspects such as completeness, accuracy, consistency, or timeliness that are more important for this project?
Data Validation Techniques: Are there any specific validation techniques or tools you'd like us to use for testing the quality of the data? Or are we free to explore and suggest our own approaches?
Sample Data: Could you provide a sample dataset or a subset of the data generated by the pipeline for us to better understand the nature of the data and perform some preliminary tests?
Test Plan: Do you have any guidelines or suggestions for creating a test plan that covers the various aspects of data quality evaluation? This will help ensure that we are thorough in our testing efforts.
Expected Outcomes: What would you consider a successful outcome for this task? Are there any specific improvements or insights?

Lastly, I'd like to suggest breaking down this task into several subtasks, which can help make the process more manageable and allow for better tracking of progress. Here are a few examples of subtasks:

Subtask 1: Identify data quality dimensions (e.g., completeness, accuracy, consistency, timeliness) and define evaluation criteria for each dimension.
Subtask 2: Review the current pipeline's architecture, data sources, and data transformations to identify potential areas of improvement.
Subtask 3: Develop and execute test plans for each data quality dimension using the chosen evaluation criteria.
Subtask 4: Analyze test results and compile a report outlining the data quality issues identified, along with recommendations for improving the pipeline's data generation process.
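
As a rough illustration of what Subtasks 1 and 3 could look like in practice, here is a minimal sketch of two simple checks: per-column completeness and duplicate keys. It assumes the pipeline output is available as CSV. The column names (CommitteeID, Name, StartDate) and the inline sample rows are hypothetical placeholders, not the actual schema.

```python
import csv
import io
from collections import Counter


def completeness(rows, columns):
    """Return the fraction of rows with a non-empty value, per column."""
    total = len(rows)
    result = {}
    for col in columns:
        filled = sum(1 for r in rows if (r.get(col) or "").strip())
        result[col] = filled / total if total else 0.0
    return result


def duplicate_keys(rows, key):
    """Return the key values that appear in more than one row."""
    counts = Counter(r.get(key, "") for r in rows)
    return sorted(k for k, n in counts.items() if n > 1)


# Hypothetical sample; real data would come from the published CSV files.
SAMPLE = """CommitteeID,Name,StartDate
1,Finance,2015-05-01
2,,2015-05-01
2,Education,
3,Health,2019-04-30
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
report = completeness(rows, ["CommitteeID", "Name", "StartDate"])
dups = duplicate_keys(rows, "CommitteeID")

print(report)  # share of filled values per column
print(dups)    # key values that are not unique
```

Thresholds (for example, how much missing data is acceptable per column) would still need to be agreed on for each table.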

Please let me know if these questions and suggestions resonate with your vision for the task or if you have any additional thoughts or concerns. I look forward to your response and working together to improve the quality of the data generated by this pipeline.

OriHoch commented 1 year ago

Data Sources: What are the primary data sources being used in the pipeline? Are there any specific data formats or structures we should be aware of?

Everything is defined in the pipeline YAMLs; most of the data comes from the Knesset APIs. For example, see this yaml: its first pipeline, kns_committee, defines the fields and the source API. The data sources are explained in detail here, specifically in this document.

Data Quality Metrics: What criteria should we use to evaluate the quality of the data? For example, are there certain aspects such as completeness, accuracy, consistency, or timeliness that are more important for this project?

We haven't defined any specific metrics, but all of those are important. The most important, I guess, is accuracy.

Data Validation Techniques: Are there any specific validation techniques or tools you'd like us to use for testing the quality of the data? Or are we free to explore and suggest our own approaches?

You are free to explore and suggest.

Sample Data: Could you provide a sample dataset or a subset of the data generated by the pipeline for us to better understand the nature of the data and perform some preliminary tests?

All the data is public and available in SQL via Redash and as CSV files. How to access it is described on the website homepage: https://oknesset.org/

Test Plan: Do you have any guidelines or suggestions for creating a test plan that covers the various aspects of data quality evaluation? This will help ensure that we are thorough in our testing efforts.

We don't have any definitions, but you can see which data is more interesting / important by looking at the user surveys and site specs which are linked to in this issue. Also, it's worth talking with Assaf Shapira, who has some ideas about what to do with the data and how to analyze it.

Expected Outcomes: What would you consider a successful outcome for this task? Are there any specific improvements or insights?

A successful outcome would be knowing that some part of the data (e.g. committees data) is accurate and complete, or having bugs opened for the parts that are not.
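
One possible way to check "accurate and complete" for a single table is to compare the pipeline output against records fetched from the source, matched by key. The sketch below only illustrates that idea: the function name, the field names, and the sample records are all hypothetical, and fetching from the Knesset API or from the CSV files is left out.

```python
def compare_datasets(source_rows, output_rows, key):
    """Compare two lists of dict records by key.

    Returns (missing, extra, mismatched):
      missing    - keys present in the source but not in the output
      extra      - keys present in the output but not in the source
      mismatched - {key: {field: (source_value, output_value)}}
    """
    source = {r[key]: r for r in source_rows}
    output = {r[key]: r for r in output_rows}

    missing = sorted(source.keys() - output.keys())
    extra = sorted(output.keys() - source.keys())

    mismatched = {}
    for k in source.keys() & output.keys():
        diffs = {
            field: (value, output[k].get(field))
            for field, value in source[k].items()
            if value != output[k].get(field)
        }
        if diffs:
            mismatched[k] = diffs
    return missing, extra, mismatched


# Hypothetical records standing in for source API data and pipeline output.
source_rows = [
    {"CommitteeID": 1, "Name": "Finance"},
    {"CommitteeID": 2, "Name": "Education"},
    {"CommitteeID": 3, "Name": "Health"},
]
output_rows = [
    {"CommitteeID": 1, "Name": "Finance"},
    {"CommitteeID": 2, "Name": "Educaton"},
    {"CommitteeID": 4, "Name": "Defense"},
]

missing, extra, mismatched = compare_datasets(source_rows, output_rows, "CommitteeID")
print(missing, extra, mismatched)
```

Each non-empty result could then be turned into a bug report for the relevant table.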

Lastly, I'd like to suggest breaking down this task into several subtasks, which can help make the process more manageable and allow for better tracking of progress.

I invited you to the organization, so you should have permission to open issues. Feel free to open issues for the subtasks.

Please let me know if these questions and suggestions resonate with your vision for the task or if you have any additional thoughts or concerns. I look forward to your response and working together to improve the quality of the data generated by this pipeline.

Sounds good. It would be really useful to have someone define and apply methodologies that ensure the quality of our data!

OriHoch commented 1 year ago

I'm assigning the issue to you. That doesn't mean you necessarily have to implement everything, but I think it would be good if you could centralize the efforts for it and direct other developers who might want to help.

hasadna / knesset-data-pipelines

test the pipelines data #206