hasadna / knesset-data-pipelines

Main repository for Open Knesset project - contains the knesset data scrapers and processing pipelines
https://oknesset.org/
MIT License
13 stars 26 forks

test the pipelines data #206

Open OriHoch opened 1 year ago

OriHoch commented 1 year ago

We generate a lot of data, but is the data even any good? We need to test it.

yanirmr commented 1 year ago

Hello @OriHoch,

I hope you're doing well! I came across this issue and I'd like to contribute by helping to refine the task and make it more feasible. To better understand the scope and specific requirements of this task, I have a few questions that will help us determine the best approach:

  1. Data Sources: What are the primary data sources being used in the pipeline? Are there any specific data formats or structures we should be aware of?
  2. Data Quality Metrics: What criteria should we use to evaluate the quality of the data? For example, are there certain aspects such as completeness, accuracy, consistency, or timeliness that are more important for this project?
  3. Data Validation Techniques: Are there any specific validation techniques or tools you'd like us to use for testing the quality of the data? Or are we free to explore and suggest our own approaches?
  4. Sample Data: Could you provide a sample dataset or a subset of the data generated by the pipeline for us to better understand the nature of the data and perform some preliminary tests?
  5. Test Plan: Do you have any guidelines or suggestions for creating a test plan that covers the various aspects of data quality evaluation? This will help ensure that we are thorough in our testing efforts.
  6. Expected Outcomes: What would you consider a successful outcome for this task? Are there any specific improvements or insights?

Lastly, I'd like to suggest breaking down this task into several subtasks, which can help make the process more manageable and allow for better tracking of progress. Here are a few examples of subtasks:

Please let me know if these questions and suggestions resonate with your vision for the task or if you have any additional thoughts or concerns. I look forward to your response and working together to improve the quality of the data generated by this pipeline.

OriHoch commented 1 year ago

  1. Data Sources: What are the primary data sources being used in the pipeline? Are there any specific data formats or structures we should be aware of?

Everything is defined in the pipeline YAMLs; most of the data comes from the Knesset APIs. For example, this yaml. In it you will see the first pipeline, kns_committee; its fields and source API are defined in that YAML. The data sources are explained in detail here, specifically in this document.

  2. Data Quality Metrics: What criteria should we use to evaluate the quality of the data? For example, are there certain aspects such as completeness, accuracy, consistency, or timeliness that are more important for this project?

We haven't defined any specific metrics, but all of those are important. The most important, I guess, is accuracy.
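As a starting point for discussion, here is a minimal sketch of how two of the criteria mentioned above (completeness and consistency) could be measured on a CSV export. It is only an illustration: the sample rows and the column names `CommitteeID`, `Name` and `KnessetNum` are assumptions, and the helper functions `completeness` and `duplicate_keys` are hypothetical, not part of the pipelines.

```python
import csv
import io
from collections import Counter


def completeness(rows, columns):
    """Return the fraction of non-empty values for each column."""
    total = len(rows)
    result = {}
    for col in columns:
        filled = sum(1 for row in rows if (row.get(col) or "").strip())
        result[col] = filled / total if total else 0.0
    return result


def duplicate_keys(rows, key):
    """Return the key values that appear in more than one row."""
    counts = Counter(row[key] for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)


# Hypothetical sample; real data would come from the published CSV files.
SAMPLE_CSV = """CommitteeID,Name,KnessetNum
1,Finance,20
2,,20
2,Education,
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
report = completeness(rows, ["CommitteeID", "Name", "KnessetNum"])
dupes = duplicate_keys(rows, "CommitteeID")

print(report)  # share of filled values per column
print(dupes)   # IDs that occur more than once
```

A check like this says nothing about accuracy, only about missing values and repeated keys, so it would complement rather than replace a comparison against the source.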

  3. Data Validation Techniques: Are there any specific validation techniques or tools you'd like us to use for testing the quality of the data? Or are we free to explore and suggest our own approaches?

You are free to explore and suggest.

  4. Sample Data: Could you provide a sample dataset or a subset of the data generated by the pipeline for us to better understand the nature of the data and perform some preliminary tests?

All the data is public and available in SQL via Redash and as CSV files. How to access it is described on the website homepage - https://oknesset.org/

  5. Test Plan: Do you have any guidelines or suggestions for creating a test plan that covers the various aspects of data quality evaluation? This will help ensure that we are thorough in our testing efforts.

We don't have any definitions, but you can see which data is more interesting / important by looking at the user surveys and site specs which are linked to in this issue. Also, it's worth talking with Assaf Shapira, who has some ideas regarding what to do with the data and how to analyze it.

  6. Expected Outcomes: What would you consider a successful outcome for this task? Are there any specific improvements or insights?

A successful outcome would be knowing that some part of the data (e.g. committees data) is accurate and complete, or opening bugs for the problems you find in the data.
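One possible way to get at accuracy and completeness for a single table is to compare the pipeline output with a trusted sample taken from the source. The sketch below assumes both sides are already loaded as lists of dicts; the function `compare_records`, the sample records and the field names are hypothetical and only illustrate the idea.

```python
def compare_records(pipeline_rows, source_rows, key, fields):
    """Compare pipeline output with source records and list the differences."""
    source_by_key = {row[key]: row for row in source_rows}
    pipeline_by_key = {row[key]: row for row in pipeline_rows}
    findings = []

    # Completeness: records that exist in the source but not in the pipeline.
    for k in sorted(source_by_key.keys() - pipeline_by_key.keys()):
        findings.append(f"missing in pipeline: {key}={k}")

    # Records the pipeline has that the source sample does not contain.
    for k in sorted(pipeline_by_key.keys() - source_by_key.keys()):
        findings.append(f"not in source: {key}={k}")

    # Accuracy: field-by-field comparison of records present on both sides.
    for k in sorted(source_by_key.keys() & pipeline_by_key.keys()):
        for field in fields:
            expected = source_by_key[k].get(field)
            actual = pipeline_by_key[k].get(field)
            if expected != actual:
                findings.append(
                    f"{key}={k}: {field} is {actual!r}, source has {expected!r}"
                )
    return findings


# Hypothetical records, not real Knesset data.
source = [
    {"CommitteeID": 1, "Name": "Finance"},
    {"CommitteeID": 2, "Name": "Education"},
    {"CommitteeID": 3, "Name": "Health"},
]
pipeline = [
    {"CommitteeID": 1, "Name": "Finance"},
    {"CommitteeID": 2, "Name": "Educaton"},
]

findings = compare_records(pipeline, source, "CommitteeID", ["Name"])
for line in findings:
    print(line)
```

An empty list of findings would support the claim that the sampled part of the data is accurate and complete, and each non-empty finding could become a bug report.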

Lastly, I'd like to suggest breaking down this task into several subtasks, which can help make the process more manageable and allow for better tracking of progress.

I invited you to the organization, so you should have permissions to open issues. Feel free to open issues for subtasks.

Please let me know if these questions and suggestions resonate with your vision for the task or if you have any additional thoughts or concerns. I look forward to your response and working together to improve the quality of the data generated by this pipeline.

Sounds good, it would be really useful to have someone define and apply methodologies which will ensure the quality of our data!

OriHoch commented 1 year ago

I'm assigning the issue to you. That doesn't mean you necessarily have to implement everything, but I think it would be good if you could centralize the efforts and direct other developers who might want to help.