neokd / DataStorehouse

DataStorehouse is an open-source project that aims to create a collaborative platform for gathering and sharing a wide variety of datasets. It provides a centralised repository where individuals and organisations can contribute, discover, and collaborate on diverse datasets across various domains.
https://datash.vercel.app
MIT License

Create a python script to determine the performance and accuracy of the datasets #31

Closed: Priyamakeshwari closed this issue 1 year ago

rtiop commented 1 year ago

What exactly are the requirements for this? What input should the script accept, and what output should it provide?

Gladwin001 commented 1 year ago

Currently these are the parameters we are considering:

  1. Data outliers (can use Z-score)
  2. Data bias and skew level in the dataset
  3. Data timeliness (e.g. check how recently the data was updated)
  4. Data linearity (keep this for last)

Note: for the initial phase these parameters should analyse the JSON and CSV file formats; later we can expand.

Your contributions and suggestions are most welcome.

rtiop commented 1 year ago

OK. So given a file, said parameters should be evaluated for each numerical column, that's it? I have a few questions about how to implement this:

I will try to implement this, but it's a complex problem and I'm afraid I might not have the skills, so please excuse me if I can't find a solution and end up throwing in the towel.

Gladwin001 commented 1 year ago

Thanks for your interest. Yes, for a given file, if a column is found to be numeric then these parameters should be evaluated.

  1. For outliers, initially you can try a Z-score threshold of 3.
  2. I have also been working on the logic for this. Since the bias and what counts as bias change from project to project, it's hard to derive a common solution; I will keep working on it and let you know, bro.
  3. Timeliness here refers to how recent the data is. For example, if a dataset contains data only up to 2020, then being able to tell whether the data is relatively old or new makes it easier for a user to choose what best suits their project.
  4. Your idea of using the last-modified date is also nice; users can see how regularly a dataset is updated through it.
  5. I'm not very confident about this one, but in some ML projects the linearity of the data is checked to find the right model to fit.

You can try as much as you can, bro, there is no problem; we can collaborate and work on it. I hope this gives you some clarity.

rtiop commented 1 year ago

Thank you for your help @Gladwin001 . I've completed a script that finds the outliers in the data. It reads a file, finds its numerical columns, calculates the mean and standard deviation of each column to compute Z-scores, and then finds the outliers. The outliers are put into a list that is printed nicely in the terminal.
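In rough terms, the approach looks like the sketch below (simplified, using pandas; the code in my branch is more complete):

```python
# Simplified sketch of the approach, not the exact code in my branch:
# read a file, take its numeric columns, compute Z-scores, and flag
# values whose absolute Z-score exceeds a threshold of 3.
import pandas as pd

def find_outliers(path: str, threshold: float = 3.0) -> dict:
    df = pd.read_csv(path) if path.endswith(".csv") else pd.read_json(path)
    outliers = {}
    for column in df.select_dtypes(include="number").columns:
        series = df[column].dropna()
        std = series.std()
        if std == 0:
            continue  # constant column, nothing to flag
        z_scores = (series - series.mean()) / std
        outliers[column] = series[z_scores.abs() > threshold].tolist()
    return outliers

if __name__ == "__main__":
    # "example.csv" is just a placeholder input file for illustration.
    for column, values in find_outliers("example.csv").items():
        print(f"{column}: {values}")
```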

Right now, the code lives on the fork I created this morning, in a separate branch as required by the README for this project. It only implements the first requirement. Should I open a pull request now, or should I wait and try to implement the other features before creating it?

To be honest, I think the first requirement was probably the easiest and the other ones will require much more cleverness and resourcefulness. I'm now going to try to find a way to solve another one, but I don’t guarantee I'll find anything. Have you made any progress?

neokd commented 1 year ago

You can implement one feature and then make a pull request.

neokd commented 1 year ago

@rtiop are you working on this issue?

rtiop commented 1 year ago

How should it be implemented? I haven't made any progress on this since I implemented find_outliers.py. The issue is that I cannot think of a way to evaluate the four parameters @Gladwin001 mentioned when I first started working on this. Here are a few problems we face:

It's a complicated issue. But could you tell me how all of this will be used later? Knowing exactly what is needed would help me understand what we are trying to achieve.

Gladwin001 commented 1 year ago

Once again, thanks for your contribution via find_outliers.py. Here are some answers to your questions:

  1. Data outliers: implemented through a Z-score and threshold value.
  2. Data bias and skew level: I am working on it. I think we can find it with the quartile range concept (see the sketch after this list), and I will try to implement it in the coming days.
  3. Data timeliness: there are multiple ways to find timeliness, and as you mentioned each dataset is structured differently, so try to find alternative ways.
  4. Data linearity: currently I have not focused on this parameter; in the future, if it doesn't have a proper use case, we can discard it.
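For point 2, a rough sketch of what the quartile range concept could look like, using Bowley's quartile skewness coefficient (this is only an idea, not final code):

```python
# Rough sketch for point 2 (only an idea, not final code): Bowley's
# quartile skewness coefficient, computed for every numeric column.
import pandas as pd

def quartile_skewness(series: pd.Series) -> float:
    q1, q2, q3 = series.quantile([0.25, 0.5, 0.75])
    if q3 == q1:
        return 0.0  # no spread between the quartiles, treat as symmetric
    # Result ranges from -1 (left-skewed) to +1 (right-skewed).
    return (q3 + q1 - 2 * q2) / (q3 - q1)

def skew_report(df: pd.DataFrame) -> dict:
    return {
        column: quartile_skewness(df[column].dropna())
        for column in df.select_dtypes(include="number").columns
    }
```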

Our exact need is this: if a developer is searching for a dataset, we would like to give them as many useful parameters as possible to help them choose the right dataset for their project. You can also suggest any other useful parameters to us.

rtiop commented 1 year ago

The issue with data timeliness

@Gladwin001 I've taken a look at most of the datasets, and most of them do not have a timestamp or any other information that could tell us about their timeliness. One way to address this would be to manually check all the datasets and match each of them to a date, but that is impractical and not a good idea. Therefore I cannot write a Python script that consistently finds data timeliness; the underlying issue is that we do not have the metadata to solve this.
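For the few datasets that do carry a date-like column, a heuristic along these lines could report how recent the data is, but it fails for everything else (the column-name hints below are only guesses):

```python
# Heuristic sketch: only helps when a dataset happens to include a
# date-like column; the name hints below are guesses, not a standard.
import pandas as pd

DATE_HINTS = ("date", "timestamp", "updated")

def latest_date(df: pd.DataFrame):
    candidates = [c for c in df.columns if any(h in c.lower() for h in DATE_HINTS)]
    for column in candidates:
        parsed = pd.to_datetime(df[column], errors="coerce")
        if parsed.notna().any():
            return parsed.max()  # most recent date found in this column
    return None  # no usable date information in this dataset
```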

Metadata

One way to address this would be for every dataset to have an accompanying JSON or .txt file with useful metadata about it (such as its origin, a timestamp, and a description). Implementing this would require a lot of work, but it would let us give all of that information to developers. I still think it should be done: it would be a worthwhile investment for DataStorehouse, albeit a costly one.
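To make the idea concrete, each dataset could ship with a small JSON metadata file; the field names below are just a suggestion, nothing is decided:

```python
# Possible shape for an accompanying metadata file; the field names are
# only an illustration, nothing has been decided yet.
import json

example_metadata = {
    "name": "example dataset",
    "origin": "where the data was collected from",
    "last_updated": "2023-06-01",
    "description": "what the dataset contains and how it was gathered",
}

def load_metadata(path: str) -> dict:
    # An analysis script could read this file alongside the dataset and
    # report timeliness directly from the "last_updated" field.
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```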

At a lower level, it would require us to find this information for the already existing files and ask contributors to provide it for new files, so it will be hard to implement and will take a lot of hours.

What I've done

Given the aim of giving developers as many useful parameters as possible, I've refactored my code to make it easier to add new analysis functions. I'm going to create a pull request so you can all see the details, but broadly it makes it easier to plug data-finding functions into a unified script (analysis.py).
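To give a rough idea of the new shape (simplified here; the pull request will have the actual code): every check is a small function registered in one place, so adding a new parameter just means adding a function.

```python
# Rough idea of the refactor (simplified, not the exact code):
# every analysis is a small function, and analysis.py loops over a
# registry, so adding a new parameter means adding one function.
import pandas as pd

def find_outliers(df: pd.DataFrame) -> dict:
    ...  # Z-score based outlier detection (already implemented)

def skew_report(df: pd.DataFrame) -> dict:
    ...  # bias / skew level, once the logic is settled

ANALYSES = {
    "outliers": find_outliers,
    "skew": skew_report,
}

def analyse(path: str) -> dict:
    df = pd.read_csv(path) if path.endswith(".csv") else pd.read_json(path)
    return {name: func(df) for name, func in ANALYSES.items()}
```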

Note: I won't be able to work on this repository for a few weeks starting Thursday since I'll be on vacation. Thank you for your understanding.

Summary

I can't determine data timeliness without metadata, so I propose adding metadata files to the datasets. In the meantime, I've refactored my code into a unified analysis.py and will open a pull request; after that I'll be away for a few weeks.

neokd commented 1 year ago

Thanks for your contribution @rtiop