Closed — Priyamakeshwari closed this issue 1 year ago
Currently, these are the parameters we are considering:

1. Data outliers (can use Z-score)
2. Data bias and skew level in the dataset
3. Data timeliness (i.e. check how up to date the data is, e.g. recent data)
4. Data linearity (*keep this for last)

Note: for the initial phase, these parameters should be analysed for the JSON and CSV file formats; later we can expand.
Your contributions and suggestions are most welcome.
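To illustrate parameter 2, here is a minimal sketch of how the skew level of a numeric column could be computed (Fisher–Pearson moment coefficient; the function name and the pure-Python approach are my assumptions, not an agreed design):

```python
import statistics

def skew_level(values):
    """Sample skewness of a numeric column: positive means a right
    (high-value) tail, negative a left tail, ~0 roughly symmetric."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population std, as in the moment formula
    if std == 0:
        return 0.0  # constant column: no skew
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)
```

The same per-column loop could feed each of the other parameters as they get implemented.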
Ok. So given a file, said parameters should be evaluated for each numerical column, that's it? I have some questions about how to find a solution to implementing this:
I will try to implement this, but it's a complex problem and I'm afraid I might not have the skills, so please excuse me if I can't find a solution and I end up throwing the towel.
Thanks for your interest. Yes, for a given file, if a column is found to be numeric then these parameters should be evaluated for it.
You can try as much as you can, bro, there is no problem; we can collaborate and work on it. I hope this gives you some clarity.
Thank you for your help @Gladwin001. I've completed a script that finds the outliers in the data. It reads a file, finds its numerical columns, calculates the mean and standard deviation of each numeric column to compute Z-scores, and then flags the outliers. The outliers are collected into a list that is then nicely printed on the terminal.
Right now, the code lives on the fork I created this morning, in a separate branch as required by the README for this project. It only implements the first requirement, though. Should I open a pull request now, or should I delay it and try to implement the other features first?
To be honest, I think the first requirement was probably the easiest and the other ones will require much more cleverness and resourcefulness. I'm now going to try to find a way to solve another one, but I don’t guarantee I'll find anything. Have you made any progress?
You can implement one feature and then make a pull request.
@rtiop are you working on this issue?
How should it be implemented? I haven't made any progress on this since I implemented find_outliers.py. The issue is that I cannot think of a way to evaluate the four parameters @Gladwin001 mentioned when I first started working on this. Here are a few problems we face:
It's a complicated issue. But could I know how all of this will be used later? Knowing what exactly is needed would help me know what it is that we are trying to achieve.
Once again, thanks for your contribution via find_outliers.py. Here are some answers to your questions:
Our exact need is this: if a developer is searching for a dataset, we would like to give them as many useful parameters as possible to help them choose the right dataset for their project. You can also suggest any other useful parameters to us.
@Gladwin001 I've taken a look at most of the data sets, and most of them do not have any timestamp or any information that could tell us more about their timeliness. One way to address this would be to manually check all data sets and match each of them to a date, but that's impractical and not a good idea. Therefore I cannot write a Python script that consistently determines data timeliness. The underlying issue is that we do not have the metadata to solve this.
A better way forward would be for every data set to ship with an accompanying CSV/JSON or .txt file containing useful metadata (its origin, a timestamp, a description, etc.). Implementing this would require a lot of work, but it would let us pass all of this information on to developers. I think it would be a worthwhile investment for DataStorehouse, albeit a costly one.
Concretely, it would require us to find this information for all existing files and ask contributors to provide it for new files, so it will be hard to implement and will take a lot of hours.
Given the aim of giving developers as many useful parameters as possible, I've refactored my code to make it easier to add new data-quality functions. I'm going to create a pull request so you can all see the details, but broadly it consolidates the data-finding functions into a unified script (analysis.py).
Note: I won't be able to work on this repository for some weeks starting Thursday, since I'll be on vacation. Thank you for your understanding.
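To make the refactor concrete, here is a sketch of the kind of structure I mean (the names are illustrative, not the actual analysis.py API): each data-quality check is a plain function over a column's values, registered in one dict, so adding a new parameter is a one-line change.

```python
import statistics

def outliers(values, threshold=3.0):
    """Z-score outliers, as in find_outliers.py."""
    mean, std = statistics.mean(values), statistics.stdev(values)
    return [v for v in values if std and abs((v - mean) / std) > threshold]

def skew(values):
    """Moment-based skew level of the column."""
    mean, std = statistics.mean(values), statistics.pstdev(values)
    if std == 0:
        return 0.0
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)

# Registry: to support a new parameter, write a function and add an entry.
CHECKS = {"outliers": outliers, "skew": skew}

def analyse_column(values):
    """Run every registered check on one numeric column."""
    return {name: fn(values) for name, fn in CHECKS.items()}
```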
Thanks for your contribution @rtiop
What exactly are the requirements for this? What input should the script accept, and what output should it provide?