Identify missing csv data preprocessing methods and implement the methods with a test case | Generic Issue - Not to be assigned

klarEDA / klar-EDA

A python library for automated exploratory data analysis

https://klareda.github.io/klar-EDA/

MIT License


Identify missing csv data preprocessing methods and implement the methods with a test case | Generic Issue - Not to be assigned #12

Open Ask149 opened 3 years ago

Ask149 commented 3 years ago

Description

Identify the missing methods of CSV data preprocessing in this repository.
Find suitable cases of data and machine learning problems for which the method should be used.
Implement the method only for those cases.

Please note: this is a generic issue, and multiple students can work on it. Notify the mentors once you identify a method (as described above). A mentor will then create a separate issue and assign it to you.

Contribution guidelines will be updated soon. Please refer to them for guidance before committing any development work.

rubyruins commented 3 years ago

Hi, I would like to work on this for GSSOC. Could you give an example of what kinds of missing data visualization methods you are looking to implement?

Ask149 commented 3 years ago

Hi @rubyruins, thank you for your interest. If you take a look at csv_preprocess.py, there are methods such as filling numerical NA values, normalizing numerical columns, label-encoding categorical columns, etc. Identify whether there are any useful methods that we have missed and that should be included. One I can think of for now: identify the format of a date column, then extract the month, day of the week, date, year, etc. from it and append them to the column list.
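To make the date idea concrete, here is a minimal sketch using only the standard library. The helper names `detect_date_format` and `expand_date_column`, the candidate format list, and the list-of-dicts row layout are all assumptions for illustration; none of them are existing klar-EDA functions. It tries each candidate format against every value in the column, then appends the derived parts as new fields.

```python
from datetime import datetime

# Candidate formats to try, in order. This list is an assumption; ambiguous
# dates such as 01/02/2021 resolve to the first format that fits all values.
CANDIDATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%m/%d/%Y", "%d-%b-%Y")


def detect_date_format(values, formats=CANDIDATE_FORMATS):
    """Return the first format that parses every value, or None."""
    for fmt in formats:
        try:
            for value in values:
                datetime.strptime(value, fmt)
        except ValueError:
            continue  # this format failed on some value; try the next one
        return fmt
    return None


def expand_date_column(rows, column):
    """Return copies of `rows` with year, month, day and weekday fields added.

    `rows` is a list of dicts, e.g. as produced by csv.DictReader.
    """
    fmt = detect_date_format([row[column] for row in rows])
    if fmt is None:
        raise ValueError(f"could not detect a date format for {column!r}")
    expanded = []
    for row in rows:
        parsed = datetime.strptime(row[column], fmt)
        new_row = dict(row)
        new_row[f"{column}_year"] = parsed.year
        new_row[f"{column}_month"] = parsed.month
        new_row[f"{column}_day"] = parsed.day
        new_row[f"{column}_weekday"] = parsed.weekday()  # Monday is 0
        expanded.append(new_row)
    return expanded


rows = [{"joined": "15/03/2021"}, {"joined": "01/12/2021"}]
expanded = expand_date_column(rows, "joined")
```

With pandas, the same idea would more likely be built on `pd.to_datetime` and the `.dt` accessor; the sketch above only shows the shape of the method.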

rubyruins commented 3 years ago

@Ask149 sounds good. Do let me know if there are any other examples you can think of. If you can create an issue for those, I can start working on them. Hopefully, we can discuss it this evening!

ashish-hacker commented 3 years ago

@Ask149 I checked out csv_preprocess.py and noticed there is only one method for scaling the features, i.e., min-max normalisation. I think it would be better to add some more scaling methods, such as mean normalisation and standardization (for Gaussian distributions), to make it more flexible. May I work on this? I am a participant in GSSOC'21.
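For reference, a rough sketch of how the three scaling methods differ, written in plain Python on a list of numbers. The function names are hypothetical and are not taken from csv_preprocess.py, and using the population standard deviation for standardization is an assumption.

```python
from statistics import fmean, pstdev


def min_max_normalize(values):
    """Rescale to [0, 1]: (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def mean_normalize(values):
    """Centre on the mean, scale by the range: (x - mean) / (max - min)."""
    mean, lo, hi = fmean(values), min(values), max(values)
    return [(v - mean) / (hi - lo) for v in values]


def standardize(values):
    """Zero mean, unit variance: (x - mean) / std (population std assumed)."""
    mean, std = fmean(values), pstdev(values)
    return [(v - mean) / std for v in values]


ages = [20.0, 30.0, 40.0]
scaled_min_max = min_max_normalize(ages)
scaled_mean = mean_normalize(ages)
scaled_std = standardize(ages)
```

All three divide by zero on a constant column, so a real implementation would need to decide how to handle that case.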

Ask149 commented 3 years ago

Thank you for your interest, @ashish-hacker! Could you please refer to issue #15 and add a similar short description before you start the implementation? That way we can discuss it beforehand and make it precise. I am assigning an issue under your name, Issue #16; use it to add the details.

ashish-hacker commented 3 years ago

Sure @Ask149 : )