sfu-db / dataprep

Open-source low code data preparation library in python. Collect, clean and visualization your data in python with a few lines of code.
http://dataprep.ai
MIT License
2.07k stars 206 forks source link

Implement clean json #915

Closed yixuy closed 2 years ago

yixuy commented 2 years ago

Description

Please include a summary of the change and which issue is fixed. Please also include relevant motivation and context. List any dependencies that are required for this change.

This PR is focusing on adding the feature for clean JSON, which makes the JSON file split into more columns for user to use.

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration

import pandas as pd
import numpy as np
df = pd.DataFrame({"Name":
                   ["Abby", "Scott", "Scott", "Scott2", np.nan, "NULL"],
                   "AGE":
                   [12, 33, 33, 56,  np.nan, "NULL"],
                   "weight__":
                   [32.5, 47.1, 47.1, 55.2, np.nan, "NULL"],
                   "Admission Date":
                   ["2020-01-01", "2020-01-15", "2020-01-15",
                    "2020-09-01", pd.NaT, "NULL"],
                   "email_address":
                   ["abby@gmail.com","scott@gmail.com", "scott@gmail.com", "test@abc.com", np.nan, "NULL"],
                   "Country of Birth":
                   ["USA","Canada", "Canada", "NULL", np.nan, "NULL"],
                  "Contact (Numbers)":
                   ["1-789-456-0123","1-123-456-7890","1-123-456-7890","86-456-123-7890", np.nan, "NULL" ],
                   "address": [
                        "123 Pine Ave.",

                        "1234 west main heights 57033",
                        "apt 1 789 s maple rd manhattan",

                        "1111 S Figueroa St, Los Angeles, CA 90015",

                        np.nan,
                        "NULL"
                    ],
                   'url': ['https://github.com/sfu-db/dataprep'
                    ,'https://github.com/sfu-db/dataprep',  
                    'https://github.com/sfu-db/dataprep' , 
                    'https://github.com/sfu-db/dataprep',
                    'https://github.com/sfu-db/dataprep',
                    'https://github.com/sfu-db/dataprep' ],
                   "messy_json": [
                  '{"name": "jane doe", "salary": 9000, "email": "jane.doe@prative.com"}',
                  '{"name": "doe", "salary":8000, "email": "hello@prative.com"}',
                  '{"name": "Doe", "salary": 200, "email": "google@prative.com"}',
                  '{"name": "Fire", "salary": 11000, "email": "jane@prative.com"}',
                  '{"name": "Sam", "salary": 4000, "email": "doe@prative.com"}',
                  '{"name": "Ethan", "salary": 2000, "email": "Research@prative.com"}'
                        ]
})
df
clean_json(df, "messy_json")

Snapshots:

image

Include snapshots for easier review.

Checklist:

codecov[bot] commented 2 years ago

Codecov Report

Merging #915 (7910ed1) into develop (a7bf820) will increase coverage by 0.03%. The diff coverage is 72.26%.

:exclamation: Current head 7910ed1 differs from pull request most recent head 4717388. Consider uploading reports for the commit 4717388 to get more accurate results

@@             Coverage Diff             @@
##           develop     #915      +/-   ##
===========================================
+ Coverage    56.01%   56.05%   +0.03%     
===========================================
  Files          314      316       +2     
  Lines        20383    20460      +77     
===========================================
+ Hits         11418    11469      +51     
- Misses        8965     8991      +26     
Impacted Files Coverage Δ
dataprep/clean/clean_url.py 97.84% <ø> (ø)
dataprep/clean/gui/clean_gui.py 29.13% <0.00%> (-0.04%) :arrow_down:
dataprep/clean/clean_json.py 30.76% <30.76%> (ø)
dataprep/eda/create_db_report/report.py 42.85% <42.85%> (ø)
dataprep/eda/create_db_report/__init__.py 70.00% <66.66%> (+6.36%) :arrow_up:
dataprep/eda/create_db_report/views/constraint.py 92.30% <80.00%> (ø)
dataprep/clean/clean_phone.py 95.69% <95.00%> (+0.46%) :arrow_up:
dataprep/clean/__init__.py 100.00% <100.00%> (ø)
dataprep/clean/utils.py 40.62% <100.00%> (+0.62%) :arrow_up:
...rep/eda/create_db_report/db_models/table_column.py 90.24% <100.00%> (ø)
... and 14 more

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update d21a3d1...4717388. Read the comment docs.

qidanrui commented 2 years ago

Can we handle nested Json like '{"name": {"first_name": "jane", "last_name": "doe"}, "salary": 9000, "email": "jane.doe@prative.com"}'

yixuy commented 2 years ago

I think it will handle this nested case the result is as follows The name will be Nan and there are more columns called first name and last name Correct me if there is anything wrong image

qidanrui commented 2 years ago

I think it will handle this nested case the result is as follows The name will be Nan and there are more columns called first name and last name Correct me if there is anything wrong image

Hmmm, first name and last name are NaNs ?

yixuy commented 2 years ago

I think it will handle this nested case the result is as follows The name will be Nan and there are more columns called first name and last name Correct me if there is anything wrong image

Hmmm, first name and last name are NaNs ?

The test cases are as follows:

image image

The case you have mentioned is an object for the name field "name": {"first_name": "jane", "last_name": "doe"} Therefore, it will be split new field called the first_name and last_name and leave NaN in the name field. By contrast, if the name field is like "name": "jane doe". It will filled in the name field rather than NaN. Please let me know if I understand incorrectly, thank you!

qidanrui commented 2 years ago

I think it will handle this nested case the result is as follows The name will be Nan and there are more columns called first name and last name Correct me if there is anything wrong image

Hmmm, first name and last name are NaNs ?

The test cases are as follows: image

image

The case you have mentioned is an object for the name field "name": {"first_name": "jane", "last_name": "doe"} Therefore, it will be split new field called the first_name and last_name and leave NaN in the name field. By contrast, if the name field is like "name": "jane doe". It will filled in the name field rather than NaN. Please let me know if I understand incorrectly, thank you!

Great! After reviewing the code, I think we can merge it~