yixuy commented 2 years ago

Description

Please include a summary of the change and which issue is fixed. Please also include relevant motivation and context. List any dependencies that are required for this change.

This PR is focusing on adding the feature for clean JSON, which makes the JSON file split into more columns for user to use.

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration

import pandas as pd
import numpy as np
df = pd.DataFrame({"Name":
                   ["Abby", "Scott", "Scott", "Scott2", np.nan, "NULL"],
                   "AGE":
                   [12, 33, 33, 56,  np.nan, "NULL"],
                   "weight__":
                   [32.5, 47.1, 47.1, 55.2, np.nan, "NULL"],
                   "Admission Date":
                   ["2020-01-01", "2020-01-15", "2020-01-15",
                    "2020-09-01", pd.NaT, "NULL"],
                   "email_address":
                   ["abby@gmail.com","scott@gmail.com", "scott@gmail.com", "test@abc.com", np.nan, "NULL"],
                   "Country of Birth":
                   ["USA","Canada", "Canada", "NULL", np.nan, "NULL"],
                  "Contact (Numbers)":
                   ["1-789-456-0123","1-123-456-7890","1-123-456-7890","86-456-123-7890", np.nan, "NULL" ],
                   "address": [
                        "123 Pine Ave.",

                        "1234 west main heights 57033",
                        "apt 1 789 s maple rd manhattan",

                        "1111 S Figueroa St, Los Angeles, CA 90015",

                        np.nan,
                        "NULL"
                    ],
                   'url': ['https://github.com/sfu-db/dataprep'
                    ,'https://github.com/sfu-db/dataprep',  
                    'https://github.com/sfu-db/dataprep' , 
                    'https://github.com/sfu-db/dataprep',
                    'https://github.com/sfu-db/dataprep',
                    'https://github.com/sfu-db/dataprep' ],
                   "messy_json": [
                  '{"name": "jane doe", "salary": 9000, "email": "jane.doe@prative.com"}',
                  '{"name": "doe", "salary":8000, "email": "hello@prative.com"}',
                  '{"name": "Doe", "salary": 200, "email": "google@prative.com"}',
                  '{"name": "Fire", "salary": 11000, "email": "jane@prative.com"}',
                  '{"name": "Sam", "salary": 4000, "email": "doe@prative.com"}',
                  '{"name": "Ethan", "salary": 2000, "email": "Research@prative.com"}'
                        ]
})
df
clean_json(df, "messy_json")

Snapshots:

Include snapshots for easier review.

Checklist:

[X] My code follows the style guidelines of this project
[X] I have already squashed the commits and made the commit message conform to the project standard.
[ ] I have already marked the commit with "BREAKING CHANGE" or "Fixes #" if needed.
[X] I have performed a self-review of my own code
[X] I have commented on my code, particularly in hard-to-understand areas
[X] I have made corresponding changes to the documentation
[X] My changes generate no new warnings
[X] I have added tests that prove my fix is effective or that my feature works
[ ] New and existing unit tests pass locally with my changes
[ ] Any dependent changes have been merged and published in downstream modules

codecov[bot] commented 2 years ago

Codecov Report

Merging #915 (7910ed1) into develop (a7bf820) will increase coverage by 0.03%. The diff coverage is 72.26%.

:exclamation: Current head 7910ed1 differs from pull request most recent head 4717388. Consider uploading reports for the commit 4717388 to get more accurate results

@@             Coverage Diff             @@
##           develop     #915      +/-   ##
===========================================
+ Coverage    56.01%   56.05%   +0.03%     
===========================================
  Files          314      316       +2     
  Lines        20383    20460      +77     
===========================================
+ Hits         11418    11469      +51     
- Misses        8965     8991      +26

Impacted Files	Coverage Δ
dataprep/clean/clean_url.py	`97.84% <ø> (ø)`
dataprep/clean/gui/clean_gui.py	`29.13% <0.00%> (-0.04%)`	:arrow_down:
dataprep/clean/clean_json.py	`30.76% <30.76%> (ø)`
dataprep/eda/create_db_report/report.py	`42.85% <42.85%> (ø)`
dataprep/eda/create_db_report/__init__.py	`70.00% <66.66%> (+6.36%)`	:arrow_up:
dataprep/eda/create_db_report/views/constraint.py	`92.30% <80.00%> (ø)`
dataprep/clean/clean_phone.py	`95.69% <95.00%> (+0.46%)`	:arrow_up:
dataprep/clean/__init__.py	`100.00% <100.00%> (ø)`
dataprep/clean/utils.py	`40.62% <100.00%> (+0.62%)`	:arrow_up:
...rep/eda/create_db_report/db_models/table_column.py	`90.24% <100.00%> (ø)`
... and 14 more

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update d21a3d1...4717388. Read the comment docs.

qidanrui commented 2 years ago

Can we handle nested Json like '{"name": {"first_name": "jane", "last_name": "doe"}, "salary": 9000, "email": "jane.doe@prative.com"}'

yixuy commented 2 years ago

I think it will handle this nested case the result is as follows The name will be Nan and there are more columns called first name and last name Correct me if there is anything wrong

qidanrui commented 2 years ago

I think it will handle this nested case the result is as follows The name will be Nan and there are more columns called first name and last name Correct me if there is anything wrong

Hmmm, first name and last name are NaNs ?

yixuy commented 2 years ago

I think it will handle this nested case the result is as follows The name will be Nan and there are more columns called first name and last name Correct me if there is anything wrong

Hmmm, first name and last name are NaNs ?

The test cases are as follows:

The case you have mentioned is an object for the name field "name": {"first_name": "jane", "last_name": "doe"} Therefore, it will be split new field called the first_name and last_name and leave NaN in the name field. By contrast, if the name field is like "name": "jane doe". It will filled in the name field rather than NaN. Please let me know if I understand incorrectly, thank you!

qidanrui commented 2 years ago

I think it will handle this nested case the result is as follows The name will be Nan and there are more columns called first name and last name Correct me if there is anything wrong

Hmmm, first name and last name are NaNs ?

The test cases are as follows:

The case you have mentioned is an object for the name field "name": {"first_name": "jane", "last_name": "doe"} Therefore, it will be split new field called the first_name and last_name and leave NaN in the name field. By contrast, if the name field is like "name": "jane doe". It will filled in the name field rather than NaN. Please let me know if I understand incorrectly, thank you!

Great! After reviewing the code, I think we can merge it~

sfu-db / dataprep

Implement clean json #915

Description

How Has This Been Tested?

Snapshots:

Checklist:

Codecov Report