nasa-petal / data-collection-and-prep

Starting with a list of URLs of papers that can be used for crowdsourcing, create a CSV file with the URL, DOI, title, and abstract of each paper, and whether the paper is open access.
The Unlicense

Auto generated data quality report #93

Open bruffridge opened 3 years ago

bruffridge commented 3 years ago

Use Great Expectations or a custom script that takes golden.json and generates a data quality report showing missing data, duplicate data, and data formatting issues. See Herb's Google Doc for an example report.
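For the custom-script option, a minimal sketch could look like the following. It assumes golden.json is a JSON array of objects. The field names `url`, `doi`, `title`, and `abstract` and the DOI pattern are illustrative guesses, not the actual schema, so adjust them to whatever the file really contains.

```python
import json
import re
from collections import Counter

# Hypothetical field names; the real golden.json schema may differ.
REQUIRED_FIELDS = ("url", "doi", "title", "abstract")
# Loose DOI shape check (assumption): "10.<registrant>/<suffix>".
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def quality_report(records):
    """Summarize missing, duplicate, and badly formatted values."""
    missing = {field: 0 for field in REQUIRED_FIELDS}
    malformed_doi_rows = []
    for index, record in enumerate(records):
        # Missing data: field absent, null, or blank.
        for field in REQUIRED_FIELDS:
            value = record.get(field)
            if value is None or not str(value).strip():
                missing[field] += 1
        # Formatting issues: DOI present but not DOI-shaped.
        doi = record.get("doi")
        if doi and not DOI_PATTERN.match(str(doi).strip()):
            malformed_doi_rows.append(index)
    # Duplicate data: the same DOI (case-insensitive) on several records.
    doi_counts = Counter(
        str(r["doi"]).strip().lower() for r in records if r.get("doi")
    )
    duplicate_dois = sorted(d for d, n in doi_counts.items() if n > 1)
    return {
        "total_records": len(records),
        "missing": missing,
        "duplicate_dois": duplicate_dois,
        "malformed_doi_rows": malformed_doi_rows,
    }


def report_from_file(path):
    """Load a JSON array of records from disk and build the report."""
    with open(path, encoding="utf-8") as handle:
        return quality_report(json.load(handle))
```

Calling `report_from_file("golden.json")` returns a plain dict, which can be dumped with `json.dumps(..., indent=2)` or rendered into whatever report format is preferred.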

bruffridge commented 3 years ago

[Image: Screen Shot 2021-07-20 at 4 03 23 PM]

dsmith111 commented 2 years ago

Resolved with the deployment of the PeTaL Labeler pipeline repo: https://github.com/nasa-petal/petal-labeler-data-pipeline

However, the image shows the inclusion of a JSON diff report, which the pipeline does not produce yet. I've listed this under future work in the README of the above repository. @bruffridge I'd still say this issue could be closed, with a separate task for adding a JSON diff stage to the pipeline (or just edit #94).
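As a rough idea of what such a JSON diff stage might do, here is a small recursive comparison of two parsed JSON documents. This is only a sketch and not part of the pipeline repo. The function name `json_diff` and the output tuple layout are made up for illustration.

```python
def json_diff(old, new, path=""):
    """Return a list of (path, kind, old_value, new_value) changes."""
    changes = []
    if isinstance(old, dict) and isinstance(new, dict):
        # Compare objects key by key.
        for key in sorted(set(old) | set(new)):
            child = f"{path}/{key}"
            if key not in new:
                changes.append((child, "removed", old[key], None))
            elif key not in old:
                changes.append((child, "added", None, new[key]))
            else:
                changes.extend(json_diff(old[key], new[key], child))
    elif isinstance(old, list) and isinstance(new, list):
        # Compare arrays position by position (no reordering detection).
        for i in range(max(len(old), len(new))):
            child = f"{path}/{i}"
            if i >= len(new):
                changes.append((child, "removed", old[i], None))
            elif i >= len(old):
                changes.append((child, "added", None, new[i]))
            else:
                changes.extend(json_diff(old[i], new[i], child))
    elif old != new:
        # Scalars, or values whose types differ.
        changes.append((path or "/", "changed", old, new))
    return changes
```

Lists are compared by index, so an item inserted in the middle shows up as a run of changes. Keying records by DOI before diffing would probably give a more readable report.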