openclimatefix / .github

Various Community Health Files

Research Python tools for validating data #47

Open JackKelly opened 2 months ago

JackKelly commented 2 months ago

We'd like the code to automatically check that batches are of the correct shape, with the correct coordinates, and with the appropriate number of NaNs, zeros, etc. Historically, we have used pydantic. But we'd like to do some research into alternative tools like attrs.
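As a rough illustration of the kind of check described above, here is a minimal sketch using only the Python standard library. It is not OCF's code and does not use pydantic, attrs, or xarray. The function name `validate_batch`, the dimension names `time` and `site`, and the `max_nan_fraction` parameter are all assumptions made for the example, and a nested list stands in for a real array.

```python
import math


def validate_batch(values, coords, expected_shape, max_nan_fraction=0.0):
    """Return a list of problems found in a 2-D batch (empty list means valid).

    values: list of rows, each a list of floats (hypothetical stand-in for an array).
    coords: dict with "time" and "site" coordinate lists (hypothetical dimension names).
    """
    problems = []

    # Shape check: number of rows, and every row has the expected length.
    shape = (len(values), len(values[0]) if values else 0)
    if shape != expected_shape or any(len(row) != shape[1] for row in values):
        problems.append(f"shape {shape} != expected {expected_shape}")

    # Coordinate check: each coordinate must match the length of its dimension.
    for axis, name in enumerate(("time", "site")):
        if len(coords.get(name, [])) != expected_shape[axis]:
            problems.append(f"coordinate '{name}' has wrong length")

    # NaN check: the fraction of NaNs must not exceed the allowed maximum.
    flat = [v for row in values for v in row]
    n_nan = sum(math.isnan(v) for v in flat)
    if flat and n_nan / len(flat) > max_nan_fraction:
        problems.append(f"{n_nan} NaNs exceed allowed fraction {max_nan_fraction}")

    return problems


# A valid batch, and one with a short "time" coordinate plus an unexpected NaN.
good = validate_batch([[1.0, 2.0], [3.0, 4.0]], {"time": [0, 1], "site": ["a", "b"]}, (2, 2))
bad = validate_batch([[1.0, float("nan")], [3.0, 4.0]], {"time": [0], "site": ["a", "b"]}, (2, 2))
```

Returning a list of problems rather than raising on the first one is a design choice for the sketch: it lets a caller report every issue with a batch at once.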

We decided this would be a good idea in OCF's internal "Data Engineering Big Ideas" meeting on 5th March 2024.

e.g. mypy / pydantic / attrs

Related:

bikramb98 commented 2 months ago

@JackKelly Happy to take this on and get back to you with my findings. I couldn't access the Google Docs linked above. If they are meant to be public, could you please update their access settings?

JackKelly commented 2 months ago

That's very kind, thank you! The Google Docs aren't super-relevant, TBH.

Basically, what we'd like is a way to automatically check that our xarray objects have:

I'm not super-involved in this work, TBH. I just wrote up these GitHub issues.

noobjam commented 2 months ago

Hello! I'd also like to help on this.

JackKelly commented 2 months ago

Sounds good, thank you! Maybe what I'd suggest is starting a shared Google Doc with brief notes about the various data validation frameworks out there! Thanks so much! The idea is that multiple people can collaborate on the notes doc. Or, if you'd prefer, we could use the GitHub wiki in this project. Or a markdown file in this project.

Mahak-Agrawal-304 commented 2 months ago

Hello @JackKelly, I agree with your suggestion of sharing a doc with brief notes so that multiple contributors can collaborate on it. I would really like to help implement this idea.

Mahak-Agrawal-304 commented 2 months ago

To begin with, I have started a very rough document. Since it's my first time contributing, I would appreciate any constructive feedback!

JackKelly commented 2 months ago

That doc looks perfect, thank you! I think you're on the right track: list the various options, and give a brief summary of each option. Ideally, it'd be great to include a short code example showing how to define a data validation schema. Thanks so much! This is super-helpful!
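For readers wondering what "a short code example showing how to define a data validation schema" could look like, here is one possible sketch. It uses a standard-library `dataclass` as a stand-in, not pydantic or attrs themselves, so it only illustrates the general pattern of declaring expectations on a class and checking data against them. The class name `BatchSchema` and its fields are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class BatchSchema:
    """Hypothetical schema: declares what a valid batch looks like."""

    n_time: int
    n_site: int
    allow_nan: bool = False

    def validate(self, values):
        """Raise ValueError if `values` (a list of rows) violates the schema."""
        if len(values) != self.n_time:
            raise ValueError(f"expected {self.n_time} time steps, got {len(values)}")
        for row in values:
            if len(row) != self.n_site:
                raise ValueError(f"expected {self.n_site} sites per row, got {len(row)}")
            if not self.allow_nan and any(math.isnan(v) for v in row):
                raise ValueError("NaN found but allow_nan is False")
        return values


schema = BatchSchema(n_time=2, n_site=3)
checked = schema.validate([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])  # passes silently
```

A notes doc comparing frameworks could show the same two-by-three batch check written once per library, which would make the differences in syntax and error reporting easy to see side by side.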

Mahak-Agrawal-304 commented 2 months ago

Thank you for your feedback! I'll keep in mind to add code snippets from now on, and I'll add them right now.