soma-smart / Fakelake

Generate massive fake datasets for your datalake, fast. By SOMA
https://soma-smart.github.io/Fakelake/
MIT License

Add json output #43

Closed bhagenbourger closed 4 months ago

bhagenbourger commented 5 months ago

32

bhagenbourger commented 4 months ago

Hello @vianneybacoup, in this PR I put all test results into the "target" folder, except "output.*", because "output" is the default value for output_name. I did that to keep the root folder clean, to make the test results easier to clean up, and to avoid adding every extension to the .gitignore file. But since the default value is "output", I am not sure it is really useful. What do you think?

Thank you for your feedback.

vianneybacoup commented 4 months ago

I see that we both had the same issue; I ran 'rm .parquet .csv' far too often over the past months, haha. I had in mind that we could have, like in Python, a tearDown function that automatically cleans up the test output, but that does not exist in Rust, and it is not a mandatory feature; the folder is clearly enough. Maybe rename it to target/test_generated or something similar, just to avoid possible confusion with the other targets?
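For reference, Rust's built-in test harness indeed has no tearDown hook, but a similar effect can be approximated with a guard value whose `Drop` implementation deletes the file. The sketch below is only an illustration under that assumption: `TempOutput` and `write_and_cleanup` are hypothetical names, not part of Fakelake, and the cleanup relies on the guard actually being dropped (it would not run if the process aborts).

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical guard (not part of Fakelake): deletes its file when dropped.
struct TempOutput {
    path: PathBuf,
}

impl TempOutput {
    fn new(path: impl Into<PathBuf>) -> Self {
        TempOutput { path: path.into() }
    }

    fn path(&self) -> &Path {
        &self.path
    }
}

impl Drop for TempOutput {
    fn drop(&mut self) {
        // Ignore the error: the file may never have been created.
        let _ = fs::remove_file(&self.path);
    }
}

/// Example usage: the file exists only while the guard is alive.
fn write_and_cleanup(path: &Path) -> io::Result<bool> {
    let out = TempOutput::new(path);
    fs::write(out.path(), "id\n1\n")?;
    let existed = out.path().exists();
    Ok(existed)
    // `out` is dropped here, which removes the file.
}
```

Inside a test, creating the guard at the top of the function is enough: the file is removed when the function returns, including when an assertion panics and the stack unwinds.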

bhagenbourger commented 4 months ago

I moved the test outputs into the target/test_generated/ folder. I kept the "all options" tests in the target folder because they are run only by the GitHub action, and generation fails if the folder does not exist (automatically creating intermediate folders could be a good enhancement?). I added the ctor crate to add a macro that creates the target/test_generated/ folder before the tests and cleans up the test outputs after the tests.
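As a rough sketch of the "create intermediate folders" enhancement mentioned above, the standard library's `std::fs::create_dir_all` can create every missing parent directory before the output file is opened. The helper name `write_output` is hypothetical and this is not Fakelake's actual writer code, only an illustration of the idea.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Hypothetical helper (not Fakelake's API): creates any missing parent
/// directories, then writes `contents` to `path`.
fn write_output(path: &Path, contents: &[u8]) -> io::Result<()> {
    if let Some(parent) = path.parent() {
        // A bare file name such as "output.csv" has an empty parent; skip it.
        if !parent.as_os_str().is_empty() {
            // Succeeds even if the directories already exist.
            fs::create_dir_all(parent)?;
        }
    }
    let mut file = File::create(path)?;
    file.write_all(contents)
}
```

With a helper like this, a path such as target/test_generated/output.json would no longer fail just because target/test_generated/ has not been created yet.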

About the JSON output: I did a POC using arrow (https://github.com/bhagenbourger/Fakelake/tree/poc/use_arrow_for_all_format) because I found it interesting to use the same "generator" for all formats. It works, but with some limitations, such as having to use the same date format for all columns. So serde-json is definitely more flexible, and I am keeping this implementation.