iterative / dataset-registry

Dataset registry DVC project

dataset .dvc files #33

Closed SoyGema closed 1 year ago

SoyGema commented 1 year ago

The final version contains 3 datasets: train, test, and answer. There are therefore 3 .dvc files in the corresponding folder. The datasets are stored remotely.

shcheklein commented 1 year ago

let's please make a subdirectory - workshop or workshops.

let's please add a brief README into subdirectory that explains where this dataset is being used, etc.

Thanks!

SoyGema commented 1 year ago

Q (just to make sure) - do we need to have 3 files, or is it easier to do a single directory?

Hey @shcheklein. That's a substantial question.

In general, when separate datasets are provided, it is because the patterns or statistical distributions of the training and test data can differ in ways that matter for testing different aspects of a model, such as anomaly detection, overfitting, and reaction to data drift. It can also ensure that the number of samples for each entity (in this case, satellites) is tested.

Please find this super quick exploration of how merging affects, at least, the train and test sizes.

The train_test_split sklearn function doesn't take this into account. If this criterion is negligible in your opinion, let's merge them. Let me know your thoughts, and thanks!
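To make the concern above concrete, here is a minimal sketch using only the standard library. It assumes toy data (the satellite IDs, row counts, and the `random_resplit` helper are hypothetical, not the workshop datasets) and imitates a plain shuffled split, which is what train_test_split does by default when no stratify argument is given. It shows that merging a provided train/test pair and re-splitting keeps the sizes but gives no guarantee about per-satellite counts or about which rows end up in the test set.

```python
import random
from collections import Counter

# Hypothetical toy data: (satellite_id, sample_index) rows, not the real datasets.
train = [(sat, i) for sat in ("A", "B", "C") for i in range(8)]      # 24 rows
test = [(sat, i) for sat in ("A", "B", "C") for i in range(8, 10)]   # 6 rows, 2 per satellite


def random_resplit(rows, test_size, seed):
    """Shuffle the rows and cut off `test_size` of them as the new test set."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    return rows[test_size:], rows[:test_size]


# Merge the provided splits, then re-split at random with the same test size.
merged = train + test
new_train, new_test = random_resplit(merged, test_size=len(test), seed=0)

original_counts = Counter(sat for sat, _ in test)
resplit_counts = Counter(sat for sat, _ in new_test)

print("provided test split, rows per satellite:", dict(original_counts))
print("random re-split, rows per satellite:    ", dict(resplit_counts))
print("provided test rows still in the new test set:", len(set(test) & set(new_test)))
```

Whether this matters depends on how the provided split was designed, which the thread does not spell out.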

shcheklein commented 1 year ago

@SoyGema my question was not about the actual ML / DS implications based on statistics, etc. It's more about how you use it: one single dvc get ... data vs multiple dvc gets, file by file.
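The usage difference raised here can be sketched as follows. This only builds and prints the `dvc get <repo> <path>` commands without running them. The registry URL comes from the repository name in this thread; the `workshop/satellites` directory and the file names are hypothetical placeholders, and `dvc_get_commands` is a helper made up for illustration.

```python
import shlex

# Registry URL taken from the repository name in this thread.
REGISTRY = "https://github.com/iterative/dataset-registry"


def dvc_get_commands(paths, repo=REGISTRY):
    """Build one `dvc get <repo> <path>` command (as an argv list) per tracked path."""
    return [["dvc", "get", repo, path] for path in paths]


# Hypothetical layout 1: one .dvc file per dataset, so one download per file.
per_file = dvc_get_commands([
    "workshop/satellites/train.csv",
    "workshop/satellites/test.csv",
    "workshop/satellites/answer.csv",
])

# Hypothetical layout 2: the whole directory tracked by a single .dvc file.
single_dir = dvc_get_commands(["workshop/satellites"])

for cmd in per_file + single_dir:
    print(shlex.join(cmd))
```

With the per-file layout a workshop attendee runs three commands; with a single tracked directory, one command fetches everything.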

SoyGema commented 1 year ago

Makes sense. I'll do a single file. I'll follow up with feedback from @dberenbaum and a thumbs-up from @efiop on the how-to.

SoyGema commented 1 year ago

Done

jorgeorpinel commented 1 year ago

Pls remove .DS_Store file 🙂

please add a brief README into subdirectory

Or maybe add a section to the main one?

shcheklein commented 1 year ago

Or maybe add a section to the main one?

this is less scalable I think, let's make it a readme per dataset

@SoyGema thanks, some minor things left! (I think it's not even that relevant for you anymore? sorry for all these delays)

SoyGema commented 1 year ago

No worries at all. If you are OK with this, I think it is still relevant for ensuring reproducibility, consistency with other examples, and preparation for future workshops. Thanks @jorgeorpinel for being part of this, I really appreciate it. 🥇

jorgeorpinel commented 1 year ago

I'm just nosy.

SoyGema commented 1 year ago

Pls remove .DS_Store file 🙂

please add a brief README into subdirectory

Or maybe add a section to the main one?

Done

shcheklein commented 1 year ago

@SoyGema It's merged but: