quantum-entangled / machine-learning-ui

Streamlit app for basic Machine Learning workflows
https://quantum-entangled.github.io/machine-learning-ui/
GNU General Public License v3.0

Create some Dataset Preprocessor #21

Closed quantum-entangled closed 1 year ago

quantum-entangled commented 2 years ago

Create an intermediate structure for preprocessing the given dataset before passing it into the program.

Main concerns:

CKPEIIKA commented 2 years ago

We can implement readers and writers as separate modules and start from the end first:

quantum-entangled commented 2 years ago

All right, I'd like to clarify that from the first days we agreed to ask our users to upload files only in .txt format. That's why I've limited the file chooser to that extension, and that's why I was wondering why we need the other ones. So there is no discussion about automatic format identification. We just need to either ask users to do the necessary preprocessing manually before uploading (and fully describe the requirements in the docs, for example) or write our own basic preprocessor, which again doesn't eliminate some constraints, such as the data not being completely messed up.

CKPEIIKA commented 2 years ago

I can take a look at that. Should I implement it in "DataManager.py", or will something change?

quantum-entangled commented 2 years ago

Note that the whole app is divided into 'front' and 'back' parts (some parts are poorly structured for the moment, especially in the data prep section, because the architecture wasn't clear at the start, but it's no big deal to fix). So you probably want to change/expand this DataManager part: https://github.com/quantum-entangled/gam-ml-test/blob/5a283d93087d00ab087326b333bba4ddae1d3f8a/src/Managers/DataManager.py#L25-L46

Feel free to write any code; I will examine it anyway to match the code style (you can check it right within the class, and it's worth mentioning that I'm using Python black with VSCode auto-formatting). But please create a distinct git branch from dev and open a PR when you're ready.

quantum-entangled commented 1 year ago

Hello @CKPEIIKA, do you have any updates? I have a few new thoughts on this.

CKPEIIKA commented 1 year ago

It's in an initial state, so I can redo it from scratch anyway.

quantum-entangled commented 1 year ago

> It's in an initial state, so I can redo it from scratch anyway.

All right, so I have some suggestions. I would like to add dataset-splitting functionality so that its parts could be used at different stages. And this leads me to refactoring the data storage.

In my opinion, we should load the uploaded file into a pandas DataFrame, because it's convenient for the data-manipulation functions and for displaying the table; you can already see that I'm using it in DataGrid, since that's the only format it accepts. Yes, DataFrames are less efficient in memory and time than NumPy arrays, but we accept many natural application limitations a priori (e.g., Binder allocates only 8 GB of RAM) and assume users won't have really big datasets, since the app would be inconvenient to use with them anyway. So storing the file as a DataFrame, but using NumPy arrays for the train and test splits, seems reasonable to me.
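A minimal sketch of what this could look like, assuming a plain dataclass holds the uploaded DataFrame and the splits are materialized as NumPy arrays only when requested (the names here are illustrative, not the actual DataManager API):

```python
from dataclasses import dataclass, field

import numpy as np
import pandas as pd


@dataclass
class DataStore:
    """Keeps the uploaded file as a DataFrame; splits live as NumPy arrays."""

    df: pd.DataFrame = field(default_factory=pd.DataFrame)

    def split(self, train_fraction: float = 0.8, seed: int = 42):
        """Shuffle rows and return (train, test) as NumPy arrays."""
        rng = np.random.default_rng(seed)
        indices = rng.permutation(len(self.df))
        cutoff = int(len(self.df) * train_fraction)
        data = self.df.to_numpy()
        return data[indices[:cutoff]], data[indices[cutoff:]]
```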

Moreover, we've agreed to switch to the csv/tsv file format, so pd.read_csv is very suitable here, and for now we should focus more on checking that the columns have the right types (e.g., not including strings / dates / special characters, etc.) and on the presence of N/As.
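For illustration, a basic check along these lines could look as follows (just a sketch under the numeric-only assumption discussed above, not the project's actual preprocessor):

```python
import pandas as pd


def load_and_validate(path: str, sep: str = ",") -> pd.DataFrame:
    """Read a csv/tsv file and reject non-numeric columns and missing values."""
    df = pd.read_csv(path, sep=sep)

    # Only int/float columns are allowed; strings, dates, etc. are rejected.
    non_numeric = [col for col in df.columns
                   if not pd.api.types.is_numeric_dtype(df[col])]
    if non_numeric:
        raise ValueError(f"Non-numeric columns found: {non_numeric}")

    # N/A presence check.
    if df.isna().any().any():
        raise ValueError("Dataset contains missing values.")

    return df
```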

CKPEIIKA commented 1 year ago

I agree that working with pandas DataFrames is much easier. But it seems to me that it requires changing the data class (at least there is no need to store the headers separately anymore, and everything should now expect a dataframe.to_numpy()).

By the way: what is the benefit of using ipyfilechooser? An option: https://ipywidgets.readthedocs.io/en/stable/examples/Widget%20List.html#File-Upload

quantum-entangled commented 1 year ago

> I agree that working with pandas DataFrames is much easier. But it seems to me that it requires changing the data class (at least there is no need to store the headers separately anymore, and everything should now expect a dataframe.to_numpy()).

Sure, it was an early-stage "kludge" to process the headers and data like that, so I will start refactoring everything else.

> By the way: what is the benefit of using ipyfilechooser? An option: https://ipywidgets.readthedocs.io/en/stable/examples/Widget%20List.html#File-Upload

Yes, that was my first option, but I quickly realized that it has a 10 MB file-size limit, and none of our previous files satisfied that. There are some workarounds that tweak the Jupyter config file, but I tested them and uploading a file of around 100 MB just crashed the kernel. In a similar issue ipyfilechooser was suggested, and I thought it suits our UI well, since you can upload the file via the Jupyter server directly.
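For reference, a minimal ipyfilechooser setup restricted to the formats we discussed might look like this (a sketch; the starting directory and filter pattern are placeholders):

```python
from IPython.display import display
from ipyfilechooser import FileChooser

# The chooser browses files already on the Jupyter server,
# so the ipywidgets FileUpload size limit doesn't apply.
chooser = FileChooser(".")
chooser.filter_pattern = ["*.csv", "*.tsv"]
display(chooser)

# After the user picks a file, chooser.selected holds its full path.
```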

CKPEIIKA commented 1 year ago

After messing around a little with preprocessing, I've got some questions:

  1. How should error handling be implemented for now? When I'm printing a warning, I feel like I'm going the wrong way.
  2. How will testing be implemented? I'm close to starting to write tests, at least locally.
  3. If we find wrong columns or NaNs, should we just raise an error or offer to exclude the affected rows and columns?
  4. Should we accept only csv and tsv with int and float values? Should we accept csv with ";" or " " separators and "," decimal delimiters?
  5. Why are uploading and the DataGrid separated? What is supposed to be in these modules later? It could be more convenient to instantly get some kind of preview of the current file rather than switching between tabs.

quantum-entangled commented 1 year ago

  1. As you can see, our "errors" are not quite errors. They are just conditions under which we display certain messages to users so as not to scare them. So what I mean by an "error handler" is some generalization of the different message types, which we can use in different parts without messing up / repeating code. You can see in my last commit that I added a Messages Enum for that purpose (a rough sketch of the idea follows after this list), plus moved printing from the managers to the UI. It has nothing to do with things like try-except, which can be of great use in the code itself but shouldn't display any "big errors" in the UI; those are just for development.

  2. I want to use pytest for unit testing, but I encourage you to wait with that a little bit more, as I still change the code often, and that could lead to errors. I will write basic tests when it's ready, and you can then add any cases, etc. (a minimal example of what such a test could look like is also sketched after this list).

  3. That was one of my questions too. I'm not sure if we should allow users to make changes to the file via the UI itself, because it could later lead to some redundant code extensions, like writing a whole file editor. Yet again, we have some reasonable limitations on app functionality, but I don't know if there is a need to add at least basic things, like a NaN handler, a column dropper, and so on.

  4. For now we definitely should accept only ints and floats, as the models are designed to accept these and nothing else. As for delimiters, I think "," is the most common anyway; I'm not sure about ";", but blank spaces are not acceptable, as they are very error-prone, especially when processing headers.

  5. The idea of every widget module (at least in my mind) is to separate the workflows. So uploading is meant just for loading the file into memory, and then you can examine it via the other widgets. DataGrid allows one to sort/filter and more with its library (a one-liner showing how the stored DataFrame feeds the grid is sketched below). But it's expected that the user isn't seeing their dataset for the first time, so the DataGrid isn't directly intended for finding NaNs/dates/strings and stuff like that; it just shows some columns/statistics, etc. Once again, that is more a question of users' dataset limitations.
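
On point 1, a rough illustration of centralizing the message types (the names are made up for this sketch, not the actual Enum from the commit):

```python
from enum import Enum


class Message(Enum):
    """User-facing messages shared between widgets (illustrative names)."""

    NO_FILE_SELECTED = "Please select a file before continuing."
    NON_NUMERIC_DATA = "The dataset must contain only numeric columns."
    MISSING_VALUES = "The dataset contains missing values."


def show_message(message: Message) -> None:
    """Only the UI layer renders messages; managers just pick which one."""
    print(message.value)
```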
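On point 2, once the code settles, a basic pytest case could stay as small as this (it exercises the hypothetical load_and_validate helper from the earlier sketch):

```python
import pandas as pd
import pytest

# Hypothetical module/helper from the earlier sketch, not an existing file.
from preprocessing import load_and_validate


def test_rejects_missing_values(tmp_path):
    # Write a tiny csv containing a NaN to a temporary file.
    csv_file = tmp_path / "data.csv"
    pd.DataFrame({"x": [1.0, None], "y": [2.0, 3.0]}).to_csv(csv_file, index=False)

    # The validator is expected to refuse datasets with missing values.
    with pytest.raises(ValueError):
        load_and_validate(str(csv_file))
```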
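On point 5, for context, wiring the stored DataFrame into the grid is essentially a one-liner, assuming ipydatagrid is the DataGrid library in use here:

```python
import pandas as pd
from ipydatagrid import DataGrid

df = pd.DataFrame({"feature_1": [1.0, 2.0], "feature_2": [3.0, 4.0]})

# The grid itself provides sorting/filtering; no NaN or type checks happen here.
DataGrid(df)
```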