We can implement readers and writers as separate modules; let me start from the end:
All right, I'd like to clarify that from the first days we agreed to ask our users to upload files only in .txt format. That's why I've limited the file chooser to that extension, and that's why I was wondering why we need the other ones. So there is no discussion about automatic format identification. We just need to either ask users to do the necessary preprocessing manually before uploading (and fully describe the requirements in the docs, for example) or write our own basic preprocessor, which again doesn't remove all constraints: the data still can't be completely messed up.
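For illustration, a basic preprocessor along those lines could be as small as this sketch (the function name, default delimiter, and error messages are hypothetical):

```python
from pathlib import Path

import numpy as np


def preprocess_txt(path: str, delimiter: str = ",") -> np.ndarray:
    """Very basic .txt preprocessor: checks the extension, skips blank
    lines, and verifies that every row is numeric with a consistent
    number of columns. Raises ValueError with a short message otherwise."""
    file = Path(path)
    if file.suffix.lower() != ".txt":
        raise ValueError(f"Expected a .txt file, got '{file.suffix}'")

    rows = []
    with file.open() as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            try:
                rows.append([float(x) for x in line.split(delimiter)])
            except ValueError:
                raise ValueError(f"Non-numeric value on line {lineno}")

    if len({len(r) for r in rows}) != 1:
        raise ValueError("Rows have inconsistent numbers of columns")

    return np.array(rows)
```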
I can take a look at that. Should I implement it in "DataManager.py", or will something change?
Note that the whole app is divided into 'front' and 'back' parts (some parts are poorly structured for the moment, especially in the data prep section, because the architecture wasn't clear at the start, but it's no big deal to fix). So you probably want to change/expand this DataManager part: https://github.com/quantum-entangled/gam-ml-test/blob/5a283d93087d00ab087326b333bba4ddae1d3f8a/src/Managers/DataManager.py#L25-L46
Feel free to write any code; I will review it anyway to match the code style (you can check it right within the class, and it's worth mentioning that I'm using Python black with VSCode auto-formatting). But please create a distinct git branch from dev and open a PR when you're ready.
Hello @CKPEIIKA, do you have any updates? I have a few new thoughts on this.
It's in an initial state, so I can redo it from scratch anyway.
> It's in an initial state, so I can redo it from scratch anyway.
All right, so I have some suggestions. I would like to add dataset-splitting functionality so that its parts could be used in different stages, and this leads me to refactoring the data storage.
In my opinion, we should load the file into a pandas DataFrame, because it's convenient for the data-manipulation functions and for displaying the table; you can already see that I'm using it in DataGrid, since that's the only format it accepts. Yes, DataFrames are less efficient in terms of memory and time than NumPy arrays, but we accept many natural application limits a priori (e.g. Binder allocates only 8 GB of RAM) and assume users won't have really big datasets, because the app would be inconvenient to use then. So storing the file as a DataFrame, but using NumPy arrays for the train and test splits, seems reasonable to me.
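Roughly what I mean, as a sketch (the file path and the "target" column name are placeholders):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Keep the uploaded file as a DataFrame for display and manipulation...
df = pd.read_csv("uploaded_file.csv")  # hypothetical path

# ...but hand plain NumPy arrays to the training/testing stages.
features = df.drop(columns=["target"]).to_numpy()  # "target" is a placeholder column name
labels = df["target"].to_numpy()

x_train, x_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42
)
```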
Moreover, we've agreed to switch to the csv/tsv file format, so pandas read_csv is very suitable here, and for now we should rather focus on checking for the right column types (e.g. no strings / dates / special characters, etc.) and for the presence of N/As.
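A possible check along those lines (just a sketch; the function name and messages are made up):

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def validate_columns(df: pd.DataFrame) -> list[str]:
    """Return a list of human-readable problems: non-numeric columns
    (strings, dates, etc.) and columns containing N/A values."""
    problems = []
    for col in df.columns:
        if not is_numeric_dtype(df[col]):
            problems.append(f"Column '{col}' is not numeric ({df[col].dtype})")
        if df[col].isna().any():
            problems.append(f"Column '{col}' contains N/A values")
    return problems


df = pd.read_csv("data.csv", sep=",")  # sep="\t" for .tsv
issues = validate_columns(df)
```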
I agree that working with pandas DataFrames is much easier. But it seems to me that it requires changing the data class (at least there is no longer a need to store headers, and everything should now expect the result of DataFrame.to_numpy()).
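Something like this hypothetical shape, not the actual class from the repo:

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np
import pandas as pd


@dataclass
class Data:
    """Hypothetical shape of the refactored container: the DataFrame
    already carries the headers, and the splits are plain NumPy arrays."""

    file: Optional[pd.DataFrame] = None
    x_train: Optional[np.ndarray] = None
    x_test: Optional[np.ndarray] = None
    y_train: Optional[np.ndarray] = None
    y_test: Optional[np.ndarray] = None
```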
By the way, what is the benefit of using ipyfilechooser? An alternative: https://ipywidgets.readthedocs.io/en/stable/examples/Widget%20List.html#File-Upload
> I agree that working with pandas DataFrames is much easier. But it seems to me that it requires changing the data class (at least there is no longer a need to store headers, and everything should now expect the result of DataFrame.to_numpy()).
Sure, it was an early-stage "kludge" to process headers and data like that, so I will start refactoring everything else.
> By the way, what is the benefit of using ipyfilechooser? An alternative: https://ipywidgets.readthedocs.io/en/stable/examples/Widget%20List.html#File-Upload
Yes, that was my first option, but I quickly realized that it has a 10 MB file size limit, and none of our previous files fit within that. There are some workarounds that tweak the Jupyter config file, but I've tested them, and uploading a file of around 100 MB just crashed the kernel. In a similar issue ipyfilechooser was suggested, and I thought it suits our UI well, because you can upload the file via the Jupyter server directly.
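For reference, the basic usage I have in mind (assuming the standard ipyfilechooser API; the filter patterns and callback are just an example):

```python
from IPython.display import display
from ipyfilechooser import FileChooser

fc = FileChooser(".")                   # start in the server's working directory
fc.filter_pattern = ["*.csv", "*.tsv"]  # restrict to the agreed formats


def on_file_selected(chooser: FileChooser) -> None:
    # chooser.selected is the full path of the chosen file on the Jupyter server
    print(f"Selected: {chooser.selected}")


fc.register_callback(on_file_selected)
display(fc)
```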
After messing around a little with preprocessing, I've got some questions:
As you can see, our "errors" are not quite errors. They are just conditions under which we display certain messages to users in order not to scare them. So what I mean by an "error handler" is some generalization of the different message types, which we can use in different parts without duplicating code. You can see in my last commit that I added Message Enums for that purpose, plus moved the printing from the managers to the UI. It has nothing to do with things like try/except blocks, which could be of great use in the code itself but shouldn't display any "big errors" in the UI; those are just for development.
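Roughly in this spirit (a hypothetical illustration, not the exact Enum from the commit):

```python
from enum import Enum


class Message(Enum):
    """Hypothetical set of user-facing messages; the real Enum may differ."""

    NO_FILE_SELECTED = "Please select a file first."
    WRONG_FILE_FORMAT = "Only .csv/.tsv files are supported."
    NON_NUMERIC_COLUMNS = "The dataset contains non-numeric columns."
    UPLOAD_SUCCESS = "The file was uploaded successfully."


def show_message(msg: Message) -> None:
    # The UI layer decides how to render the message (label, popup, etc.);
    # managers only pass the Enum member around instead of printing.
    print(msg.value)
```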
I want to use pytest for unit testing, but I encourage you to wait with that a little bit longer, as I still change the code often, and that could lead to errors. I will write basic tests when it's ready, and you can then add any cases, etc.
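Just to show the style I mean, a couple of self-contained pytest cases (the file name and assertions are only an example):

```python
# test_data.py -- hypothetical file name; pytest discovers test_*.py modules
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def test_dataframe_to_numpy_preserves_shape():
    df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})
    arr = df.to_numpy()
    assert isinstance(arr, np.ndarray)
    assert arr.shape == (3, 2)


def test_non_numeric_column_is_detected():
    df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
    assert not is_numeric_dtype(df["b"])
```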
That was one of my questions too. I'm not sure whether we should allow users to make changes to the file via the UI itself, because it could later lead to redundant code growth, like ending up writing a whole file editor. Then again, we have some reasonable limitations on the app's functionality, but I don't know whether there is a need to add at least basic things, like a NaN handler, a column dropper, and so on.
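If we do add them, I imagine something minimal like this sketch (function names and strategies are hypothetical):

```python
import pandas as pd


def drop_columns(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Basic 'column dropper': remove the listed columns if they exist."""
    return df.drop(columns=[c for c in columns if c in df.columns])


def handle_nans(df: pd.DataFrame, strategy: str = "drop") -> pd.DataFrame:
    """Basic NaN handler: either drop rows with NaNs or fill them
    with the column mean (numeric columns only)."""
    if strategy == "drop":
        return df.dropna()
    if strategy == "mean":
        return df.fillna(df.mean(numeric_only=True))
    raise ValueError(f"Unknown strategy: {strategy}")
```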
For now we should definitely accept only ints and floats, as the models are designed to accept these and nothing else. As for delimiters, I think "," is the most common anyway; I'm not sure about ";", but blank spaces are not acceptable, as they are very error-prone, especially in header processing.
The idea of every widget module (at least in my mind) is to separate the workflows. So uploading is meant just for loading the file into memory, and then you can examine it via the other widgets. DataGrid allows one to sort/filter and more through its library. But it's expected that the user isn't seeing their dataset for the first time, so the DataGrid isn't directly intended for finding NaNs/dates/strings and things like that; it just shows some columns/statistics, etc. Once again, that is more a question of limitations on the users' datasets.
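For context, the display path is essentially this (a sketch; the file name is a placeholder, and DataGrid comes from ipydatagrid):

```python
import pandas as pd
from IPython.display import display
from ipydatagrid import DataGrid

df = pd.read_csv("data.csv")  # hypothetical file

# Show the raw table; DataGrid itself provides sorting/filtering in the UI.
display(DataGrid(df))

# A quick statistics view the examination widget could show alongside it.
display(DataGrid(df.describe()))
```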
Create an intermediate structure for preprocessing the given dataset before passing it into the program.
Main concerns: