hmc-whisk / jsLDA

A React based version of jsLDA with brand new features added on

Things left behind by Mia #205

Open mia1024 opened 2 years ago

mia1024 commented 2 years ago

So, there are a few things I started but didn't have time to finish:

1. Upload a set of documents different from the model's training set
   In the user study, one of my users was very eager to upload a set of documents other than the training dataset. They said this would be very useful for comparing effectiveness across different models. This is currently not possible. If you try it when uploading a MALLET file, you'll probably get, at best, a cryptic error during the upload and, at worst, serious app crashes when the documents are accessed. This is only speculation and has not been tested.
2. Uploading stoplists
   The previous implementation required the user to upload a stoplist whenever a custom document set was uploaded, which doesn't quite make sense because our default documents don't have stoplists either. This was probably very confusing, so I removed the stoplist requirement. Stoplist upload now lives on a separate tab and is marked as not implemented, with comments in the code explaining why: https://github.com/hmc-whisk/jsLDA/blob/50401e9182e867ade46475afb2260023fb8b0497/src/Components/Pages/importExportPage.tsx#L269-L278
3. More validation on uploaded documents
   This is somewhat related to the first point. Since we allow users to upload documents, we should communicate what went wrong when something fails, instead of showing a generic "Error processing document" or something like that. This might be solved by a blog post instead of error messages. Either way, we need to tell users clearly and exactly what input we expect.
4. Possible relocation of the Import&Export tab
   I implemented this in a hurry without really talking to anyone. It can definitely be moved somewhere else instead of sitting on its own tab. The empty space beneath it feels really awkward to me. One solution is to merge it with the configure menu. Ideally, this would be done by merging the configure button into the Import&Export tab and turning each thing we currently have there (e.g. the tokenizer) into a tab on the left, like what we have for Import&Export. This should look nicer than what we have now, and we can just use Bootstrap instead of writing our own (and honestly, a bit janky) CSS.
5. Some code and interface cleaning
   The `core/serialization` module completely replaces the existing workflow for uploading and downloading documents. As such, a range of functions (`LDAModel.ready()`, `App.queueLoad()`, the data upload and model upload sections of the configure menu, and all the respective handlers propagated into all components) can be removed or rewritten so they no longer depend on, among other things, `d3.text()`. This should be one of the easier tasks in this list because it only involves finding and deleting things, with no new implementation needed. I started the removal in ce10f0da, and a large chunk of the dead code can be traced by examining which props are no longer needed from `src/Components/Header/Uploader.tsx`.
6. Model auto saving on tab close
   With the serialization in place, we should be able to save the documents and models to browser storage (IndexedDB) so that the user can keep their progress without having to download the model. A detailed description of how the storage works and its limitations can be found in the comments in `core/storage`. One important decision is whether to save the model in compressed or uncompressed form, since compression takes a significant amount of time. (Try loading a large document set, downloading it, and re-uploading it while watching the status bar, and you'll get a rough idea of how long compression and decompression take.) This is essentially a trade-off between space and speed, and I have no good answer for it.
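
For item 6, one way to avoid committing to either form up front is to record, per snapshot, whether it was compressed. The sketch below is only an illustration of that idea, not the existing `core/storage` API: `SnapshotStore`, `Codec`, `MemoryStore`, `saveSnapshot`, `loadSnapshot` and the `compressAboveBytes` threshold are all hypothetical names, and the size-threshold policy is an assumption rather than a recommendation. An IndexedDB wrapper and the app's real compressor would stand in for the store and codec.

```typescript
// Hypothetical async key-value interface; an IndexedDB wrapper could implement it.
interface SnapshotStore {
  put(key: string, value: Uint8Array): Promise<void>;
  get(key: string): Promise<Uint8Array | undefined>;
}

// Hypothetical codec; the app's actual compressor would be plugged in here.
interface Codec {
  compress(data: Uint8Array): Promise<Uint8Array>;
  decompress(data: Uint8Array): Promise<Uint8Array>;
}

// In-memory stand-in for browser storage, used only to exercise the sketch.
class MemoryStore implements SnapshotStore {
  private items = new Map<string, Uint8Array>();
  async put(key: string, value: Uint8Array): Promise<void> {
    this.items.set(key, value);
  }
  async get(key: string): Promise<Uint8Array | undefined> {
    return this.items.get(key);
  }
}

// A one-byte header marks the payload form, so loading works for both.
const RAW = 0;
const COMPRESSED = 1;

// Assumed policy: compress only snapshots larger than compressAboveBytes,
// paying the compression time only where the space saving matters most.
// Returns true if the snapshot was stored compressed.
async function saveSnapshot(
  store: SnapshotStore,
  key: string,
  data: Uint8Array,
  codec: Codec,
  compressAboveBytes: number
): Promise<boolean> {
  const useCodec = data.byteLength > compressAboveBytes;
  const payload = useCodec ? await codec.compress(data) : data;
  const framed = new Uint8Array(payload.byteLength + 1);
  framed[0] = useCodec ? COMPRESSED : RAW;
  framed.set(payload, 1);
  await store.put(key, framed);
  return useCodec;
}

// Reads a snapshot back, decompressing only if the header says so.
async function loadSnapshot(
  store: SnapshotStore,
  key: string,
  codec: Codec
): Promise<Uint8Array | undefined> {
  const framed = await store.get(key);
  if (framed === undefined) return undefined;
  const payload = framed.subarray(1);
  return framed[0] === COMPRESSED ? codec.decompress(payload) : payload;
}
```

Because the header travels with each snapshot, the threshold can be tuned later (or set to `Infinity` to never compress) without invalidating anything already saved.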