htrc / torchlite-backend

Backend API service for Torchlite web dashboard

Apply NLTK Stopwords Based On User Selection #116

Open dkudeki opened 3 weeks ago

dkudeki commented 3 weeks ago

In get_widget_data(), call a function that returns a version of filtered_volumes with all of the selected data cleaning processes applied. I think it makes the most sense to follow the lead of apply_filters() and create a function in data.py that handles data cleaning options. For now that means only applying stopwords, but it will be expanded later to cover additional cleaning options. Within this new function, call an algorithm that applies the stopword list to the existing features data; the simple tag cloud widget shows how to implement it. The NLTK stopword lists for each language should be stored in a file or files so that they can be accessed quickly based on the user's selection.
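As a rough illustration of the shape such a function could take, here is a minimal sketch. Everything in it is an assumption rather than existing torchlite-backend code: the names `apply_data_cleaning`, `load_stopwords`, `remove_stopwords`, and `STOPWORDS_DIR` are hypothetical, each volume is assumed to be a dict whose `features` entry maps tokens to counts, and the NLTK lists are assumed to have been exported ahead of time to one JSON file per language.

```python
import json
from pathlib import Path
from typing import Any, Dict, List, Optional, Set

# Hypothetical directory holding pre-exported NLTK stopword lists,
# one JSON array of words per language (e.g. stopwords/english.json).
STOPWORDS_DIR = Path("stopwords")


def load_stopwords(language: str, directory: Path = STOPWORDS_DIR) -> Set[str]:
    """Read one language's stopword list from disk as a lowercase set."""
    with open(directory / f"{language}.json", encoding="utf-8") as fh:
        return {word.lower() for word in json.load(fh)}


def remove_stopwords(token_counts: Dict[str, int], stopwords: Set[str]) -> Dict[str, int]:
    """Return a copy of token_counts without stopwords (case-insensitive match)."""
    return {tok: n for tok, n in token_counts.items() if tok.lower() not in stopwords}


def apply_data_cleaning(
    volumes: List[Dict[str, Any]],
    language: Optional[str],
    stopwords: Optional[Set[str]] = None,
) -> List[Dict[str, Any]]:
    """Apply the selected cleaning options to every volume.

    Assumes each volume is a dict with a "features" mapping of token -> count;
    the real structure of filtered_volumes may differ.
    """
    if language is None:
        # No stopword language selected: nothing to clean.
        return volumes
    if stopwords is None:
        stopwords = load_stopwords(language)
    cleaned = []
    for volume in volumes:
        updated = dict(volume)  # shallow copy so the input list is not mutated
        updated["features"] = remove_stopwords(volume.get("features", {}), stopwords)
        cleaned.append(updated)
    return cleaned
```

In get_widget_data() this would be called right after the filtering step, passing the language from the dashboard's cleaning settings. The optional `stopwords` argument is only there so the function can be exercised without files on disk.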

dkudeki commented 3 weeks ago

Additionally, you'll want to modify the models in dashboard.py to contain a structure for data cleaning settings, like it has for filters. For now that structure can just store the selected language for the stopword list. Any of the existing models that have the filters field should also have a sibling field for data cleaning. This is how the front-end will pass the cleaning data to the backend.
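To make the sibling-field idea concrete, here is a hedged sketch using standard-library dataclasses. The class and field names (`FilterSettings`, `DataCleaningSettings`, `DashboardSettings`, `datacleaning`, `language`) are hypothetical stand-ins, and the actual models in dashboard.py may be built on a different base class with different fields.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class FilterSettings:
    """Hypothetical stand-in for the existing filters structure."""
    titles: List[str] = field(default_factory=list)


@dataclass
class DataCleaningSettings:
    """Hypothetical cleaning settings; for now only the stopword language."""
    language: Optional[str] = None  # None means no stopword list is selected


@dataclass
class DashboardSettings:
    """Hypothetical model: the cleaning field sits next to the filters field."""
    filters: FilterSettings = field(default_factory=FilterSettings)
    datacleaning: DataCleaningSettings = field(default_factory=DataCleaningSettings)


# Example payload as the front-end might send it.
settings = DashboardSettings(datacleaning=DataCleaningSettings(language="english"))
payload = asdict(settings)
```

Keeping the cleaning options in their own nested structure, rather than as a bare string, leaves room for the additional cleaning options mentioned above without another model change.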

dkudeki commented 3 weeks ago

You'll want to make equivalent changes on the frontend models to hold cleaning data here. Then you'll want to modify CleanDataWidget.tsx to handle user selections like DataFilterWidget.tsx does.