mrbrianevans / social-media-export-analyser

Analyse GDPR exports of your data from big social media companies

https://social-media-export-analyser-mrybc.ondigitalocean.app/

MIT License

Topic extraction #57

Closed mrbrianevans closed 2 years ago

mrbrianevans commented 2 years ago

Use natural language processing to extract topics from data, and display them in a separate analysis tab.

This is to be done with the fast-topics package.

It can be applied to things like YouTube watch history (#47), Twitter posts and messaging conversations such as WhatsApp and Telegram.

The interesting output here is the extracted topics themselves, not the classification of each document. It is not yet clear how to choose the number of topics to extract.

There needs to be a way of displaying only the topics that documents match closely, and filtering out topics made up of generic words that match every document.
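A rough sketch of one way this filtering could work. It assumes the topic model yields a weight per topic for each document; that shape, the function name `selectDistinctiveTopics` and both thresholds are hypothetical and not taken from the fast-topics API. A topic is kept if at least one document matches it closely, and dropped if too large a share of documents match it, since that suggests generic words.

```typescript
// Hypothetical shape: docTopicWeights[d][t] is how strongly document d matches topic t (0..1).
type DocTopicWeights = number[][];

interface TopicFilterOptions {
  closeMatch: number; // weight at which a document counts as closely matching a topic
  maxDocShare: number; // drop topics closely matched by more than this share of documents
}

// Returns the indices of topics that at least one document matches closely,
// excluding topics that too many documents match (likely generic words).
function selectDistinctiveTopics(
  docTopicWeights: DocTopicWeights,
  { closeMatch, maxDocShare }: TopicFilterOptions
): number[] {
  const docCount = docTopicWeights.length;
  if (docCount === 0) return [];
  const topicCount = docTopicWeights[0]?.length ?? 0;
  const kept: number[] = [];
  for (let t = 0; t < topicCount; t++) {
    let matching = 0;
    for (const weights of docTopicWeights) {
      if ((weights[t] ?? 0) >= closeMatch) matching++;
    }
    const share = matching / docCount;
    if (matching > 0 && share <= maxDocShare) kept.push(t);
  }
  return kept;
}
```

The thresholds would need tuning per data source; a chat export and a watch history probably behave quite differently.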

mrbrianevans commented 2 years ago

https://github.com/mrbrianevans/social-media-export-analyser/commit/8be56f0839d0026b22c7fc85372d65aba86b2566 added an MVP of topic extraction. There are more features that could be added to improve the user experience.

The call to getTopics must be moved to a worker thread, because at the moment it's blocking and can take a few seconds depending on corpus size.
There should be a button for the user to regenerate the topics, as the operation is non-deterministic.
The time to extract topics should be measured with performance.now() and shown to the user.
One idea is to let the user choose the number of topics. Not sure if this is a good idea, but it's one option.
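For the timing point, a minimal sketch of wrapping a blocking call with performance.now(). The helper `timeIt` and the stand-in `extractTopics` are hypothetical names for illustration; the real getTopics signature is not shown in this issue.

```typescript
// Runs a synchronous function and reports how long it took in milliseconds.
function timeIt<T>(fn: () => T): { result: T; elapsedMs: number } {
  const start = performance.now();
  const result = fn();
  return { result, elapsedMs: performance.now() - start };
}

// Hypothetical stand-in for the real getTopics call: takes the first word of each document.
const extractTopics = (docs: string[]): string[] =>
  docs.map((doc) => doc.split(" ")[0] ?? "");

const timing = timeIt(() => extractTopics(["cats are great", "dogs are great"]));
console.log(`Extracted ${timing.result.length} topics in ${timing.elapsedMs.toFixed(1)} ms`);
```

The elapsed value could be rendered in the analysis tab instead of being logged.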

mrbrianevans commented 2 years ago

The call to WebAssembly has been moved to a web worker, freeing up the main thread and keeping the website interactive while loading. A loading icon is shown in the tab, and the tab is disabled while the topics are being calculated. Topics are only calculated when a file is selected. The WASM file is only loaded when a file using topic extraction is processed (i.e. lazy loading).
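The lazy loading part can be sketched roughly as below. This is not the code from the commit; `lazy` and `loadTopicModule` are hypothetical names, and the loader body is a stand-in for fetching and instantiating the WASM file. The idea is that the loader runs on first use only, and every later call reuses the same promise.

```typescript
// Wraps a loader so that it runs at most once; later calls reuse the same promise.
function lazy<T>(load: () => Promise<T>): () => Promise<T> {
  let cached: Promise<T> | undefined;
  return () => {
    if (cached === undefined) cached = load();
    return cached;
  };
}

// Hypothetical stand-in for fetching and instantiating the WASM module.
const loadTopicModule = lazy(async () => {
  console.log("loading topic module");
  return { getTopics: (docs: string[]) => docs.length };
});

// Nothing is loaded until the first file that needs topic extraction is processed.
loadTopicModule().then((mod) => console.log(mod.getTopics(["a", "b"])));
loadTopicModule(); // reuses the first promise, the loader does not run again
```

Caching the promise rather than the resolved module means two files processed in quick succession still trigger only one load.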

A design pattern has been established for off-thread computation. This serves as an example.

Performance timing is done, but the results are logged to the console rather than shown to the user. There is no button for the user to recalculate the topics, and the user cannot control how many topics are extracted. Both are optional future work.