I started a job on the HPC testing this in a qualitative way. I'm training LDA models using Neurosynth and the three available implementations (sklearn, lda, and NiMARE/MALLET). I will look at (1) how long each takes and (2) whether the topics look reasonable or not.
- `lda` took about 60 minutes using its default of 2000 iterations.
- `sklearn` took about 6 minutes using its default of a maximum of 1000 iterations.
- NiMARE took about 6 minutes using its default of 1000 iterations.

All three look pretty good to me, so I'm actually considering just switching to scikit-learn from now on...
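For reference, a minimal sketch of how the two pure-Python candidates are invoked on a document-term count matrix (toy documents here; the topic count is arbitrary and the iteration count just mirrors the default mentioned above):

```python
import lda
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy document-term counts; in practice this would be the Neurosynth abstracts.
docs = ["pain intensity insular cortex", "reward anticipation ventral striatum"]
X = CountVectorizer().fit_transform(docs).toarray()

# lda-project/lda: collapsed Gibbs sampling; n_iter defaults to 2000.
gibbs_model = lda.LDA(n_topics=2, n_iter=2000, random_state=1)
gibbs_model.fit(X)  # topic-word weights in gibbs_model.topic_word_

# scikit-learn: variational Bayes.
vb_model = LatentDirichletAllocation(n_components=2, random_state=1)
doc_topic = vb_model.fit_transform(X)  # topic-word weights in vb_model.components_
```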
Here are the top 10 "words" (both unigrams and bigrams) from the first 10 topics for each method:
```text
Topic 001: pain women men insular intensity female male sex painful
Topic 002: reward striatum ventral anticipation ventral striatum monetary loss rewards motivation
Topic 003: stimuli stimulus response presented presentation modulated visual stimuli responded respond
Topic 004: condition conditions conflict generation problem experimental involved control condition resolution
Topic 005: thalamus caudate putamen basal ganglia basal ganglia nucleus striatal circuits
Topic 006: state resting resting state state functional fc spontaneous reho rs regional
Topic 007: trials error trial errors prediction goal monitoring directed internal
Topic 008: individuals ability higher suggest tool correlates linked abilities sample
Topic 009: controls healthy schizophrenia healthy controls disorder hc abnormalities deficits altered
Topic 010: sulcus superior intraparietal number superior temporal intraparietal sulcus ips posterior temporal sulcus
```

```text
Topic 001: motor movement movements sensorimotor hand primary finger contralateral primary motor
Topic 002: prefrontal medial prefrontal cortex events medial prefrontal physical lateral contextual lateral prefrontal
Topic 003: information spatial representations location integration human acupuncture representation locations
Topic 004: bold signal level blood bold signal response oxygen level bold blood oxygen
Topic 005: control cingulate anterior cingulate cingulate cortex anterior acc cognitive cognitive control prefrontal
Topic 006: cue alcohol exposure use determined cues craving controlling dependence
Topic 007: matter white white matter grey grey matter fa diffusion integrity structural
Topic 008: emotion regulation emotions emotion regulation cognitive strategies emotional amygdala reappraisal
Topic 009: occipital objects lateral visual shape scenes scene occipital cortex properties
Topic 010: future accuracy dissociation speed direct make far accurate suggested
```

```text
Topic 001: eeg frequency fmri activity bold hz signal cortical power
Topic 002: network connectivity default regions task dmn mode functional activity
Topic 003: connectivity functional network state resting networks regions brain fc
Topic 004: cortex prefrontal anterior regions medial cingulate posterior brain functional
Topic 005: response inhibition control task error feedback trials errors activation
Topic 006: fmri time data activation subjects group analysis brain studies
Topic 007: women stress men activation brain sex regulation olfactory amygdala
Topic 008: brain regions studies neural functional human specific evidence findings
Topic 009: social participants neural empathy activity person perspective human interaction
Topic 010: language left hemisphere lateralization english activation native hemispheric processing
```
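(Lists like these can be produced from any of the fitted models by ranking each topic's word weights; the helper below is a hypothetical sketch, where `topic_word` would be `model.components_` for scikit-learn or `model.topic_word_` for `lda`.)

```python
import numpy as np

def print_top_words(topic_word, vocabulary, n_words=10):
    """Hypothetical helper: print the n_words highest-weighted terms per topic."""
    vocabulary = np.asarray(vocabulary)
    for i, weights in enumerate(topic_word, start=1):
        top_terms = vocabulary[np.argsort(weights)[::-1][:n_words]]
        print(f"Topic {i:03d}: " + " ".join(top_terms))
```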
One consequence of using `sklearn` or `lda` for LDA is that we won't really need a NiMARE class for it. Really, we just need an example that covers how to (1) download abstracts with NiMARE, (2) generate a vocabulary and counts with either NiMARE or sklearn, (3) run the LDA model with whichever package we choose, and (4) convert the resulting arrays to DataFrames for readability.
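As a rough sketch of what that example could look like (the dataset filename, email, topic count, and the assumption that `nimare.extract.download_abstracts` leaves the abstracts in `dset.texts["abstract"]` are all illustrative, not settled API):

```python
import pandas as pd
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

from nimare.dataset import Dataset
from nimare.extract import download_abstracts

# (1) Download abstracts with NiMARE (illustrative filename/email; assumes the
#     abstracts end up in dset.texts["abstract"]).
dset = Dataset("neurosynth_dataset.json")
dset = download_abstracts(dset, email="example@example.edu")

# (2) Generate a vocabulary and counts with scikit-learn (unigrams + bigrams,
#     to match the "words" shown above).
vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words="english")
counts = vectorizer.fit_transform(dset.texts["abstract"].fillna(""))
vocabulary = vectorizer.get_feature_names_out()

# (3) Run the LDA model with whichever package we choose (scikit-learn here;
#     100 topics is an arbitrary choice).
model = LatentDirichletAllocation(n_components=100, random_state=0)
doc_topic = model.fit_transform(counts)

# (4) Convert the resulting arrays to DataFrames for readability.
topic_names = [f"topic_{i + 1:03d}" for i in range(model.n_components)]
doc_topic_df = pd.DataFrame(doc_topic, index=dset.texts["id"], columns=topic_names)
topic_word_df = pd.DataFrame(model.components_, index=topic_names, columns=vocabulary)
```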
Unless, of course, we create the Annotator base class (#488), in which case there's a use-case for a class that will incorporate the topic-word and doc-topic weights into a Dataset.
Summary
The library: https://github.com/lda-project/lda
Additional details
The main reason we use MALLET instead of scikit-learn (which also has an LDA implementation) is that scikit-learn's version produces results that don't make as much sense as MALLET's, meaning that the topics we were seeing didn't look like meaningful collections of words. If the `lda` library can work as well as MALLET, but in pure Python, then that is definitely preferable.
Next steps
- Test `lda` on a test Dataset or two to see what the resulting topics look like.
- If the topics look good, switch from MALLET to `lda`.