neurostuff / NiMARE

Coordinate- and image-based meta-analysis in Python
https://nimare.readthedocs.io
MIT License

Add fetcher/converter for NeuroQuery dataset #522

Closed · tsalo closed this 3 years ago

tsalo commented 3 years ago

Summary

One thing that came up during our OHBM educational course panel on meta-analysis was that NiMARE should be able to ingest NeuroQuery's dataset. Since the data are available on GitHub, we can create a fetcher like nimare.extract.fetch_neurosynth and convert the data to a NiMARE Dataset.

Additional details

Access to a very large, very high-quality dataset of coordinates and annotations.

Related issues:

tsalo commented 3 years ago

Per today's Neurostore call, @jeromedockes will convert the files in neuroquery_data to the same format as Neurosynth's data in neurosynth-data. This will make fetching and converting to NiMARE Datasets trivial, since we already have a conversion function for that format.
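
To make the conversion concrete, here is a minimal sketch of what writing a NeuroQuery-style sparse term matrix out as a Neurosynth-style features TSV (one row per PMID, one column per term) could look like. The toy arrays and file names are illustrative stand-ins, not the real repository contents.

```python
# Minimal sketch (illustrative data): turn a sparse term matrix like
# corpus_tfidf.npz into a Neurosynth-style features TSV, with PMIDs as rows
# and vocabulary terms as columns.
import numpy as np
import pandas as pd
from scipy import sparse

tfidf = sparse.csr_matrix(np.array([[0.0, 0.3], [0.5, 0.0]]))  # stand-in for corpus_tfidf.npz
pmids = [10000001, 10000002]                                   # stand-in for pmids.txt
vocab = ["amygdala", "working memory"]                         # stand-in for the vocabulary list

features = pd.DataFrame(tfidf.toarray(), index=pmids, columns=vocab)
features.index.name = "pmid"
features.to_csv("neuroquery_features.tsv", sep="\t")
```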

tsalo commented 3 years ago

I'm spending some time reviewing the NeuroQuery dataset, and there's obviously a lot more information there than in the Neurosynth dataset. I think the best plan is to (1) break the "feature" files down by section/source (which is how the npz files are already organized) and (2) update convert_neurosynth_to_json to support multiple feature files.

However, I worry that this ignores a number of useful elements of the dataset, including (1) the categories each term may be assigned to and (2) the synonym mappings. The only similar structure in NiMARE at the moment is in the cogat module, so we should break this down into both a Neurosynth-format set of files and a "vocabulary" like the Cognitive Atlas. I can handle any file reorganization that might be needed for the latter on NiMARE's end of things, although I'll probably start by just working on the Dataset conversion.

@jeromedockes, a couple of questions for you, if you have the time:

  1. Which count files are used to produce corpus_tfidf.npz? Are the counts from all of the sections just added together and then TF-IDF-transformed? Also, do the TF-IDF values come from the smaller or the larger vocabulary?
  2. Were the synonyms defined in vocabulary.csv_voc_mapping_identity.json combined before term counts were generated, before the TF-IDF transformation, or after both?
  3. Would it be possible to regularly update the dataset?
jeromedockes commented 3 years ago

I'm spending some time reviewing the NeuroQuery dataset, and there's obviously a lot more information there than in the Neurosynth dataset. I think the best plan is to (1) break the "feature" files down by section/source (which is how the npz files are already organized) and (2) update convert_neurosynth_to_json to support multiple feature files.

However, I worry that this ignores a number of useful elements of the dataset, including (1) the categories each term may be assigned to and (2) the synonym mappings. The only similar structure in NiMARE at the moment is in the cogat module, so we should break this down into both a Neurosynth-format set of files and a "vocabulary" like the Cognitive Atlas. I can handle any file reorganization that might be needed for the latter on NiMARE's end of things, although I'll probably start by just working on the Dataset conversion.

Please let me know how I can help transform the NeuroQuery data into the most convenient format. Based on yesterday's conversation, I'll create a version in the Neurosynth format (but, as you say, not everything can be represented that way).

  1. Which count files are used to produce corpus_tfidf.npz?

Those in training_data/corpus_word_counts/pmids.txt

Are the counts from all of the sections just added together and then TF-IDF-transformed?

Each section's counts are transformed to TF-IDF separately, and the per-section TF-IDF matrices are then averaged (with equal weights) to produce the final TF-IDF. This means that a term appearing in a shorter section (e.g., the title) gets a larger weight in the final representation than the same term appearing in a longer section (e.g., the body).
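
For illustration, here is a small sketch of that scheme with made-up count matrices: each section is TF-IDF-transformed on its own, and the resulting matrices are averaged with equal weights. The real pipeline fits on the full corpus, so treat this only as a toy demonstration of the averaging step.

```python
# Sketch of per-section TF-IDF followed by equal-weight averaging.
# The count matrices are toy data; the real counts come from the
# training_data/corpus_word_counts files.
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

title_counts = np.array([[1, 0, 0], [0, 1, 0]])   # short section
body_counts = np.array([[5, 2, 0], [0, 7, 3]])    # long section

section_tfidf = [
    TfidfTransformer().fit_transform(counts).toarray()
    for counts in (title_counts, body_counts)
]
final_tfidf = np.mean(section_tfidf, axis=0)  # equal weights across sections
print(final_tfidf)
```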

Also, do the TF-IDF values come from the smaller or the larger vocabulary?

From the smaller one. I can compute the TF-IDF for the large vocabulary if you think that would be useful, but the goal of corpus_word_counts_large_vocabulary was just to provide data that is as complete and raw as possible (since providing the actual text is not allowed), for users who may want to process it differently.

  2. Were the synonyms defined in vocabulary.csv_voc_mapping_identity.json combined before term counts were generated, before the TF-IDF transformation, or after both?

After both. The vocabulary.txt and TF-IDF files have 7547 terms. 1239 of those terms are mapped to a synonym by the NeuroQuery tokenizer, so the output of the tokenizer, i.e., the number of features used in the NeuroQuery linear model, ends up being 7547 - 1239 = 6308 terms after merging synonyms.
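
A small sketch of what collapsing synonyms after the TF-IDF step could look like, assuming the JSON mapping has the form {variant_term: canonical_term}; the terms and values below are toy data.

```python
# Sketch of collapsing synonym columns after TF-IDF, assuming a mapping of the
# form {variant_term: canonical_term} as in the voc_mapping_identity JSON.
import pandas as pd

tfidf = pd.DataFrame(
    {"working memory": [0.2, 0.0], "short term memory": [0.1, 0.3], "amygdala": [0.0, 0.4]},
    index=[10000001, 10000002],
)
voc_mapping = {"short term memory": "working memory"}  # toy mapping

# Rename variant columns to their canonical term, then sum duplicate columns.
merged = tfidf.rename(columns=voc_mapping).T.groupby(level=0).sum().T
print(merged)  # 3 columns collapse to 2 after merging synonyms
```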

  3. Would it be possible to regularly update the dataset?

Unfortunately, not easily at the moment -- but it really should be possible. It will require a little work: to produce a public, maintainable version of the data collection, the Elsevier part needs to be dropped and the stylesheets need to be simplified a bit, plus probably a bit of refactoring and documentation. I'll try to find some time for this, but I'm not sure when.

tsalo commented 3 years ago

Thanks @jeromedockes! Those answers are really helpful. I took a quick crack at converting the files, and it looks like the Neurosynth-format files will end up being really big. Given the size, I wonder if somewhere better suited to large files, like FigShare or OSF, would be a better home. WDYT?

It could be fine once the files are compressed, though; I haven't tried that out yet.

jeromedockes commented 3 years ago

Thanks @jeromedockes! Those answers are really helpful. I took a quick crack at converting the files, and it looks like the Neurosynth-format files will end up being really big.

Thanks a lot for looking into this!

I believe file size was one of the reasons I chose to store the TF-IDF in .npz files rather than TSV, as the Neurosynth format does. Are these the files that pose a problem (i.e., features.txt after conversion to the Neurosynth format)?

Which files are you thinking of including? Those with the full vocabulary are probably huge, but maybe the subset with 7K terms is sufficient?

Given the size, I wonder if somewhere better suited to large files, like FigShare or OSF, would be a better home. WDYT?

There is a NeuroQuery OSF project:

https://osf.io/5qjpb/

that I have been using to store larger files; that might be a good option.

Still, online storage is not the only issue -- keeping the dataset reasonably small is also nice for download times and users' disk space, and reading large TSV files is slow and can use a lot of memory.

It could be fine once the files are compressed, though; I haven't tried that out yet.

I believe the Neurosynth-format TSV TF-IDF files will compress very well -- since the TF-IDF matrices are very sparse, these files contain long stretches of just the separator character. For example, Neurosynth's features.txt goes from 189 MB to 9.5 MB once gzipped.
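
As a quick check of the compression claim, something like the following could be used to compare raw vs. gzipped sizes of a sparse features TSV; the toy matrix below is made up, and the actual savings depend on the real features file.

```python
# Sketch: write a mostly-zero features table as plain and gzipped TSV and
# compare on-disk sizes (toy data; real savings depend on the actual matrix).
import os
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dense = rng.random((1000, 500)) * (rng.random((1000, 500)) < 0.01)  # ~1% nonzero
features = pd.DataFrame(dense)

features.to_csv("features.tsv", sep="\t")
features.to_csv("features.tsv.gz", sep="\t", compression="gzip")
print(os.path.getsize("features.tsv"), os.path.getsize("features.tsv.gz"))
```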

tsalo commented 3 years ago

Which files are you thinking of including? Those with the full vocabulary are probably huge, but maybe the subset with 7K terms is sufficient?

I was planning to set up all of the feature files as separate Neurosynth-format files, like fullvocab_title_counts.tsv, standardvocab_title_counts.tsv, and standardvocab_corpus_tfidf.tsv. Then, within the NiMARE Dataset, those could be loaded as annotations with different "feature groups" (i.e., prefixes), like Neuroquery_TitleCounts__.
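
For reference, the prefixing scheme could look something like this; the prefix string is the one suggested above, while the toy counts and the exact column handling inside NiMARE are just illustrative assumptions.

```python
# Sketch of how per-file feature columns could be prefixed into "feature
# groups" before being stored in a Dataset's annotations table.
import pandas as pd

title_counts = pd.DataFrame({"amygdala": [2, 0], "memory": [1, 3]}, index=[10000001, 10000002])
annotations = title_counts.add_prefix("Neuroquery_TitleCounts__")
print(annotations.columns.tolist())
# ['Neuroquery_TitleCounts__amygdala', 'Neuroquery_TitleCounts__memory']
```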

Still, online storage is not the only issue -- keeping the dataset reasonably small is also nice for download times and users' disk space, and reading large TSV files is slow and can use a lot of memory.

I could probably let the fetcher accept a "feature_set" argument or something, so it would only download specific features. That would at least limit the amount of space the dataset would take up.
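
A rough sketch of what such a feature_set option could look like; the function name, file names, and FEATURE_FILES mapping below are hypothetical and do not correspond to an existing NiMARE API.

```python
# Hypothetical sketch of a feature_set-aware fetcher; FEATURE_FILES and
# fetch_neuroquery are illustrative only and do not exist in NiMARE.
FEATURE_FILES = {
    "standardvocab_corpus_tfidf": "standardvocab_corpus_tfidf.tsv.gz",
    "standardvocab_title_counts": "standardvocab_title_counts.tsv.gz",
    "fullvocab_title_counts": "fullvocab_title_counts.tsv.gz",
}


def fetch_neuroquery(data_dir, feature_sets=("standardvocab_corpus_tfidf",)):
    """Return only the requested feature files (download step omitted here)."""
    selected = [FEATURE_FILES[name] for name in feature_sets]
    # A real implementation would download the selected files into data_dir
    # (skipping files that are already cached) and return their local paths.
    return selected
```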

I believe the Neurosynth-format TSV TF-IDF files will compress very well

Awesome. We may want to reorganize the Neurosynth data as well then.

tsalo commented 3 years ago

I have another question about the corpus. I noticed that the files in training_data contain 13,881 studies, while the official corpus files in neuroquery_model contain 13,459 studies. What is the source of the difference?

jeromedockes commented 3 years ago

Awesome. We may want to reorganize the Neurosynth data as well then.

Still, uncompressing and loading a large TSV takes time, and loading it into a dense array takes memory -- if the Neurosynth data is reorganized, maybe another format, such as npz, should be considered instead.
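
For comparison, a small sketch of the npz route: save and load the sparse matrix with scipy, and densify into a DataFrame only when one is actually needed. The matrix and labels below are toy data; the real labels would come from the PMID and vocabulary files.

```python
# Sketch: round-trip a sparse count matrix through .npz and densify into a
# DataFrame only at load time (toy matrix; labels are illustrative).
import numpy as np
import pandas as pd
from scipy import sparse

counts = sparse.csr_matrix(np.array([[0, 2, 0], [1, 0, 0]]))
sparse.save_npz("counts.npz", counts)

loaded = sparse.load_npz("counts.npz")  # stays sparse on disk and in memory
df = pd.DataFrame(
    loaded.toarray(),  # densify only when a DataFrame is required
    index=[10000001, 10000002],
    columns=["amygdala", "memory", "insula"],
)
print(df)
```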

jeromedockes commented 3 years ago

I have another question about the corpus. I noticed that the files in training_data contain 13,881 studies, while the official corpus files in neuroquery_model contain 13,459 studies. What is the source of the difference?

Indeed, sorry about that. They correspond to studies that have all of their coordinates outside of the MNI brain mask. I will remove them from the files in training_data.

tsalo commented 3 years ago

They correspond to studies that have all of their coordinates outside of the MNI brain mask.

Thanks for the clarification!

if the Neurosynth data is reorganized, maybe another format, such as npz, should be considered instead

I'm happy to use npz in both repositories, although NiMARE will need to internally reformat to DataFrames.

EDIT: Also, I think the count files could be smaller if you used int instead of float64.

jeromedockes commented 3 years ago

I'm happy to use npz in both repositories, although NiMARE will need to internally reformat to DataFrames.

In theory, DataFrames can hold sparse data (see the pandas documentation) -- but it may not be worth the trouble.

EDIT: Also, I think the count files could be smaller if you used int instead of float64.

Good point! I will do that.
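
For example, casting the stored count values from float64 down to an integer dtype roughly halves their size (int32 assumed here; a quick sketch with a toy matrix):

```python
# Sketch: store word counts as integers instead of float64 to shrink the file.
# int32 values take 4 bytes each instead of 8 (toy matrix below).
import numpy as np
from scipy import sparse

counts_f64 = sparse.csr_matrix(np.array([[0.0, 2.0, 0.0], [1.0, 0.0, 5.0]]))
counts_i32 = counts_f64.astype(np.int32)

print(counts_f64.data.nbytes, counts_i32.data.nbytes)  # bytes used by the stored values
```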

tsalo commented 3 years ago

In theory, DataFrames can hold sparse data

Whoa... okay, I'm going to look into that for NiMARE's DataFrame-based Dataset attributes. Seems like it could be pretty useful.
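
For anyone following along, the pandas feature being referred to is the sparse accessor; a minimal example of building a sparse-backed DataFrame from a scipy sparse matrix (random toy data):

```python
# Minimal example of a sparse-backed DataFrame built from a scipy sparse
# matrix, as discussed above (toy data).
import pandas as pd
from scipy import sparse

tfidf = sparse.random(1000, 500, density=0.01, format="csr", random_state=0)
df = pd.DataFrame.sparse.from_spmatrix(tfidf)

print(df.sparse.density)        # fraction of nonzero values (~0.01)
print(df.memory_usage().sum())  # far smaller than the dense equivalent
```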