idiap / gile

A generalized input-label embedding for text classification
GNU General Public License v3.0
Running gile on custom data: issue with util.load_vectors() #1

Closed: dtuggener closed this issue 4 years ago

dtuggener commented 4 years ago

I'm trying to train gile on a custom dataset. I converted the data to the required JSON structure based on https://github.com/idiap/mhan/blob/master/fetch_data.py (i.e. lists of word ids).

When I try to train the model using run.py, I get an IndexError on this line in util.py: https://github.com/idiap/gile/blob/96438d89e6cc8b0ef68820463a73c3e3342c9d5b/util.py#L69

*** IndexError: index 2446 is out of bounds for axis 1 with size 816

My label set is of size 816, but the word ids of the label words obviously range higher than that. For this particular example, y_idxs[idx] looks like this: [[310], [2446], [7075]]
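To make the failure mode concrete, here is a minimal sketch of what happens when vocabulary word ids are used as positions in a label vector. It is not the actual code from util.py: `to_multi_hot` is a hypothetical stand-in for the encoding step, and it uses plain Python lists rather than NumPy so it runs without dependencies (the exception type is the same, the message differs).

```python
def to_multi_hot(label_ids, num_labels):
    """Hypothetical stand-in for the label-encoding step (not gile's code).

    label_ids is a list of single-element lists, e.g. [[310], [2446]],
    and each id is used directly as a position in the output vector.
    """
    row = [0] * num_labels
    for ids in label_ids:
        for i in ids:
            row[i] = 1  # raises IndexError when i >= num_labels
    return row


num_labels = 816  # size of the label set from the report above

# Ids that fall inside the label set encode without problems.
ok = to_multi_hot([[3], [5]], num_labels)

# Vocabulary ids from the report exceed the label set size and fail.
try:
    to_multi_hot([[310], [2446], [7075]], num_labels)
    failed = False
except IndexError:
    failed = True
```

The first call succeeds only because the ids happen to be smaller than 816; the second fails on 2446, which is the index named in the traceback.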

If I read https://github.com/idiap/mhan/blob/ffcfb8df5e004a4f1d12de7500b512b74399e099/fetch_data.py#L30 correctly (extract_wordids(keywords, lang, vocab)), it splits label names into tokens and returns the list of their token indexes in the vocabulary. How can this match the labelset size in order for to_categorical() to work?

nik0spapp commented 4 years ago

Thanks for raising this issue!

It turns out that the data format produced by fetch_data.py in the mhan repository was not compatible with the format expected by the other scripts. I suppose most people were either using the pre-processed data, which was already in the right format, or checking directly which data format the training scripts expect.

The above error is raised because the y_idxs[idx] are indexed according to the vocabulary and not according to the label set. I have fixed fetch_data.py in the mhan repository to perform this re-indexing; please take a look at the re_index() function.

This should resolve the issue you described but please let me know if you have any further questions.
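For readers preparing their own data, the re-indexing idea can be sketched as follows. This is an illustration under assumptions, not the re_index() implementation from the mhan repository: the helper names `build_label_index` and `reindex_example` are hypothetical, and each label is assumed to be a list of vocabulary word ids.

```python
def build_label_index(all_labels):
    """Assign each distinct label a contiguous index in 0..n-1.

    A label is a list of vocabulary word ids; it is converted to a
    tuple so it can be used as a dictionary key.
    """
    index = {}
    for label in all_labels:
        key = tuple(label)
        if key not in index:
            index[key] = len(index)
    return index


def reindex_example(example_labels, label_index):
    """Replace vocabulary-based labels with label-set indices."""
    return [label_index[tuple(label)] for label in example_labels]


# Hypothetical label set: three single-word labels and one two-word label.
label_set = [[310], [2446], [7075], [12, 40]]
label_index = build_label_index(label_set)

# The example from the report now maps to indices below len(label_set).
new_ids = reindex_example([[310], [2446], [7075]], label_index)
```

After this step every label id is smaller than the label set size, so a one-hot or multi-hot encoding over the label set no longer goes out of bounds.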

dtuggener commented 4 years ago

> The above error is raised because the y_idxs[idx] are indexed according to the vocabulary and not according to the label set

Exactly. I also fixed this on my end, and it worked afterwards. Thanks for the quick reply and the update!