Closed dtuggener closed 4 years ago
Thanks for raising this issue!
It turns out that the data format produced by fetch_data.py in the mhan repository was not compatible with the format expected by the other scripts. I suppose most people were either using the pre-processed data, which was already in the right format, or looking directly at the data format the training scripts expect.
The above error is raised because the entries of y_idxs[idx] are indexed according to the vocabulary and not according to the label set. I have fixed fetch_data.py in the mhan repository to re-index the labels; please take a look at the re_index() function.
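For readers converting their own data, the re-indexing idea can be sketched as follows. This is only an illustration under assumptions, not the code of re_index() itself: the helper names build_label_index and to_multi_hot are hypothetical, and each label is assumed to be a list of vocabulary word ids, as in the example from the question.

```python
import numpy as np


def build_label_index(y_idxs):
    """Map each distinct label (a tuple of vocabulary word ids) to a dense id.

    Hypothetical helper: ids are assigned in order of first appearance,
    so they always fall in the range [0, number_of_labels).
    """
    label_to_id = {}
    for doc_labels in y_idxs:
        for label in doc_labels:
            key = tuple(label)
            if key not in label_to_id:
                label_to_id[key] = len(label_to_id)
    return label_to_id


def to_multi_hot(doc_labels, label_to_id):
    """Encode one document's labels as a 0/1 vector over the label set."""
    vec = np.zeros(len(label_to_id), dtype=np.float32)
    for label in doc_labels:
        vec[label_to_id[tuple(label)]] = 1.0
    return vec


# Toy data: two documents whose labels are given as vocabulary word ids.
y_idxs = [[[310], [2446], [7075]], [[2446]]]
label_to_id = build_label_index(y_idxs)
Y = np.stack([to_multi_hot(doc, label_to_id) for doc in y_idxs])
```

With dense label ids, the width of the encoded matrix equals the size of the label set, so no index can exceed it.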
This should resolve the issue you described but please let me know if you have any further questions.
> The above error is raised because the y_idxs[idx] are indexed according to the vocabulary and not according to the label set
Exactly, I also fixed this on my end and it worked afterwards. Thanks for the quick reply and the update!
I'm trying to train gile on a custom dataset. I convert the data to the required json structure based on https://github.com/idiap/mhan/blob/master/fetch_data.py (i.e. lists of word ids).
When I try to train the model using run.py, I get an IndexError on this line in util.py: https://github.com/idiap/gile/blob/96438d89e6cc8b0ef68820463a73c3e3342c9d5b/util.py#L69, i.e.:
*** IndexError: index 2446 is out of bounds for axis 1 with size 816
My label set is of size 816, but the word ids of the label words obviously range higher than that. For example, y_idxs[idx] for this particular example looks like this: [[310], [2446], [7075]]
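The mismatch can be reproduced in isolation with a few lines of NumPy. This is a minimal sketch, assuming a plain one-hot encoder (the hypothetical one_hot_rows below) as a stand-in for the encoding step in util.py; it is not the repository's code.

```python
import numpy as np

n_labels = 816                          # size of the label set
doc_labels = [[310], [2446], [7075]]    # vocabulary word ids, not label ids


def one_hot_rows(ids, num_classes):
    """One-hot encode a list of integer ids into a (len(ids), num_classes) array."""
    out = np.zeros((len(ids), num_classes), dtype=np.float32)
    out[np.arange(len(ids)), ids] = 1.0
    return out


# Take the single word id of each label and try to encode it.
flat_ids = [ids[0] for ids in doc_labels]
try:
    one_hot_rows(flat_ids, n_labels)
    failed = False
except IndexError as err:
    # Ids 2446 and 7075 do not fit into an array with 816 columns.
    failed = True
    message = str(err)
```

Any id that is at least as large as the label set size triggers the error, which is why vocabulary ids cannot be passed through unchanged.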
If I read https://github.com/idiap/mhan/blob/ffcfb8df5e004a4f1d12de7500b512b74399e099/fetch_data.py#L30 correctly (extract_wordids(keywords, lang, vocab)), it splits label names into tokens and returns the list of their token indexes in the vocabulary. How can this match the label set size in order for to_categorical() to work?