feature dimension mismatch between train and test data

xiaohan2012 commented 3 years ago

Hi,

There seems to be a bug in the data loading process.

For example:

```python
from napkinxc.datasets import load_dataset

trn_X, _ = load_dataset('wiki10-31k', 'train')
tst_X, _ = load_dataset('wiki10-31k', 'test')
print('# of features of training data', trn_X.shape[1])
print('# of features of test data', tst_X.shape[1])
```

gives:

```
# of features of training data 101938
# of features of test data 101937
```

Cheers, Han

mwydmuch commented 3 years ago

Hi @xiaohan2012, it's not a bug; the files are loaded correctly. Feature 101937 simply never appears in the wiki10-31k test set, while it does appear in the train set. This is also the case for some other datasets from the XMLC repo. The loading method reads all libsvm file format variants, and the standard libsvm format does not store the number of features/columns, so when the data are cast to scipy.sparse.csr_matrix, the value of shape[1] is simply deduced from the loaded data. I agree that it would be nicer if the numbers of columns matched, and this could be improved, but since the data are sparse, resizing the matrix is not really a problem.
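To illustrate why the column count can differ, here is a minimal pure-Python sketch of how a loader might deduce the number of features from libsvm-style lines. It is not napkinXC's actual loading code: the `infer_num_features` helper and the toy lines are hypothetical, and feature indices are assumed to be zero-based.

```python
def infer_num_features(lines):
    """Return the column count implied by the largest feature index seen.

    Hypothetical helper for illustration only; assumes zero-based indices.
    """
    max_index = -1
    for line in lines:
        # The first token is the label field; the rest are index:value pairs.
        for pair in line.split()[1:]:
            index, _value = pair.split(":")
            max_index = max(max_index, int(index))
    return max_index + 1


# Toy data: feature 4 occurs only in the "train" split.
train_lines = ["1 0:0.5 4:1.0", "0 2:0.3"]
test_lines = ["1 0:0.2 3:0.7"]

n_train = infer_num_features(train_lines)
n_test = infer_num_features(test_lines)
print(n_train, n_test)
```

Because the file carries no explicit column count, a split that happens to lack the highest-indexed feature ends up one column narrower, which matches the situation described for wiki10-31k.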

xiaohan2012 commented 3 years ago

Thanks for the reply.

As for resizing it, I assume I should add a zero column somewhere in the smaller matrix (tst_X in the above example).

Where should I add it, before the 1st column or after the last one?

xiaohan2012 commented 3 years ago

Ah, sorry, I got it. I append a zero column after the last column :)
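For anyone landing here later, a minimal sketch of that fix, assuming the matrices are SciPy CSR matrices as mentioned above. The `pad_columns` helper is hypothetical (not part of napkinXC), and the small matrices below are toy stand-ins for trn_X and tst_X.

```python
import numpy as np
from scipy.sparse import csr_matrix


def pad_columns(X, n_features):
    """Return a CSR matrix with trailing all-zero columns appended.

    Hypothetical helper: only the declared shape changes, the stored
    non-zero entries are reused as they are.
    """
    X = csr_matrix(X)
    if X.shape[1] > n_features:
        raise ValueError("matrix already has more columns than n_features")
    return csr_matrix((X.data, X.indices, X.indptr),
                      shape=(X.shape[0], n_features))


# Toy stand-ins: the "test" matrix is one column short.
trn_X = csr_matrix(np.array([[1, 0, 0, 2], [0, 3, 0, 4]]))
tst_X = csr_matrix(np.array([[5, 0, 6]]))

padded = pad_columns(tst_X, trn_X.shape[1])
print(padded.shape)
```

Appending after the last column keeps every existing feature index unchanged, so the columns of the two matrices still refer to the same features.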

mwydmuch / napkinXC

feature dimension mismatch between train and test data #18