hyperdimensional-computing / torchhd

Torchhd is a Python library for Hyperdimensional Computing and Vector Symbolic Architectures
https://torchhd.readthedocs.io
MIT License

Data issue #163

Closed OMGAmici closed 7 months ago

OMGAmici commented 7 months ago

I'm having issues trying this library out on a dataset not included in torchhd.datasets.

Following the example in language_recognition.py, I've adapted my dataset (WNLI, if it's relevant) into what seems to be the same format used for EuropeanLanguages in torchhd.datasets. Yet when I replace the EuropeanLanguages dataset with my dataset in an adapted version of language_recognition.py, I run into an index mismatch error.

The error occurs at line 64, symbols = self.symbol(x). If I change line 61, self.symbol = embeddings.Random(size, out_features, padding_idx=PADDING_IDX), to embedding.Linear(), the error disappears. I've confirmed that everything about my dataset (data type, dimensions, etc.) seems to align with the EuropeanLanguages dataset, so I'm a bit lost as to what is off. Detailed error output: error.txt

I've combed through the documentation but, short of reading the entire codebase, I haven't found anything conclusive to indicate where the disconnect is. Is there a guide somewhere for how to properly use this library on datasets that aren't already integrated? Thank you in advance!

mikeheddes commented 7 months ago

Hi, thank you for opening this issue. Note that Torchhd is agnostic to which dataset is being used, so you can use it with your own datasets. The provided datasets are there for your convenience. That being said, the first thing that comes to mind is that your dataset might include more characters. The language_recognition.py example currently only supports the ASCII characters "a" through "z" plus a space " ". If your dataset contains other characters, you will have to either filter them out or extend the example to support them by increasing NUM_TOKENS and modifying transform accordingly. Perhaps you can print the raw strings and the transformed tensor input that cause the error?
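As a rough illustration of the filtering option, here is a minimal pure-Python sketch. The helper names (ALPHABET, find_unsupported, clean_text, encode) are hypothetical and not part of Torchhd, and the id layout (padding at 0, then "a" through "z", then space) is an assumption that may differ from what transform in the example actually does.

```python
import string

# Assumed layout: id 0 is reserved for padding, ids 1..27 cover "a".."z" and " ".
PADDING_IDX = 0
ALPHABET = string.ascii_lowercase + " "
NUM_TOKENS = len(ALPHABET) + 1  # 26 letters + space + padding


def find_unsupported(text):
    """Return the sorted characters in text that the alphabet does not cover."""
    return sorted(set(text.lower()) - set(ALPHABET))


def clean_text(text):
    """Lowercase text and drop every character outside the alphabet."""
    return "".join(ch for ch in text.lower() if ch in ALPHABET)


def encode(text):
    """Map cleaned text to integer ids in the range 1..NUM_TOKENS - 1."""
    return [ALPHABET.index(ch) + 1 for ch in clean_text(text)]


sample = "Don't stop, 42!"
print(find_unsupported(sample))  # characters that would need filtering
print(encode(sample))            # every id stays below NUM_TOKENS
```

Running find_unsupported over each raw string in a dataset is a quick way to see whether punctuation, digits, or accented letters are present before anything reaches the embedding layer.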

It would also help if you could run the code on the CPU instead of the GPU so that the error is raised at the right location, because the error in your error.txt does not match your description. If you have further details, I am happy to see if I can help.
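Independently of the device, one way to make the failure point obvious is to validate the integer ids before they reach the embedding lookup. The sketch below is only an illustration: check_token_ids is a hypothetical helper, not a Torchhd function, and it assumes the ids are available as a plain Python list (for a tensor, something like x.tolist() would be needed first).

```python
def check_token_ids(token_ids, num_embeddings):
    """Raise a descriptive error for the first id outside [0, num_embeddings)."""
    for position, token_id in enumerate(token_ids):
        if not 0 <= token_id < num_embeddings:
            raise ValueError(
                f"token id {token_id} at position {position} is outside "
                f"the valid range [0, {num_embeddings})"
            )
    return token_ids


# Passes: all ids fit an embedding table with 28 rows.
check_token_ids([1, 5, 27], 28)

# Fails early with a readable message instead of an opaque indexing error.
try:
    check_token_ids([1, 40], 28)
except ValueError as error:
    print(error)
```

A check like this is cheap enough to leave in while debugging and can be removed once the input pipeline is known to produce valid ids.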

OMGAmici commented 7 months ago

Thank you for this thoughtful and quick response! Your suspicion that my dataset contained more characters was spot on.

mikeheddes commented 7 months ago

Glad I could help. Hope you enjoy using Torchhd!