Closed OMGAmici closed 7 months ago
Hi, thank you for opening this issue. Note that Torchhd is agnostic to which dataset is being used, so you can use it with your own datasets; the provided datasets are there for your convenience. That being said, the first thing that comes to mind is that your dataset might include more characters. The `language_recognition.py` example currently only supports the ASCII characters `"a"` through `"z"`, plus a space `" "`. If your dataset contains other characters, you will have to either filter them out or extend the example to support them by increasing `NUM_TOKENS` and modifying `transform` accordingly. Perhaps you can print the raw strings and the transformed tensor input that causes the error?
It would also help if you could run the code on the CPU instead of the GPU, so that the error is raised at the right location, because the error in your error.txt does not match your description. If you have further details, I am happy to see if I can help.
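To make the filtering suggestion above concrete, here is a minimal sketch of a character-to-index transform that supports only `"a"` through `"z"` plus space and silently drops everything else. This is an illustration of the approach, not the actual `transform` from `language_recognition.py`; the mapping and function name are assumptions for the example.

```python
import string

# Hypothetical sketch: map the supported characters "a"-"z" plus
# space to integer token ids, and filter out any other character
# before the string reaches the embedding.
SUPPORTED = string.ascii_lowercase + " "
CHAR_TO_INDEX = {c: i for i, c in enumerate(SUPPORTED)}  # 27 tokens

def transform(text: str) -> list[int]:
    # Lowercase first, then keep only characters in the supported set.
    return [CHAR_TO_INDEX[c] for c in text.lower() if c in CHAR_TO_INDEX]

# Punctuation such as "," and "!" is dropped; the rest is tokenized.
print(transform("Hello, World!"))
```

Extending the example to more characters would instead mean enlarging `SUPPORTED` (and `NUM_TOKENS` to match) rather than filtering.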
Thank you for this thoughtful and quick response! Your suspicion that my dataset contained more characters was spot on.
Glad I could help. Hope you enjoy using Torchhd!
I'm having issues trying this library out on a dataset not included in torchhd.datasets.
Following the example in language_recognition.py, I've adapted my dataset (if it's relevant, WNLI) into what seems to be the same format as used for EuropeanLanguages in torchhd.datasets. Yet, when I try to replace the EuropeanLanguages dataset with my dataset in an adapted version of language_recognition.py, I run into an index mismatch error.
The error occurs at line 64, `symbols = self.symbol(x)`. If I change line 61, `self.symbol = embeddings.Random(size, out_features, padding_idx=PADDING_IDX)`, to `embedding.Linear()`, the error disappears. I've confirmed that everything about my dataset (data type, dimensions, etc.) seems to align with the EuropeanLanguages dataset, so I'm a bit lost as to what is off. Detailed error output: error.txt

I've combed through the documentation but, short of reading the entire codebase, I haven't found anything conclusive to indicate where the disconnect is. Is there a guide somewhere for how to properly use this library on datasets that aren't already integrated? Thank you in advance!