swapniljadhav1921 / asamiasami

State-Of-The-Art & ready to use mini NLP models for Indian Languages
MIT License

fairseq\fairseq\data\dictionary.py", line 259, in add_from_file raise ValueError( ValueError: Incorrect dictionary format, expected '<token> <cnt> [flags]' #2

Open samrudh opened 3 years ago

samrudh commented 3 years ago

OS: Windows

While creating an instance of the class, the error below occurred:

q\fairseq\data\dictionary.py", line 246, in add_from_file
    count = int(field)
ValueError: invalid literal for int() with base 10: 'https://git-lfs.github.com/spec/v1'

The stack trace shows:

fairseq\data\dictionary.py", line 259, in add_from_file
    raise ValueError(
ValueError: Incorrect dictionary format, expected '<token> <cnt> [flags]'

I believe the field is being set to a wrong value somewhere.

Code:

from asamiasami import Hi2EnTranslator
hi2EnObj = Hi2EnTranslator()
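
A quick way to see what fairseq is actually parsing is to print the first few lines of the dictionary file the model loads (a minimal sketch; the path below is a placeholder, point it at the dict.en.txt shipped with the model):

# Sketch: inspect the first lines of the dictionary file fairseq tries to parse.
# DICT_PATH is a placeholder -- use the dict file that the model actually loads.
DICT_PATH = "path/to/dict.en.txt"

with open(DICT_PATH, "r", encoding="utf-8") as f:
    for i, line in enumerate(f):
        print(repr(line))
        if i >= 4:
            break

# A valid fairseq dictionary has lines like '<token> <cnt>'; the value in the
# traceback above ('https://git-lfs.github.com/spec/v1') suggests something
# else ended up in that file.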
swapniljadhav1921 commented 3 years ago

This is because of the fairseq version or the git-lfs version. Please follow the updated installation steps, and also make sure you are using the latest version of the repo.

swapniljadhav1921 commented 3 years ago

Closing the issue due to inactivity. Please reopen it if you have further doubts.

akshay951228 commented 3 years ago

Hi, thanks for your great work! I'm still facing the same error after following the exact setup in the README. Error: ValueError: Incorrect dictionary format, expected '<token> <cnt> [flags]'

lacls commented 3 years ago

> Hi, thanks for your great work! I'm still facing the same error after following the exact setup in the README. Error: ValueError: Incorrect dictionary format, expected '<token> <cnt> [flags]'

This is exactly what I am facing too. In my case it happens in the tokenization code of a specific model (I am using PhoBERT; the file is stored at transformers/models/PhoBERT/Tokenization_PhoBert):

def add_from_file(self, f):
    """
    Loads a pre-existing dictionary from a text file and adds its symbols to this instance.
    """
    if isinstance(f, str):
        try:
            with open(f, "r", encoding="utf-8") as fd:
                self.add_from_file(fd)
        except FileNotFoundError as fnfe:
            raise fnfe
        except UnicodeError:
            raise Exception(f"Incorrect encoding detected in {f}, please rebuild the dataset")
        return

    lines = f.readlines()
    for lineTmp in lines:
        line = lineTmp.strip()
        idx = line.rfind(" ")
        if idx == -1:
            raise ValueError("Incorrect dictionary format, expected '<token> <cnt>'")
        word = line[:idx]
        self.encoder[word] = len(self.encoder)


This is because the code only appends the token, without the cnt tag (I don't really know what it is used for).

If you guys have any other approach, please share. I would really appreciate it.
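
One workaround I can think of (just a sketch, not necessarily the intended fix; file names are placeholders): if the vocab file has one token per line and no count, rewrite it with a dummy count so the '<token> <cnt>' check passes.

# Workaround sketch: append a dummy count to a token-only vocab file so the
# '<token> <cnt>' parser accepts it. Paths are placeholders; the count value
# is assumed not to matter for plain token lookup.
IN_PATH = "dict.txt"         # one token per line
OUT_PATH = "dict.fixed.txt"  # rewritten as '<token> <cnt>'

with open(IN_PATH, "r", encoding="utf-8") as fin, \
     open(OUT_PATH, "w", encoding="utf-8") as fout:
    for line in fin:
        token = line.strip()
        if token:
            fout.write(f"{token} 1\n")  # dummy count of 1
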
swapniljadhav1921 commented 3 years ago

Reopening the issue ... it is happening because I initially used git-lfs for file storage and later removed it. That converted the dict files into different text. You can check, for example, the dict.en.txt file: no dictionary is present, hence the code fails. I will make sure to provide correct files soon.
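
You can confirm this on your checkout with something like the following (a small sketch; the repo path and the dict.*.txt glob pattern are assumptions based on file names like dict.en.txt):

# Sketch: scan the local clone for dict files that are still git-lfs pointers
# rather than real dictionaries. A pointer file's first line contains the
# spec URL seen in the traceback above.
from pathlib import Path

REPO_ROOT = Path("asamiasami")  # local clone of the repo (assumed path)

for dict_file in REPO_ROOT.rglob("dict.*.txt"):
    with open(dict_file, "r", encoding="utf-8") as f:
        first_line = f.readline()
    if "https://git-lfs.github.com/spec/v1" in first_line:
        print(f"{dict_file}: still an LFS pointer, needs replacing")
    else:
        print(f"{dict_file}: looks like a real dictionary")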

And sorry for the super delayed reply .. I didn't notice because the issue was closed. @akshay951228 @lacls

swapniljadhav1921 commented 3 years ago

Issues with LFS

Due to various issues with LFS (files were initially added to LFS and later removed), unstable file versions are currently present in the repo. The file sizes are big, and GitHub's free tier has size limitations. I propose using the files from this location -> https://drive.google.com/drive/folders/18x_vGGa5v3jT-Zx73u0eKFfDGyw9M_aB?usp=sharing The folder structure is the same ... please replace the git files with these files, and then LFS is not required. Please report any issue you find here -> https://github.com/swapniljadhav1921/asamiasami/issues/2 This is a very inefficient approach .. but I will make it more usable later.
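
For reference, something along these lines should fetch the Drive folder and overwrite the local copies (just a sketch, assuming the third-party gdown package is installed; the local paths are placeholders, and the copy step relies on the Drive folder having the same folder structure as the repo, as stated above):

# Sketch: download the shared Drive folder with gdown and copy its contents
# over the corresponding files in the local repo checkout.
import shutil
from pathlib import Path

import gdown  # pip install gdown

DRIVE_URL = "https://drive.google.com/drive/folders/18x_vGGa5v3jT-Zx73u0eKFfDGyw9M_aB?usp=sharing"
DOWNLOAD_DIR = Path("asamiasami_drive_files")  # where gdown puts the files (placeholder)
REPO_ROOT = Path("asamiasami")                 # local clone of the repo (placeholder)

gdown.download_folder(url=DRIVE_URL, output=str(DOWNLOAD_DIR), quiet=False)

# Mirror the downloaded tree onto the repo, replacing the LFS-pointer files.
for src in DOWNLOAD_DIR.rglob("*"):
    if src.is_file():
        dst = REPO_ROOT / src.relative_to(DOWNLOAD_DIR)
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)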

swapniljadhav1921 commented 3 years ago

@lacls @akshay951228 please do check and let me know in case of any issue ... I have verified it on my setup.

Pogayo commented 2 years ago

I got this error because the dictionary file generated by the sentencepiece tokenizer was tab-separated. Replacing the tabs with spaces solved it for me. Remember to also remove the unknown, bos, and eos tokens from the dictionary if you are using sentencepiece.
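
For anyone hitting the same thing, a small sketch of that conversion (file names are placeholders, and note that fairseq parses the second column with int(), so it must be an integer count):

# Sketch: rewrite a tab-separated sentencepiece vocab into the space-separated
# '<token> <cnt>' form fairseq expects, dropping the special tokens mentioned
# above.
SPECIAL_TOKENS = {"<unk>", "<s>", "</s>"}  # assumed unknown/bos/eos names

with open("sentencepiece.vocab", "r", encoding="utf-8") as fin, \
     open("dict.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        fields = line.rstrip("\n").split("\t")
        token = fields[0]
        if not token or token in SPECIAL_TOKENS:
            continue
        # Keep the second column if it is already an integer count,
        # otherwise fall back to a dummy count of 1.
        count = fields[1] if len(fields) > 1 and fields[1].isdigit() else "1"
        fout.write(f"{token} {count}\n")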