Open samrudh opened 3 years ago
This is caused by the fairseq version or the Git LFS version. Please try the updated installation steps, and make sure you pull the latest version of the repo.
Closing the issue due to inactivity. Please reopen it if you have further questions.
Hi,
thanks for your great work!
I'm still facing the same error after following the exact setup in the README.
Error:
ValueError: Incorrect dictionary format, expected '<token> <cnt> [flags]'
This is exactly what I am facing too. The error comes from the tokenizer of the specific model (in my case PhoBERT; the file is stored at transformers/models/PhoBERT/Tokenization_PhoBert):
def add_from_file(self, f):
    """
    Loads a pre-existing dictionary from a text file and adds its symbols to this instance.
    """
    if isinstance(f, str):
        try:
            with open(f, "r", encoding="utf-8") as fd:
                self.add_from_file(fd)
        except FileNotFoundError as fnfe:
            raise fnfe
        except UnicodeError:
            raise Exception(f"Incorrect encoding detected in {f}, please rebuild the dataset")
        return
    lines = f.readlines()
    for lineTmp in lines:
        line = lineTmp.strip()
        idx = line.rfind(" ")
        if idx == -1:
            raise ValueError("Incorrect dictionary format, expected '
The error is raised because only the token is appended, without the count (the cnt field; I don't really know what it represents).
If you have any other approach, please share. I would really appreciate it.
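In case it helps with debugging: the check quoted above fails on any line that has no space in it, and fairseq's version also parses the last field as an integer. Below is a minimal sketch of a checker that lists the offending lines instead of stopping at the first one. The function name `find_malformed_lines` is hypothetical, and the sketch ignores the optional `[flags]` field.

```python
def find_malformed_lines(lines):
    """Return (line_number, text) pairs that lack a space-separated integer count.

    Hypothetical helper for illustration; it does not handle optional flags.
    """
    bad = []
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        idx = line.rfind(" ")
        # No space at all: there is a token but nothing that could be the count.
        if idx == -1:
            bad.append((number, line))
            continue
        # A space is present, but the last field must still parse as an integer.
        try:
            int(line[idx + 1:])
        except ValueError:
            bad.append((number, line))
    return bad


if __name__ == "__main__":
    sample = ["hello 120", "world", "foo\t7", "bar baz"]
    for number, text in find_malformed_lines(sample):
        print(f"line {number}: {text!r}")
```

To check a real dictionary, pass the result of `open(path, encoding="utf-8").readlines()` instead of the sample list.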
Reopening the issue. This is happening because I initially used git-lfs for file storage and later removed it, which replaced the contents of the dict files with different text. You can check, for example, the dict.en.txt file: it contains no dictionary, so the code fails. I will make sure to provide the correct files soon.
Sorry for the very late reply; I didn't notice because the issue was closed. @akshay951228 @lacls
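A quick way to confirm this on your side: a Git LFS pointer file is a small text file whose first line is `version https://git-lfs.github.com/spec/v1`, so you can test for that prefix before loading a dictionary. This is only a sketch; the helper names `is_lfs_pointer` and `file_is_lfs_pointer` are hypothetical, and `dict.en.txt` is just the example file mentioned above.

```python
# First line of every Git LFS pointer file, per the LFS pointer format.
LFS_POINTER_PREFIX = "version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(text):
    """Return True if the given file contents look like a Git LFS pointer."""
    return text.lstrip().startswith(LFS_POINTER_PREFIX)


def file_is_lfs_pointer(path):
    """Read only the head of the file and check it for the LFS pointer prefix."""
    with open(path, "r", encoding="utf-8", errors="replace") as fd:
        return is_lfs_pointer(fd.read(200))


if __name__ == "__main__":
    pointer = "version https://git-lfs.github.com/spec/v1\noid sha256:abc\nsize 12\n"
    print(is_lfs_pointer(pointer))        # pointer text, not a dictionary
    print(is_lfs_pointer("hello 120\n"))  # looks like a real dictionary line
```

If `file_is_lfs_pointer("dict.en.txt")` returns True, the file on disk is a pointer and needs to be replaced with the real dictionary.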
Issues with LFS
Files were initially added to LFS and later removed, which left unstable file versions in the repo. The files are large, and the free GitHub tier has size limits. I propose using the files from this location: https://drive.google.com/drive/folders/18x_vGGa5v3jT-Zx73u0eKFfDGyw9M_aB?usp=sharing The folder structure is the same, so please replace the git files with these files; LFS is then not required. If you find any problem, please report it here: https://github.com/swapniljadhav1921/asamiasami/issues/2 This is not a very efficient approach, but I will make it more usable later.
@lacls @akshay951228 please check and let me know if there is any issue. I have checked on my setup.
I got this error because the dictionary file generated by my sentencepiece tokenizer was tab-separated. Replacing the tabs with spaces solved it for me. Remember to also remove the unknown, bos, and eos tokens from the dictionary if you are using sentencepiece.
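Roughly, the conversion looks like the sketch below. It assumes the special tokens are named `<unk>`, `<s>`, and `</s>`, so adjust the set if your model uses different names. The function name `convert_vocab_lines` is hypothetical, and the second column is copied through unchanged, so it still has to be something the dictionary loader accepts as a count.

```python
# Assumed special-token names; change these if your vocabulary differs.
SPECIAL_TOKENS = {"<unk>", "<s>", "</s>"}


def convert_vocab_lines(lines):
    """Turn tab-separated vocab lines into space-separated dictionary lines.

    Lines whose first field is a special token are dropped; blank lines are skipped.
    """
    converted = []
    for raw in lines:
        line = raw.rstrip("\n")
        fields = line.split(maxsplit=1)
        if not fields:
            continue  # blank or whitespace-only line
        if fields[0] in SPECIAL_TOKENS:
            continue  # drop unknown, bos, and eos entries
        converted.append(line.replace("\t", " "))
    return converted


if __name__ == "__main__":
    sample = ["<unk>\t0\n", "<s>\t0\n", "</s>\t0\n", "the\t1523\n", "ing\t987\n"]
    for entry in convert_vocab_lines(sample):
        print(entry)
```

Write the returned lines to a new file, one per line, and point the loader at that file instead of the original vocab.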
OS: Windows
While creating a class, the error below occurred:
q\fairseq\data\dictionary.py", line 246, in add_from_file
    count = int(field)
ValueError: invalid literal for int() with base 10: 'https://git-lfs.github.com/spec/v1'
The stack trace shows:
fairseq\data\dictionary.py", line 259, in add_from_file
    raise ValueError(
ValueError: Incorrect dictionary format, expected '<token> <cnt> [flags]'
I believe the wrong value of the field is being set somewhere
Code: