microsoft / MASS

MASS: Masked Sequence to Sequence Pre-training for Language Generation
https://arxiv.org/pdf/1905.02450.pdf

Incorrect dictionary format #166

Open abdullahkhilji opened 4 years ago

abdullahkhilji commented 4 years ago

I have compared the dictionary generated by the XLM code against the sample given here in MASS; although the formats match, it still gives an error.

Traceback (most recent call last):
  File "/home/abdullahkhilji/miniconda3/envs/mass/bin/fairseq-preprocess", line 8, in <module>
    sys.exit(cli_main())
  File "/home/abdullahkhilji/miniconda3/envs/mass/lib/python3.8/site-packages/fairseq_cli/preprocess.py", line 267, in cli_main
    main(args)
  File "/home/abdullahkhilji/miniconda3/envs/mass/lib/python3.8/site-packages/fairseq_cli/preprocess.py", line 80, in main
    src_dict = task.load_dictionary(args.srcdict)
  File "/home/abdullahkhilji/miniconda3/envs/mass/lib/python3.8/site-packages/fairseq/tasks/cross_lingual_lm.py", line 82, in load_dictionary
    return MaskedLMDictionary.load(filename)
  File "/home/abdullahkhilji/miniconda3/envs/mass/lib/python3.8/site-packages/fairseq/data/dictionary.py", line 176, in load
    return cls.load(fd)
  File "/home/abdullahkhilji/miniconda3/envs/mass/lib/python3.8/site-packages/fairseq/data/dictionary.py", line 192, in load
    raise ValueError("Incorrect dictionary format, expected '<token> <cnt>'")
ValueError: Incorrect dictionary format, expected '<token> <cnt>'
abdullahkhilji commented 4 years ago

I created the dictionary the same way GloVe does, and it works, but it takes a lot of time. Is it necessary to cap the number of words in the dictionary? Otherwise it consumes a lot of time.

StillKeepTry commented 4 years ago

As indicated by the error, you should keep the dictionary in the format '<token> <cnt>'. For example:

A 10000
B 10000

The value of cnt does not matter, but it must be provided.

abdullahkhilji commented 4 years ago

I was following the same format. The error went away after I reduced the size of dict.en.txt, which was around 800 MB; only after cutting it below 10 MB, by considering just the fine-tuning data, did it work. A better solution would be to set a threshold on the dictionary size.