microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI
MIT License
19.63k stars 2.51k forks

UnpicklingError: A load persistent id instruction was encountered, but no persistent_load function was specified. #1248

Open bit-scientist opened 1 year ago

bit-scientist commented 1 year ago

Model I am using (UniLM, MiniLM, LayoutLM ...): TrOCR

I am using pic_inference.py to test the trocr-small-handwritten.pt model. The download does not contain any file with a .pt extension. Instead, it contains a folder called archive, which holds a data folder (with several files inside) plus the files data.pkl and version. I put the full path of data.pkl at https://github.com/microsoft/unilm/blob/5255d52de86dad642810f5849dd357769346c1d7/trocr/pic_inference.py#L66 like this: model_path = 'S:/unilm/trocr/models/archive/data.pkl'. Am I doing something wrong?

bit-scientist commented 1 year ago

Hi, @JingyeChen, could you take a look at this issue?

donglixp commented 1 year ago

The downloaded weight trocr-small-handwritten.pt should be one file instead of a folder.
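As background on the error in the title: pickle raises it when a stream contains persistent-ID references and the loader has no persistent_load hook to resolve them. The following is a minimal standard-library sketch of that mechanism, not the actual PyTorch loading code. StorageRef, RefPickler, RefUnpickler, dump_with_refs and plain_load_error are hypothetical names made up for illustration, and it is an assumption that the data.pkl in this checkpoint refers to the files in data/ in a comparable way.

```python
import io
import pickle


class StorageRef:
    """Hypothetical stand-in for data that lives outside the pickle stream."""

    def __init__(self, key):
        self.key = key


class RefPickler(pickle.Pickler):
    def persistent_id(self, obj):
        # Write only a reference for external objects, not their contents.
        if isinstance(obj, StorageRef):
            return ("storage", obj.key)
        return None  # everything else is pickled normally


class RefUnpickler(pickle.Unpickler):
    def __init__(self, file, storages):
        super().__init__(file)
        self.storages = storages

    def persistent_load(self, pid):
        # Resolve a reference by looking it up in the external store.
        _kind, key = pid
        return self.storages[key]


def dump_with_refs(obj):
    buffer = io.BytesIO()
    RefPickler(buffer).dump(obj)
    return buffer.getvalue()


def plain_load_error(payload):
    """Return the message raised by a plain pickle.loads, or None on success."""
    try:
        pickle.loads(payload)
    except pickle.UnpicklingError as exc:
        return str(exc)
    return None


payload = dump_with_refs({"weight": StorageRef("0")})

# A plain load has no persistent_load hook, so it fails.
print(plain_load_error(payload))

# A loader that can see the external data resolves the reference.
restored = RefUnpickler(io.BytesIO(payload), {"0": [1.0, 2.0]}).load()
print(restored)
```

In this sketch the pickle stream alone is not enough: the loader also needs access to the data the references point to, which is consistent with the advice to load the single downloaded file rather than data.pkl on its own.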

bit-scientist commented 1 year ago

Well, I knew something wasn't right here 😞. The download link provided doesn't download any .pt file. For what it's worth, once you add the suffix string (?sv=2022-11-02&ss=b&srt=o&sp=r&se=2033-06-08T16:48:15Z&st=2023-06-08T08:48:15Z&spr=https&sig=a9VXrihTzbWyVfaIDlIT1Z0FoR1073VB0RLQUMuudD4%3D) given at Fine-tuning and evaluation, it downloads a zip archive with the following content:

- archive/
  - data.pkl
  - version (no extension)
  - data/
    - file1
    - file2
    - ...
    - file348

Is this expected, or is the link somehow broken? Thank you.

donglixp commented 1 year ago

You could try to load the zip file.
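For context, checkpoints written by recent PyTorch versions with the default zip-based serialization are themselves zip archives that hold a data.pkl, a version file and a data/ directory, which matches the listing above. The sketch below uses only the standard library to check whether a downloaded file has that layout before passing it to the loader unextracted. describe_checkpoint is a hypothetical helper, and the path in the usage note is only an example.

```python
import zipfile


def describe_checkpoint(path):
    """Classify a file by whether it looks like a zip-format checkpoint."""
    if not zipfile.is_zipfile(path):
        return "not a zip archive"
    with zipfile.ZipFile(path) as archive:
        names = archive.namelist()
    # A zip-format checkpoint keeps its pickle under <top-level folder>/data.pkl.
    has_pickle = any(
        name == "data.pkl" or name.endswith("/data.pkl") for name in names
    )
    if has_pickle:
        return "zip-format checkpoint"
    return "zip archive without data.pkl"
```

For example, describe_checkpoint('S:/unilm/trocr/models/trocr-small-handwritten.pt') returning "zip-format checkpoint" would suggest the file should be used as model_path as it is, without unpacking it first.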

bit-scientist commented 1 year ago

Thanks, @donglixp, I am following as you said. After some file modifications and package installations, I stumbled upon this:

Traceback (most recent call last):
  File "pic_inference.py", line 70, in <module>
    model, cfg, task, generator, bpe, img_transform, device = init(model_path, beam)
  File "pic_inference.py", line 13, in init
    model, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task(
  File "F:\project\unilm\trocr\fairseq\fairseq\checkpoint_utils.py", line 443, in load_model_ensemble_and_task
    task = tasks.setup_task(cfg.task, from_checkpoint=True)
  File "F:\project\unilm\trocr\fairseq\fairseq\tasks\__init__.py", line 47, in setup_task
    return task.setup_task(cfg, **kwargs)
  File "F:\project\unilm\trocr\task.py", line 110, in setup_task
    return cls(args, target_dict)
  File "F:\project\unilm\trocr\task.py", line 120, in __init__
    self.bpe = self.build_bpe(args)
  File "F:\project\unilm\trocr\fairseq\fairseq\tasks\fairseq_task.py", line 646, in build_bpe
    return encoders.build_bpe(args)
  File "F:\project\unilm\trocr\fairseq\fairseq\registry.py", line 65, in build_x
    return builder(cfg, *extra_args, **extra_kwargs)
  File "F:\project\unilm\trocr\fairseq\fairseq\data\encoders\sentencepiece_bpe.py", line 41, in __init__
    self.sp.Load(sentencepiece_model)
  File "C:\Users\user\anaconda3\envs\trocr\lib\site-packages\sentencepiece\__init__.py", line 905, in Load
    return self.LoadFromFile(model_file)
  File "C:\Users\user\anaconda3\envs\trocr\lib\site-packages\sentencepiece\__init__.py", line 310, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
RuntimeError: Internal: D:\a\sentencepiece\sentencepiece\src\sentencepiece_processor.cc(1102) [model_proto->ParseFromArray(serialized.data(), serialized.size())]

From what I have read, this might be caused by a wrong vocabulary file. I modified task.py to download 'https://layoutlm.blob.core.windows.net/trocr/dictionaries/unilm3.dict.txt?sv=2022-11-02&ss=b&srt=o&sp=r&se=2033-06-08T16:48:15Z&st=2023-06-08T08:48:15Z&spr=https&sig=a9VXrihTzbWyVfaIDlIT1Z0FoR1073VB0RLQUMuudD4%3D' (the URL had no suffix initially). Other than that, I didn't make any substantial changes to the code. What do you think?

Dod-o commented 1 year ago

@bit-scientist , I'm unsure whether this vocab file was corrupted during downloading. Have you tried re-downloading it manually?
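One quick way to test for a bad download is to look at the first bytes of the file. The sketch below rests on an assumption: that a failed request saved an XML or HTML error response in place of the real file, which is one common failure mode when a storage URL is requested without valid credentials. It does not detect other kinds of corruption. looks_like_error_page is a hypothetical helper name.

```python
def looks_like_error_page(path, probe_size=512):
    """Return True if the file starts like an XML or HTML response body."""
    with open(path, "rb") as handle:
        head = handle.read(probe_size)
    # Drop a UTF-8 byte order mark and leading whitespace before comparing.
    stripped = head.lstrip(b"\xef\xbb\xbf \t\r\n").lower()
    return stripped.startswith((b"<?xml", b"<error", b"<!doctype", b"<html"))
```

If this returns True for a downloaded vocabulary or SentencePiece model file, the file is most likely a server error message, and re-downloading with the full URL is worth trying before anything else.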

bit-scientist commented 1 year ago

Thanks for a quick reply, @Dod-o !

I just went ahead and checked the dictionary, and it sure looks corrupt to me.

What am I supposed to do now if the original file is broken? Could you drop it here if you happen to have one?

aasharma90 commented 10 months ago

Any solution yet to the problem?

[Update] I solved the problem by using the zip file directly as input and adding the URL suffix (provided in the README) to the download URLs in task.py.
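The URL part of the fix in the update above can be sketched as follows, assuming the download URLs in task.py are plain strings without a query part. with_query_suffix is a hypothetical helper, not a function from the repository, and the URL and suffix in the usage note are placeholders. The suffix is attached verbatim, without re-encoding, so percent-escapes such as %3D in a signature are preserved.

```python
from urllib.parse import urlsplit, urlunsplit


def with_query_suffix(url, suffix):
    """Attach a query suffix to a URL, leaving URLs that already have one alone."""
    parts = urlsplit(url)
    if parts.query:
        return url  # already carries a query string; do not append twice
    # Accept the suffix with or without its leading "?".
    query = suffix.lstrip("?&")
    return urlunsplit(parts._replace(query=query))
```

For example, with_query_suffix("https://example.com/dictionaries/vocab.txt", "?sv=1&sig=abc%3D") yields "https://example.com/dictionaries/vocab.txt?sv=1&sig=abc%3D".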