Open bit-scientist opened 1 year ago
Hi, @JingyeChen, could you take a look at this issue?
The downloaded weight trocr-small-handwritten.pt should be one file instead of a folder.
Well, I knew something wasn't right here 😞. The link provided for download doesn't download any .pt file. For what it's worth, once you add the suffix string (?sv=2022-11-02&ss=b&srt=o&sp=r&se=2033-06-08T16:48:15Z&st=2023-06-08T08:48:15Z&spr=https&sig=a9VXrihTzbWyVfaIDlIT1Z0FoR1073VB0RLQUMuudD4%3D) given at Fine-tuning and evaluation, it downloads a zip folder with the following content:
- archive/
  - data.pkl
  - version (no extension)
  - data/
    - file1
    - file2
    - ...
    - file348
Is this expected, or is the link somehow broken? Thank you.
You could try to load the zip file.
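For context, the listing above (data.pkl, version, and a data/ folder of numbered files) matches the zip-based layout that torch.save has written since PyTorch 1.6, so the downloaded zip is most likely the checkpoint itself and should not be extracted. Below is a minimal sketch for checking this without loading the model; `describe_checkpoint` is a hypothetical helper name, not part of fairseq or the TrOCR code.

```python
import zipfile
from pathlib import Path


def describe_checkpoint(path):
    """Report whether `path` looks like a zip-format torch.save checkpoint.

    Hypothetical helper for illustration; it only inspects the archive
    layout and never unpickles anything.
    """
    path = Path(path)
    if not path.is_file():
        return "not a file (an extracted folder cannot be loaded directly)"
    if not zipfile.is_zipfile(path):
        return "not a zip archive (possibly a legacy checkpoint or a bad download)"
    with zipfile.ZipFile(path) as zf:
        names = zf.namelist()
    # A zip-format checkpoint holds <root>/data.pkl plus tensor blobs in <root>/data/.
    has_pickle = any(n.endswith("/data.pkl") for n in names)
    has_tensors = any("/data/" in n for n in names)
    if has_pickle and has_tensors:
        return "zip-format checkpoint"
    return "zip archive, but not a recognizable checkpoint layout"
```

If this reports a zip-format checkpoint, the path of the downloaded file itself (not archive/data.pkl) is what should be assigned to model_path in pic_inference.py.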
Thanks, @donglixp, I did as you suggested. After some file modifications and package installations, I ran into this:
Traceback (most recent call last):
  File "pic_inference.py", line 70, in <module>
    model, cfg, task, generator, bpe, img_transform, device = init(model_path, beam)
  File "pic_inference.py", line 13, in init
    model, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task(
  File "F:\project\unilm\trocr\fairseq\fairseq\checkpoint_utils.py", line 443, in load_model_ensemble_and_task
    task = tasks.setup_task(cfg.task, from_checkpoint=True)
  File "F:\project\unilm\trocr\fairseq\fairseq\tasks\__init__.py", line 47, in setup_task
    return task.setup_task(cfg, **kwargs)
  File "F:\project\unilm\trocr\task.py", line 110, in setup_task
    return cls(args, target_dict)
  File "F:\project\unilm\trocr\task.py", line 120, in __init__
    self.bpe = self.build_bpe(args)
  File "F:\project\unilm\trocr\fairseq\fairseq\tasks\fairseq_task.py", line 646, in build_bpe
    return encoders.build_bpe(args)
  File "F:\project\unilm\trocr\fairseq\fairseq\registry.py", line 65, in build_x
    return builder(cfg, *extra_args, **extra_kwargs)
  File "F:\project\unilm\trocr\fairseq\fairseq\data\encoders\sentencepiece_bpe.py", line 41, in __init__
    self.sp.Load(sentencepiece_model)
  File "C:\Users\user\anaconda3\envs\trocr\lib\site-packages\sentencepiece\__init__.py", line 905, in Load
    return self.LoadFromFile(model_file)
  File "C:\Users\user\anaconda3\envs\trocr\lib\site-packages\sentencepiece\__init__.py", line 310, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
RuntimeError: Internal: D:\a\sentencepiece\sentencepiece\src\sentencepiece_processor.cc(1102) [model_proto->ParseFromArray(serialized.data(), serialized.size())]
From what I've read, this might be because of a wrong vocabulary file. I modified task.py to download 'https://layoutlm.blob.core.windows.net/trocr/dictionaries/unilm3.dict.txt?sv=2022-11-02&ss=b&srt=o&sp=r&se=2033-06-08T16:48:15Z&st=2023-06-08T08:48:15Z&spr=https&sig=a9VXrihTzbWyVfaIDlIT1Z0FoR1073VB0RLQUMuudD4%3D' (it had no suffix initially). Other than that, I didn't make any substantial changes to the code. What is your take on this?
@bit-scientist, I'm unsure whether this vocab file was corrupted during downloading. Have you tried re-downloading it manually?
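One quick way to check a manual download is to look at the first bytes of the file. The sketch below assumes, without having confirmed it for this storage account, that a request missing the required URL suffix returns a small XML or HTML error body that then gets saved under the expected filename; `sniff_download` is a hypothetical helper name.

```python
def sniff_download(path, n=512):
    """Classify a downloaded file by its first bytes.

    Hypothetical helper: it only detects the common case where an
    XML/HTML error response was saved in place of the real file.
    """
    with open(path, "rb") as f:
        head = f.read(n)
    if not head:
        return "empty"
    # Drop a UTF-8 byte-order mark and leading whitespace before comparing.
    start = head.lstrip(b"\xef\xbb\xbf \t\r\n").lower()
    if start.startswith((b"<?xml", b"<!doctype", b"<html", b"<error")):
        return "markup (likely an error response, not the real file)"
    return "no obvious error page"
```

A "markup" result for the dictionary or the SentencePiece model would point to the download URL rather than to a transfer glitch. A "no obvious error page" result does not prove the file is valid, only that it is not an obvious error body.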
Thanks for the quick reply, @Dod-o!
I just went ahead and checked the dictionary, it sure looks corrupt to me:
What am I supposed to do now if the original file is broken? Could you drop it here if you happen to have one?
Any solution yet to the problem?
[Update] I solved the problem by passing the zip file directly as the model path and adding the URL suffix (provided in the README) to the download URLs in task.py.
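As a rough illustration of the second half of that fix, the sketch below appends a query-string suffix to a download URL using only the standard library. `with_query_suffix` is a hypothetical helper, and the suffix in the usage example is a placeholder rather than the real token from the README; how task.py actually stores its URLs is not shown in this thread.

```python
from urllib.parse import urlsplit, urlunsplit


def with_query_suffix(url, suffix):
    """Return `url` with `suffix` appended to its query string.

    Hypothetical helper; `suffix` may be given with or without a leading '?'.
    Calling it on a URL that already ends with the suffix leaves it unchanged.
    """
    suffix = suffix.lstrip("?&")
    parts = urlsplit(url)
    if parts.query.endswith(suffix):
        return url
    query = f"{parts.query}&{suffix}" if parts.query else suffix
    return urlunsplit(parts._replace(query=query))


# Placeholder values for illustration only.
print(with_query_suffix("https://example.com/dictionaries/dict.txt", "?sv=PLACEHOLDER&sig=PLACEHOLDER"))
```

Building the URL this way avoids a doubled `?` when the same suffix is applied to several URLs, some of which may already carry it.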
Model I am using (UniLM, MiniLM, LayoutLM ...): TrOCR
I am using pic_inference.py to test with the trocr-small-handwritten.pt model. The provided download doesn't contain any file with a .pt extension. It has a folder called archive, which contains a data folder (with several files inside) plus the data.pkl and version files. I put the full path of data.pkl at https://github.com/microsoft/unilm/blob/5255d52de86dad642810f5849dd357769346c1d7/trocr/pic_inference.py#L66 like this: model_path = 'S:/unilm/trocr/models/archive/data.pkl'. Am I doing something wrong?