huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

save_pretrained on master results in tokenizers that cannot be loaded in v2.11 #5286

Closed vladislavkoz closed 4 years ago

vladislavkoz commented 4 years ago

🐛 Bug

Information

Model I am using: sshleifer/distilbart-*

The problem arises when using: tokenizer = AutoTokenizer.from_pretrained("sshleifer/distilbart-xsum-12-3") (it fails here)

To reproduce

Steps to reproduce the behavior:

tokenizer = AutoTokenizer.from_pretrained("sshleifer/distilbart-xsum-12-3")  # it fails here

image

Environment info

sshleifer commented 4 years ago

Would it be possible to run

pip install transformers --upgrade

and try again? We have fixed a lot of bugs since 2.8.0

Pasted tracebacks are much easier to read than screenshots.

vladislavkoz commented 4 years ago

I believe I was running it. Let me try one more time.


vladislavkoz commented 4 years ago

Just checked it twice. It looks like I had run it in another conda env. Here is another error message (with transformers==2.11.0): [image]

vladislavkoz commented 4 years ago

Would you like me to create another issue?

sshleifer commented 4 years ago

I can reproduce now, thanks. Will fix.

sshleifer commented 4 years ago

Issue is that code on master saves special_tokens_map.json as

{"bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>", "sep_token": "</s>", "pad_token": "<pad>", "cls_token": "<s>", "mask_token": {"content": "<mask>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true}}

and v2.11 cannot load this format (where mask_token is a dict).

I deleted special_tokens_map.json, which seems to fix things. (The original facebook/bart-large-cnn/ doesn't have a special_tokens_map.json.)

cc @thomwolf
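For readers who would rather keep the file than delete it, here is a minimal sketch of an alternative: rewrite each dict-valued entry in special_tokens_map.json as its plain "content" string. The helper name flatten_special_tokens is hypothetical and not part of transformers, and whether v2.11 then tokenizes identically is an assumption, since the lstrip/rstrip/normalized flags are discarded.

```python
import json
from pathlib import Path


def flatten_special_tokens(path):
    """Rewrite dict-valued special tokens as plain strings, in place.

    Hypothetical helper, not part of transformers. An entry such as
    {"content": "<mask>", "lstrip": true, ...} becomes just "<mask>";
    the extra flags (lstrip, rstrip, ...) are dropped.
    """
    path = Path(path)
    tokens = json.loads(path.read_text(encoding="utf-8"))
    flat = {
        name: value["content"] if isinstance(value, dict) else value
        for name, value in tokens.items()
    }
    path.write_text(json.dumps(flat), encoding="utf-8")
    return flat
```

Calling flatten_special_tokens on a local copy of the file leaves string entries such as bos_token untouched and turns mask_token into "<mask>".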

vladislavkoz commented 4 years ago

I'm able to create a tokenizer only for 'distilbart-xsum-12-1' and 'distilbart-xsum-9-6' (I still see the 'special token mask_token...' error for all other distilbart tokenizers). The model can be loaded only with these tokenizers. Then, at the summarization step, I get the following error: [image] It is reproducible with both PyTorch versions: 1.5.1 and https://download.pytorch.org/whl/cpu/torch-1.0.1.post2-cp37-cp37m-linux_x86_64.whl

sshleifer commented 4 years ago

Could I see the command you ran, plus more of the traceback, for example what the ids were? Or could you try to reproduce the issue in Google Colab?

vladislavkoz commented 4 years ago

1. When I try to create a tokenizer with the following command:

tokenizer = AutoTokenizer.from_pretrained("sshleifer/distilbart-cnn-12-6")

it fails with:

"special token {} has to be either str or AddedTokenFast but got: {}".format(key, type(value))
TypeError: special token mask_token has to be either str or AddedTokenFast but got: <class 'dict'>

2. Here is the code snippet for the other error message:

from transformers import AutoModelWithLMHead, AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("sshleifer/distilbart-xsum-9-6")
model = AutoModelWithLMHead.from_pretrained("sshleifer/distilbart-xsum-9-6")
self.summarizer = pipeline("summarization", model=model, tokenizer=tokenizer)
self.summarizer(text)  # It fails here.

The "text" variable contains the following (it worked with a simple piece of text from Wikipedia but fails with this one):

June 29, 2020 | Primary Care Collaborative July 22, 2020 | National Hispanic Medical Association July 29, 2020 | Business Health Coalition June 23, 2020 | The Hill June 25, 2020 June 24, 2020 | Primary Care Collaborative News Room Topic June 25, 2020 Primary care practices are projected to lose more than $65,000 in revenue per full-time physician in 2020, following drastic declines in office visits and fees for services from March to May during the COVID-19 pandemic, according to a... June 24, 2020 | Primary Care Collaborative In the wake of police brutality and pervasive racial injustice, which has spurred numerous, ongoing demonstrations across the country, the Primary Care Collaborative (PCC) reaffirms its commitment to racial equality. PCC underscores this... June 24, 2020 | Primary Care Collaborative On June 18, PCC joined many other leading organizations in the primary care community in an hour-long chat on Twitter about the current and future state of primary care during the coronavirus pandemic. If you missed the conversation, you... June 23, 2020 | The Hill Anthony Fauci, the nation's top infectious disease expert, said Tuesday that he thinks institutional racism has played a role in the disproportionate impact the coronavirus outbreak has had on the Black community in the U.S. "... June 20, 2020 WASHINGTON  —  Even as hospitals and physicians’ offices nationwide struggle to stay afloat amid the downturn caused by coronavirus, a small group of clinics is thriving, sustained by a model of care that many experts hope could reshape... June 18, 2020 | Primary Care Collaborative Check back weekly for the latest survey results and updates. For last week's data, see Week 13 Results. Who replied to the survey in Week 14? The Larry A. Green Center, the Primary Care Collaborative and 3rd Conversation are partnering... June 18, 2020 | PCPCC Press Release WASHINGTON (June 18, 2020) – The Larry A. 
Green Center, in collaboration with the Primary Care Collaborative (PCC) and 3rd Conversation, today released new data showing that more than 80 percent of primary care clinicians say professional... June 12, 2020 | The Commonwealth Fund On this episode of The Dose podcast, health policy expert Farzad Mostashari, M.D., who advises and supports hundreds of primary care practices across the country, explains what it will take to ensure doctors can continue caring for... June 12, 2020 | Primary Care Collaborative Six former leaders of the Centers for Medicare and Medicaid Services sent a joint letter June 10 to congressional leaders about the role of payment and regulatory flexibility in responding to the COVID-19 pandemic and addressing serious... June 12, 2020 | PR Newswire SAN FRANCISCO, June 12, 2020 -- Innovaccer, Inc., a leading healthcare technology company [and a PCC Executive Member] released its research-based report, titled "What COVID-19 Means to American Healthcare: Trends, Impacts, Predictions,... June 10, 2020 | Primary Care Collaborative Check back weekly for the latest survey results and updates. For last week's data, see Week 12 Results. Who replied to the survey in Week 13? A primary care clinician survey (weekly) and a patient survey (generally every other week) are... June 10, 2020 | PCPCC Press Release WASHINGTON (June 10, 2020) – The Larry A. Green Center, in collaboration with the Primary Care Collaborative (PCC) and 3rd Conversation, today released new data showing that a staggering 86 percent of Americans believe racism is impacting... June 4, 2020 | PCPCC Press Release WASHINGTON (June 4, 2020) – New survey data released today by the Larry A. Green Center, in collaboration with the Primary Care Collaborative (PCC) and 3rd Conversation, shows that over 70% of primary care patients are comfortable using... June 3, 2020 | Primary Care Collaborative Check back weekly for the latest survey results and updates. 
For last week's data, see Week 11 Results. Who replied to the survey in Week 12? A primary care clinician survey (weekly), and a patient survey (generally every other week) are... June 1, 2020 | The Hill The COVID-19 pandemic has unmasked many weaknesses in our public health and health care systems. But the outbreak also has accelerated, within weeks, useful health care innovations that would have normally taken years to develop. A strong... June 1, 2020 The week of June 1 is a time of national advocacy for primary care. The PCC and many other organizations are part of this campaign, called #saveprimarycare. We are reaching out to Congress and the administration to call for dedicated... May 27, 2020 | Primary Care Collaborative Check back weekly for the latest survey results and updates. For last week's data, see Week 10 Results. Who replied to the surveys? The Larry A. Green Center is now fielding two separate surveys: one to primary care clinicians, and a... May 27, 2020 WASHINGTON (May 27, 2020) – In new data released today by the Larry A. Green Center, in collaboration with 3rd Conversation and the Primary Care Collaborative (PCC), Americans report feeling “panicked, upset, or heartbroken” at the... May 21, 2020 WASHINGTON, May 21, 2020—In a new survey of primary care clinicians and their response to the COVID-19 pandemic, conducted May 15-18, more than half (55%) fear they are unprepared for the next wave of the pandemic due to high stress among... May 21, 2020 | Primary Care Collaborative Check back weekly for the latest survey results and updates. For last week's data, see Week 9 Results. Who replied to the survey in Week 10? The week 10 sample was much smaller (736) than last week’s sample and of relatively different... Pages

vladislavkoz commented 4 years ago

Any updates?

vladislavkoz commented 4 years ago

Looks like I've found the cause of this error:

"special token {} has to be either str or AddedTokenFast but got: {}".format(key, type(value))
TypeError: special token mask_token has to be either str or AddedTokenFast but got: <class 'dict'>

That issue is fixed. The problem was in my local cache, so now it works. But summarization still fails on the text above.
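The traceback for the remaining summarization failure is not preserved in this thread, so its cause is unknown here. One possibility worth ruling out, stated purely as an assumption, is that the pasted text is longer than the model's maximum input length. A minimal sketch of a way to test that is to split the text into smaller pieces and summarize each piece separately. The helper chunk_words and its default of 400 words are hypothetical, and a word count is only a rough stand-in for the tokenizer's token count.

```python
def chunk_words(text, max_words=400):
    """Split text into consecutive chunks of at most max_words words.

    Hypothetical helper: words are whitespace-separated, which only
    approximates how a subword tokenizer would count the input.
    """
    if max_words < 1:
        raise ValueError("max_words must be positive")
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]
```

If each chunk summarizes without error while the full text does not, input length is a likely suspect; if a single chunk still fails, the cause lies elsewhere.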