huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

No max_length set on huawei-noah/TinyBERT_General_4L_312D/config.json #12052

Closed alexcombessie closed 3 years ago

alexcombessie commented 3 years ago

Environment info

Who can help

@patrickvonplaten @JetRunner

Information

Model I am using: huawei-noah/TinyBERT_General_4L_312D

The problem arises when using:

```python
import gzip
import json  # was missing from the original snippet; needed by parse() and for dumping embeddings

import pandas as pd
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('huawei-noah/TinyBERT_General_4L_312D')

def parse(path):
    g = gzip.open(path, 'rb')
    for line in g:
        yield json.loads(line)

def getDF(path):
    i = 0
    df = {}
    for d in parse(path):
        df[i] = d
        i += 1
    return pd.DataFrame.from_dict(df, orient='index')

# See download link below
local_path_to_review_data = "/Users/alexandrecombessie/Downloads/Software_5.json.gz"
df = getDF(local_path_to_review_data)

df["review_text_full_embeddings"] = [
    json.dumps(x.tolist()) for x in model.encode(df["reviewText"].astype(str))
]
```



The task I am working on is:
* [x] my own task or dataset: (give details below)
- Amazon review dataset sample (http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Software_5.json.gz)

## To reproduce

Steps to reproduce the behavior:

See script above

## Expected behavior

A `max_length` should be set in the model `config.json` so that the tokenizer applies truncation (which is my expected behavior).
See https://huggingface.co/huawei-noah/TinyBERT_General_4L_312D/blob/main/config.json

I could do it myself, but I cannot tell what the right length to set is.
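As a workaround while the hub config lacks the field, the limit can be set on the tokenizer after loading (e.g. `tokenizer.model_max_length = 128` following `AutoTokenizer.from_pretrained(...)`), assuming 128 turns out to be the right value. Conceptually, truncation then simply clips each token sequence to that limit; a toy sketch, with `truncate_ids` a hypothetical stand-in for the tokenizer's actual truncation step:

```python
# Toy sketch of what truncation=True does once a max length is known:
# token ids past the limit are simply dropped.
def truncate_ids(ids, model_max_length=128):
    # Hypothetical stand-in for the tokenizer's truncation behavior.
    return ids[:model_max_length]

ids = list(range(300))  # pretend: token ids of a very long review
print(len(truncate_ids(ids)))  # 128
```

Without any limit set, nothing is clipped, which is why overly long reviews reach the model unchanged.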
alexcombessie commented 3 years ago

Hi @patrickvonplaten @JetRunner,

Apologies for following up, I know it's a busy time.

Would you have some time to look into this issue?

Thanks,

Alex

JetRunner commented 3 years ago

Hi Alex, I think the right thing to do is to look up max_len from the TinyBERT paper. Do you know what that setting is?

alexcombessie commented 3 years ago

> Hi Alex, I think the right thing to do is to look up max_len from the TinyBERT paper. Do you know what that setting is?

Yeah, you are right. The paper seems to indicate 128 for the general distillation.

*(Screenshot from the TinyBERT paper showing the max sequence length setting for general distillation.)*

I will reach out to the authors, because they mention another length of 64 for task-specific distillation. I just want to be sure which one is used by the model hosted on Hugging Face.

As a side-note, it would be really useful (at least to me) to have some automated checks and/or feedback system on the model hub.
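For context on why the paper lists both 128 and 64: self-attention cost grows roughly quadratically with sequence length, so halving max_len cuts the attention FLOPs by about 4x. A back-of-the-envelope sketch (the formula is a simplification for illustration, not the paper's exact accounting):

```python
# Rough attention-FLOPs estimate per layer: the L x L score matrix and the
# weighted sum each cost about seq_len * seq_len * hidden multiply-adds.
def attention_flops(seq_len, hidden=312):
    # 312 is the hidden size of TinyBERT_General_4L_312D
    return 2 * seq_len * seq_len * hidden

ratio = attention_flops(128) / attention_flops(64)
print(ratio)  # 4.0
```

This is presumably what the author means below by setting a smaller max_len "to save the computing resources".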

alexcombessie commented 3 years ago

Hi @JetRunner,

I got the following answer from the author (Xiaoqi Jiao):

> The max_len of TinyBERT is 128, but if the max sequence length of your downstream task is less than max_len, you may set max_len to a smaller value like 64 to save computing resources.

Should I add `max_length: 128` on the model hub? Happy to take this small PR directly.

Cheers,

Alex
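For reference, tokenizer truncation limits are typically read from `model_max_length` in `tokenizer_config.json` rather than from `config.json`; a minimal fragment, assuming the 128 value from the author's reply:

```json
{
  "model_max_length": 128
}
```

Whether the hub prefers this key or a `max_length` entry in `config.json` is a judgment call for the maintainers.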

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.