alexcombessie closed this issue 3 years ago
Hi @patrickvonplaten @JetRunner,
Apologies for following up, I know it's a busy time.
Would you have some time to look into this issue?
Thanks,
Alex
Hi Alex, I think the right thing to do is to look up max_len from the TinyBERT paper. Do you know what that setting is?
Yeah, you are right. The paper seems to indicate 128 for the general distillation.
I will reach out to the authors because they mention another length of 64 for task-specific distillation. I just want to be sure which one is used by the model hosted on Hugging Face.
As a side note, it would be really useful (at least to me) to have some automated checks and/or a feedback system on the model hub.
Hi @JetRunner,
I got the following answer from the author (Xiaoqi Jiao):
The max_len of TinyBERT is 128, but if the max sequence length of your downstream task is less than max_len, you may set max_len to a small value like 64 to save the computing resources.
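To make the author's answer concrete, here is a minimal sketch of what a max_len cap does to a tokenized input. It is only an illustration, not the TinyBERT or transformers implementation: `truncate_to_max_len` is a hypothetical helper, the token IDs are fake, and the [CLS]/[SEP] IDs 101 and 102 are assumed from the standard BERT uncased vocabulary.

```python
from typing import List

# Assumed special-token IDs (they match the standard BERT uncased vocabulary,
# but check the tokenizer you actually use).
CLS_ID = 101
SEP_ID = 102


def truncate_to_max_len(token_ids: List[int], max_len: int = 128) -> List[int]:
    """Hypothetical helper: cap a single-sentence input at max_len tokens,
    counting the [CLS] and [SEP] tokens added around the content."""
    if max_len < 2:
        raise ValueError("max_len must leave room for [CLS] and [SEP]")
    content = token_ids[: max_len - 2]  # keep room for the two special tokens
    return [CLS_ID] + content + [SEP_ID]


long_review = list(range(1000, 1300))   # 300 fake word-piece IDs
short_review = list(range(1000, 1010))  # 10 fake word-piece IDs

print(len(truncate_to_max_len(long_review)))      # 128
print(len(truncate_to_max_len(long_review, 64)))  # 64
print(len(truncate_to_max_len(short_review)))     # 12
```

Inputs shorter than the cap are left alone, which is why a smaller value such as 64 only saves compute when the downstream texts are short anyway.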
Should I add max_length: 128 on the model hub? Happy to take this small PR directly.
Cheers,
Alex
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Environment info
transformers version: 3.0.2
Who can help
@patrickvonplaten @JetRunner
Information
Model I am using: huawei-noah/TinyBERT_General_4L_312D
The problem arises when using:
import gzip
import json  # needed by parse() and for serializing the embeddings

import pandas as pd
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('huawei-noah/TinyBERT_General_4L_312D')

# Yield one parsed JSON record per line of the gzipped file
def parse(path):
    g = gzip.open(path, 'rb')
    for l in g:
        yield json.loads(l)

# Collect all records into a DataFrame indexed by row number
def getDF(path):
    i = 0
    df = {}
    for d in parse(path):
        df[i] = d
        i += 1
    return pd.DataFrame.from_dict(df, orient='index')

local_path_to_review_data = "/Users/alexandrecombessie/Downloads/Software_5.json.gz"  # See download link below
df = getDF(local_path_to_review_data)

# Encode every review and store each embedding as a JSON string
df["review_text_full_embeddings"] = [
    json.dumps(x.tolist())
    for x in model.encode(df["reviewText"].astype(str))
]