microsoft / CodeBERT


LongCoder encoder #276

Open boitavoi opened 1 year ago

boitavoi commented 1 year ago

Hey!

The LongCoder work is super impressive and important, thank you for that. I was curious: is it possible to use the LongCoder encoder part to obtain embeddings for long (>2048 tokens) source code snippets? Currently I use UniXcoder for my research, but I need to handle longer code snippets. Is it possible to use LongCoder for embeddings somehow?

guoday commented 1 year ago

It's hard to use the LongCoder encoder part to obtain embeddings for long source code snippets, because I modified the code so that it only supports decoder-only mode.

If you need it, I can provide a script to convert the UniXcoder model into a Longformer model, so that you can use a Longformer model initialized from UniXcoder to handle longer code snippets.
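
To give a sense of what the conversion involves, here is a rough, hypothetical sketch (this is not the attached convert.py; the source checkpoint name, the 2048-token target length, and the 512 attention window are all assumptions). The idea is to copy the UniXcoder weights into a freshly initialized Longformer, tile the position embeddings out to the longer maximum length, and seed the global-attention projections from the local ones:

from transformers import AutoModel, AutoTokenizer, LongformerConfig, LongformerModel

# Hypothetical sketch only; the attached convert.py may differ.
src = AutoModel.from_pretrained("microsoft/unixcoder-base")  # RoBERTa-style encoder
src_cfg = src.config
src_state = src.state_dict()

max_pos = 2050  # 2048 tokens + the 2 offset slots RoBERTa-style position embeddings reserve
config = LongformerConfig(
    vocab_size=src_cfg.vocab_size,
    hidden_size=src_cfg.hidden_size,
    num_hidden_layers=src_cfg.num_hidden_layers,
    num_attention_heads=src_cfg.num_attention_heads,
    intermediate_size=src_cfg.intermediate_size,
    max_position_embeddings=max_pos,
    type_vocab_size=src_cfg.type_vocab_size,
    attention_window=512,  # assumed sliding-window size
)
long_model = LongformerModel(config)
long_state = long_model.state_dict()

# 1) Copy every tensor whose name and shape already match.
for name, tensor in src_state.items():
    if name in long_state and long_state[name].shape == tensor.shape:
        long_state[name] = tensor.clone()

# 2) Tile the old position embeddings to fill the longer position table.
old_pos = src_state["embeddings.position_embeddings.weight"]
new_pos = long_state["embeddings.position_embeddings.weight"]
new_pos[:2] = old_pos[:2]
k = 2
while k < new_pos.size(0):
    chunk = min(old_pos.size(0) - 2, new_pos.size(0) - k)
    new_pos[k:k + chunk] = old_pos[2:2 + chunk]
    k += chunk

# 3) Seed Longformer's global-attention projections from the local ones.
for i in range(config.num_hidden_layers):
    for proj in ("query", "key", "value"):
        base = f"encoder.layer.{i}.attention.self.{proj}"
        long_state[f"{base}_global.weight"] = src_state[f"{base}.weight"].clone()
        long_state[f"{base}_global.bias"] = src_state[f"{base}.bias"].clone()

long_model.load_state_dict(long_state)
long_model.save_pretrained("/path-to-models/longformer-unixcoder")
AutoTokenizer.from_pretrained("microsoft/unixcoder-base").save_pretrained("/path-to-models/longformer-unixcoder")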

guoday commented 1 year ago

convert.py.zip

boitavoi commented 1 year ago


Thank you! This is indeed helpful :) Does it require additional training/fine-tuning, or can I use the Longformer after conversion as is?

guoday commented 1 year ago

After conversion, you can use the Longformer directly without additional pre-training. However, it needs to be fine-tuned on downstream tasks.
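
For reference, a minimal sketch of pulling an embedding out of the converted model (assuming the converted checkpoint loads as a plain LongformerModel and that mean-pooling the last hidden state is an acceptable pooling choice; both are assumptions, not something the conversion itself dictates):

import torch
from transformers import LongformerModel, RobertaTokenizer

path = "/path-to-models/longformer-unixcoder"  # hypothetical output path of the conversion script
tokenizer = RobertaTokenizer.from_pretrained(path)
model = LongformerModel.from_pretrained(path)
model.eval()

code = "def f(a, b):\n    return a if a > b else b"
inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=2048)

with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the token representations into a single vector per snippet.
mask = inputs["attention_mask"].unsqueeze(-1)
embedding = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(embedding.shape)  # (1, hidden_size)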

SasCezar commented 1 year ago

@guoday I used the convert script; however, I have issues using the converted model.

This is what I tried:

from transformers import LongformerConfig, RobertaTokenizer, pipeline
from models.longcoder import LongcoderModel

config = LongformerConfig.from_pretrained('/path-to-models/longformer-unixcoder')
tokenizer = RobertaTokenizer.from_pretrained('/path-to-models/longformer-unixcoder')
longcoder = LongcoderModel.from_pretrained('/path-to-models/longformer-unixcoder', config=config)

embedding = pipeline('feature-extraction', model=longcoder, tokenizer=tokenizer)

func = ("def f(a,b): if a>b: return a else return b")
embedding(func)

Then I get the following error:

AttributeError: 'LongformerConfig' object has no attribute 'is_decoder_only'

If I don't use pipeline() and instead switch to the following code:

tokens=tokenizer.tokenize("return maximum value")
longcoder(tokens)

I get this error:

AttributeError: 'str' object has no attribute 'size'
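
For what it's worth, the two errors seem to have different causes. The first one suggests that LongcoderModel reads a custom is_decoder_only flag that the stock LongformerConfig does not define, which is consistent with guoday's note above that the code was modified to support only decoder-only mode. The second one most likely happens because tokenizer.tokenize() returns token strings rather than the integer ID tensors the forward pass expects. A hedged sketch of a call that avoids both issues by loading the converted checkpoint as a plain Longformer (the path is hypothetical, as above):

import torch
from transformers import LongformerModel, RobertaTokenizer

path = "/path-to-models/longformer-unixcoder"  # hypothetical converted checkpoint
tokenizer = RobertaTokenizer.from_pretrained(path)
model = LongformerModel.from_pretrained(path)  # plain Longformer instead of LongcoderModel

# Build ID tensors rather than passing raw token strings.
tokens = tokenizer.tokenize("return maximum value")
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])

with torch.no_grad():
    outputs = model(input_ids)
print(outputs.last_hidden_state.shape)  # (1, num_tokens, hidden_size)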