Open boitavoi opened 1 year ago
It's hard to use the LongCoder encoder part for obtaining embeddings for long source code snippets, because we modified the code so that it only supports decoder-only mode.
If you need it, I can provide you with a script to convert the UniXcoder model into a Longformer model, so that you can use a Longformer model initialized from UniXcoder to handle longer code snippets.
Thank you! This is indeed helpful :) Does it require additional training/fine-tuning, or can I use the Longformer after conversion as-is?
After conversion, you can use the Longformer directly without additional pre-training. However, it needs to be fine-tuned on downstream tasks.
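For readers who want the gist of such a conversion before running the actual script: the usual trick is to copy the original model's weights into a Longformer and extend the position-embedding table by tiling the original rows until the new, longer maximum length is covered. The sketch below illustrates only that tiling step in plain Python (no ML dependencies); all names are illustrative and this is not the repository's conversion script.

```python
# Minimal sketch of extending a position-embedding table from 512 to 2048
# positions by tiling, a common initialization when converting a
# RoBERTa-style model (e.g. UniXcoder) to a Longformer. Illustrative only.

def extend_position_embeddings(pos_emb, new_max_positions):
    """Repeat the original position-embedding rows until the table
    has new_max_positions rows."""
    extended = []
    while len(extended) < new_max_positions:
        remaining = new_max_positions - len(extended)
        extended.extend(pos_emb[:remaining])
    return extended

# toy "embedding table": 512 positions, each a 4-dim vector
old = [[float(i)] * 4 for i in range(512)]
new = extend_position_embeddings(old, 2048)
assert len(new) == 2048
assert new[512] == old[0]  # second block repeats the first
```

In the real conversion the same idea is applied to the model's `position_embeddings` weight tensor, and the attention weights are additionally copied into the Longformer's global-attention parameters.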
@guoday I used the conversion script; however, I have issues using the converted model.
This is what I tried:
from transformers import LongformerConfig, RobertaTokenizer, pipeline
from models.longcoder import LongcoderModel
config = LongformerConfig.from_pretrained('/path-to-models/longformer-unixcoder')
tokenizer = RobertaTokenizer.from_pretrained('/path-to-models/longformer-unixcoder')
longcoder = LongcoderModel.from_pretrained('/path-to-models/longformer-unixcoder', config=config)
embedding = pipeline('feature-extraction', model=longcoder, tokenizer=tokenizer)
func = ("def f(a,b): if a>b: return a else return b")
embedding(func)
Then I get the following error:
AttributeError: 'LongformerConfig' object has no attribute 'is_decoder_only'
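The traceback suggests that the custom LongcoderModel reads an `is_decoder_only` flag from its config, an attribute the stock LongformerConfig never defines. A hedged workaround (this is an assumption about the flag's meaning, not something confirmed by the repo) is to set the attribute on the config before constructing the model. Illustrated with a stand-in config class so the snippet runs without transformers installed:

```python
# Stand-in for LongformerConfig; real code would set the attribute on the
# config object returned by LongformerConfig.from_pretrained(...).
class StandInConfig:
    pass

config = StandInConfig()
config.is_decoder_only = False  # hypothetical flag: request encoder-style use
assert config.is_decoder_only is False
```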
If I don't use pipeline() and instead switch to the following code:
tokens=tokenizer.tokenize("return maximum value")
longcoder(tokens)
I get this error:
AttributeError: 'str' object has no attribute 'size'
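This second error occurs because `tokenizer.tokenize()` returns a list of token strings, while a transformers model's forward pass expects a tensor of token ids (something with a `.size()` method), typically obtained via `tokenizer("return maximum value", return_tensors="pt")`. A dependency-free sketch of the mismatch, using a stand-in tensor class (all names here are illustrative):

```python
# Stand-in for a token-id tensor: real code would pass
# tokenizer(text, return_tensors="pt")["input_ids"] to the model.
class FakeIdTensor:
    def __init__(self, ids):
        self.ids = ids

    def size(self):
        # mimic torch's Tensor.size(): (batch, sequence_length)
        return (1, len(self.ids))

def forward(input_ids):
    # transformers models typically call input_ids.size() almost immediately,
    # which is where raw token strings blow up with an AttributeError
    batch, seq_len = input_ids.size()
    return batch, seq_len

tokens = ["return", "maximum", "value"]  # what tokenizer.tokenize() returns
try:
    forward(tokens)
except AttributeError as err:
    print("passing raw tokens fails:", err)

print(forward(FakeIdTensor([3, 17, 42])))  # → (1, 3)
```

In short: tokenize, convert to ids, wrap in a tensor (with a batch dimension), and then call the model.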
Hey!
The LongCoder work is super impressive and important, thank you for that! I was curious: is it possible to use the LongCoder encoder part to obtain embeddings for long (>2048 tokens) source code snippets? Currently I use UniXcoder for my research, but I need to handle longer code snippets. Is it possible to use LongCoder for embeddings somehow?