tech-srl / code2vec

TensorFlow code for the neural network presented in the paper: "code2vec: Learning Distributed Representations of Code"
https://code2vec.org
MIT License
1.1k stars, 286 forks

How to change the code to work for multi-label classification? #138

Open allomy opened 2 years ago

allomy commented 2 years ago

I'm trying to use code2vec for multi-label classification, where one sample can belong to several labels. Could you suggest how the model should be changed?

Thank you in advance for your help!

urialon commented 2 years ago

Hi @allomy , Thank you for your interest in code2vec!

I think that you can change the loss here: https://github.com/tech-srl/code2vec/blob/master/tensorflow_model.py#L228 from the standard cross entropy to sigmoid cross entropy: https://www.tensorflow.org/api_docs/python/tf/compat/v1/nn/sigmoid_cross_entropy_with_logits
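
To make the difference concrete, here is a minimal pure-Python sketch of what sigmoid cross entropy computes. It uses the numerically stable formula given in the TensorFlow documentation linked above, `max(x, 0) - x * z + log(1 + exp(-|x|))`, where `x` is a logit and `z` is a 0/1 label. The helper names (`sigmoid_cross_entropy`, `multi_label_loss`) are illustrative only and are not part of code2vec; in the model itself you would presumably call the TensorFlow op with a multi-hot `labels` tensor instead.

```python
import math


def sigmoid_cross_entropy(labels, logits):
    """Per-label loss for one example (hypothetical helper, not code2vec API).

    labels: list of 0.0/1.0 values, one per target in the vocabulary (multi-hot).
    logits: list of raw scores, same length as labels.
    Uses the stable form max(x, 0) - x * z + log(1 + exp(-|x|)).
    """
    return [
        max(x, 0.0) - x * z + math.log1p(math.exp(-abs(x)))
        for z, x in zip(labels, logits)
    ]


def multi_label_loss(labels, logits):
    """Sum the independent per-label losses into one scalar for the example."""
    return sum(sigmoid_cross_entropy(labels, logits))


# Each label is scored independently, so several labels can be "on" at once,
# unlike softmax cross entropy, which assumes exactly one correct class.
example_loss = multi_label_loss([1.0, 0.0, 1.0], [2.0, -1.0, 0.5])
```

At prediction time, the matching change would be to apply a sigmoid to each logit and keep every label above a threshold (0.5 is a common default, but that is an assumption to tune), rather than taking the top softmax entry.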

But you will also need to change the pipeline to support reading multi-labeled examples. Follow the variable target_index here: https://github.com/tech-srl/code2vec/blob/master/path_context_reader.py and modify it to get a list of targets for every example.
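
As a rough sketch of the data side, the snippet below parses one example line into a multi-hot target vector. Everything about the format here is an assumption: it supposes each line is a target field followed by space-separated contexts, and that multiple labels are joined in the target field with `;` (a made-up separator; the real reader would need the equivalent change in TensorFlow ops). The names `parse_targets`, `to_multi_hot`, and `parse_example` are hypothetical and do not exist in path_context_reader.py.

```python
def parse_targets(target_field, target_to_index, sep=";"):
    """Map a field such as 'sort;io' to sorted vocabulary indices.

    Labels missing from the vocabulary are skipped (an assumption; one could
    also map them to an out-of-vocabulary index).
    """
    names = [name for name in target_field.split(sep) if name]
    return sorted({target_to_index[n] for n in names if n in target_to_index})


def to_multi_hot(indices, vocab_size):
    """Turn a list of label indices into a 0/1 vector of length vocab_size."""
    vector = [0.0] * vocab_size
    for index in indices:
        vector[index] = 1.0
    return vector


def parse_example(line, target_to_index, sep=";"):
    """Split one line into (multi-hot targets, list of context strings)."""
    target_field, *contexts = line.rstrip("\n").split(" ")
    indices = parse_targets(target_field, target_to_index, sep)
    return to_multi_hot(indices, len(target_to_index)), contexts


# Hypothetical three-label vocabulary and one two-label example.
vocab = {"sort": 0, "search": 1, "io": 2}
targets, contexts = parse_example("sort;io a,p1,b c,p2,d", vocab)
```

The resulting multi-hot vector is the shape of label that a sigmoid cross entropy loss expects, replacing the single target_index per example.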

Best, Uri

allomy commented 2 years ago

Hi @urialon , thank you for your quick response. I'll try it soon.

allomy commented 2 years ago

Hi @urialon , sorry for the delayed response. I have tried to modify the code related to target_index, but I got lost in the code... Could you give more information about modifying it to get a list of targets for every sample? Thank you in advance for your help.

urialon commented 2 years ago

Hi @allomy , Actually, it might be easiest for you to use https://code2seq.org/ . It predicts a sequence of labels rather than a set of labels, but it may either be a good approximation or be easier to adapt for multi-label classification (you would only change the loss computation, not the entire data reading pipeline).

Best, Uri

allomy commented 2 years ago

Thank you @urialon , I will take a look at code2seq.