tech-srl / code2seq

Code for the model presented in the paper: "code2seq: Generating Sequences from Structured Representations of Code"
http://code2seq.org
MIT License

Tensor proto whose content is larger than 2GB #102

Closed Sohaib90 closed 3 years ago

Sohaib90 commented 3 years ago

Hello,

Thank you so much for the repo. I know I have already asked some questions, and I appreciate you answering them so promptly. Really appreciate the help.

Anyway, I have a custom dataset comprising around 1.3 million training instances. When I run the train.sh script, it raises a ValueError that I cannot explain. Can you help me? The error trace is given below.

2021-08-23 12:48:03.381177: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2021-08-23 12:48:03.778873: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-08-23 12:48:03.778918: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2021-08-23 12:48:03.778928: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2021-08-23 12:48:03.779059: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10426 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, compute capability: 6.1)

Dictionaries loaded.
Loaded subtoken vocab. size: 186279
Loaded target word vocab. size: 26350
Loaded nodes vocab. size: 46899960
Created model
Starting training
Traceback (most recent call last):
  File "code2seq.py", line 39, in <module>
    model.train()
  File "/local/home/aru7rng/masterthesis/code2seq/model.py", line 79, in train
    config=self.config)
  File "/local/home/aru7rng/masterthesis/code2seq/reader.py", line 41, in __init__
    self.node_table = Reader.get_node_table(node_to_index)
  File "/local/home/aru7rng/masterthesis/code2seq/reader.py", line 60, in get_node_table
    cls.class_node_table = cls.initialize_hash_map(node_to_index, node_to_index[Common.UNK])
  File "/local/home/aru7rng/masterthesis/code2seq/reader.py", line 68, in initialize_hash_map
    value_dtype=tf.int32), default_value)
  File "/opt/dl/anaconda3/envs/tf112/lib/python3.6/site-packages/tensorflow/python/ops/lookup_ops.py", line 346, in __init__
    self._keys = ops.convert_to_tensor(keys, dtype=key_dtype, name="keys")
  File "/opt/dl/anaconda3/envs/tf112/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1050, in convert_to_tensor
    as_ref=False)
  File "/opt/dl/anaconda3/envs/tf112/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1146, in internal_convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "/opt/dl/anaconda3/envs/tf112/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py", line 229, in _constant_tensor_conversion_function
    return constant(v, dtype=dtype, name=name)
  File "/opt/dl/anaconda3/envs/tf112/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py", line 214, in constant
    name=name).outputs[0]
  File "/opt/dl/anaconda3/envs/tf112/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/opt/dl/anaconda3/envs/tf112/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
    op_def=op_def)
  File "/opt/dl/anaconda3/envs/tf112/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1713, in __init__
    "Cannot create a tensor proto whose content is larger than 2GB.")
ValueError: Cannot create a tensor proto whose content is larger than 2GB.

urialon commented 3 years ago

Hi @Sohaib90 ,
It looks like the nodes vocabulary is huge: 46,899,960 entries. Typically, there should be only ~100-1000 node types. Creating such a huge vocabulary requires a huge tensor to hold the node embeddings.
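
For a rough sense of the scale, here is a back-of-envelope sketch. The 128-dimensional embedding size and 50-byte average node-type string are assumptions for illustration only, not values taken from the repo:

```python
# Back-of-envelope estimate of why a 46.9M-entry node vocabulary blows up.
# EMBEDDING_DIM and AVG_KEY_BYTES are assumed for illustration; they are not
# taken from the code2seq configuration.
NUM_NODE_TYPES_HUGE = 46_899_960     # reported in the log above
NUM_NODE_TYPES_TYPICAL = 1_000       # roughly the expected order of magnitude
EMBEDDING_DIM = 128                  # assumed
BYTES_PER_FLOAT = 4
AVG_KEY_BYTES = 50                   # assumed average node-type string length

def gib(n_bytes):
    return n_bytes / 2**30

# Embedding matrix: vocab_size x embedding_dim float32 values.
print(f"huge node embeddings:    {gib(NUM_NODE_TYPES_HUGE * EMBEDDING_DIM * BYTES_PER_FLOAT):.1f} GiB")
print(f"typical node embeddings: {gib(NUM_NODE_TYPES_TYPICAL * EMBEDDING_DIM * BYTES_PER_FLOAT) * 1024:.2f} MiB")

# The lookup table's keys are also baked into the graph as a constant tensor,
# and TensorFlow caps any single tensor proto at 2 GB -- hence the ValueError
# raised from lookup_ops in the traceback.
print(f"lookup-table key tensor: {gib(NUM_NODE_TYPES_HUGE * AVG_KEY_BYTES):.1f} GiB")
```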

Do you have an idea of how this happened?

Sohaib90 commented 3 years ago

At first I was using a smaller dataset. When I doubled the size of the training dataset, this error started appearing. Is there a way to restrict the node vocabulary?
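
One possible workaround would be to cap the node dictionary at the most frequent types during preprocessing and let everything else fall back to UNK. This is only a sketch, not an existing option in the repo; the `build_capped_node_vocab` helper and the cap of 1,000 are hypothetical:

```python
from collections import Counter

# Hypothetical cap on the node vocabulary: keep only the K most frequent
# node types; rare types would then be looked up as the UNK default that
# the reader already uses. Not an existing flag in code2seq's preprocessing.
MAX_NODE_VOCAB = 1_000

def build_capped_node_vocab(node_counter: Counter, max_size: int = MAX_NODE_VOCAB):
    """node_counter maps node-type string -> occurrence count in the raw data."""
    node_to_index = {}
    for node, _count in node_counter.most_common(max_size):
        node_to_index[node] = len(node_to_index) + 1  # index 0 left free for UNK/padding
    return node_to_index
```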

urialon commented 3 years ago

Did you modify the JavaExtractor, or did you use ours as-is?

Sohaib90 commented 3 years ago

I am using code2seq for C code. For preprocessing I use https://github.com/AmeerHajAli/code2vec_c for parsing and creating the dataset. The same strategy works when the dataset amounts to around 10 GB of training data, but not for larger datasets.

urialon commented 3 years ago

I am afraid there is a bug in the C extractor that creates too many kinds of node types and makes the node vocabulary explode.
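
As a quick sanity check, you could count how many distinct node types actually appear in the extracted training file. The sketch below assumes the extractor emits code2seq's usual raw context format, i.e. `<label> <left_token>,<node1|node2|...>,<right_token> ...`; adjust the parsing if the C extractor's output differs:

```python
import sys
from collections import Counter

# Diagnostic: count distinct AST node types in an extracted raw file.
# Assumes each line is "<label> <left_token>,<node1|node2|...>,<right_token> ...".
node_types = Counter()
with open(sys.argv[1]) as f:
    for line in f:
        for context in line.strip().split(' ')[1:]:  # skip the target label
            pieces = context.split(',')
            if len(pieces) != 3:
                continue                              # skip malformed contexts
            node_types.update(pieces[1].split('|'))

print(f"{len(node_types)} distinct node types")
print("most common:", node_types.most_common(20))
```

If this reports millions of distinct node types rather than a few hundred, the extractor is likely leaking identifiers or literal values into the path nodes.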

Maybe @AmeerHajAli has an idea why this happens?

urialon commented 3 years ago

Closing due to inactivity. I cannot support projects that are not mine...

Avv22 commented 2 years ago

@Sohaib90 I have the same issue with Python. I have 16 GB of RAM but am still unable to train the model, even though the train and test datasets together are only about 1 GB. What would you recommend, please?

urialon commented 2 years ago

Hi @Sohaib90 , We just released a model that performs better than OpenAI's Codex for C.

https://arxiv.org/pdf/2202.13169.pdf https://github.com/VHellendoorn/Code-LMs

Best, Uri