nicolay-r / ARElight

Granular Viewer of Sentiments Between Entities in Massively Large Documents and Collections of Texts, powered by AREkit
https://link.springer.com/chapter/10.1007/978-3-031-56069-9_23
MIT License
37 stars 2 forks

`infer_bert` -- raises "Attempt to free invalid pointer" on loading and inferring tensorflow model #49

Closed edloginova closed 1 year ago

edloginova commented 2 years ago

When running

python infer_bert.py --from-files ../data/texts-inosmi-rus/e1.txt \
    --labels-count 3 \
    --terms-per-context 50 \
    --tokens-per-context 128 \
    --text-b-type nli_m \
    -o output/brat_inference_output

I get

...
INFO:tensorflow:Restoring parameters from /content/ARElight/data/models/ra-20-srubert-large-neut-nli-pretrained-3l-finetuned/ra-20-srubert-large-neut-nli-pretrained-3l
  0%|                                                                                       | 0/1253 [00:00<?, ?opins/s]src/tcmalloc.cc:283] Attempt to free invalid pointer 0x107e00000 

and the process freezes.

Google colab, Python 3.7, tensorflow 1.15.0, numpy 1.21.6, deeppavlov 0.11.0, arekit installed from git. Tried restarting the runtime, doesn't help.

nicolay-r commented 2 years ago

Hi @edloginova, and thanks for reporting this. First, I tried to reproduce it on my side in the following environment; I hope it helps you spot something that addresses your issue. Configuration: Ubuntu 18.04 (Linux Mint), Python 3.6.9, pip-freeze-list, NVidia-GTX-1060 (6GB)

2022-11-21 14:58:32.53 INFO in 'deeppavlov.core.models.tf_model'['tf_model'] at line 51: [loading model from /media/nicolay/96ed6537-b931-4f7e-8ac4-8407527ddbf9/proj/REmarker/data/models/ra-20-srubert-large-neut-nli-pretrained-3l-finetuned/ra-20-srubert-large-neut-nli-pretrained-3l]
INFO:deeppavlov.core.models.tf_model:[loading model from /media/nicolay/96ed6537-b931-4f7e-8ac4-8407527ddbf9/proj/REmarker/data/models/ra-20-srubert-large-neut-nli-pretrained-3l-finetuned/ra-20-srubert-large-neut-nli-pretrained-3l]
INFO:tensorflow:Restoring parameters from /media/nicolay/96ed6537-b931-4f7e-8ac4-8407527ddbf9/proj/REmarker/data/models/ra-20-srubert-large-neut-nli-pretrained-3l-finetuned/ra-20-srubert-large-neut-nli-pretrained-3l
100%|██████████████████████████████████████████████████████████████████████████| 1253/1253 [00:01<00:00, 1004.98opins/s]
Calculating rows count (sample [DataType.Test]): 0rows [00:00, ?rows/s]2022-11-21 14:58:38.559124: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
Calculating rows count (sample [DataType.Test]): 44rows [00:02, 16.33rows/s]
INFO:arekit.common.data.storages.base:Filling with blank rows: 44
INFO:arekit.common.data.storages.base:Completed!
sample [DataType.Test]: 100%|███████████████████████████████████████████████████████████| 44/44 [00:01<00:00, 43.48it/s]
INFO:arekit.common.data.input.writers.tsv:Saving... (44, 10): /media/nicolay/96ed6537-b931-4f7e-8ac4-8407527ddbf9/proj/REmarker/examples/args/../../_output/sample-test-0.tsv.gz
INFO:arekit.common.data.input.writers.tsv:Saving completed!

Writing output: 44rows [00:01, 35.72rows/s]
1it [00:00, 119.02it/s]

Got this result.zip

REmarker is just an old name for the project.

This looks like a TensorFlow issue related to how deeppavlov allocates memory. Am I right that it tries to allocate the memory on the GPU device, and that the amount of GPU memory is sufficient (6GB+)? My assumption is that deeppavlov is trying to restore the model on the CPU, which may take a while, if that is possible at all.

edloginova commented 2 years ago

It's running on a Tesla T4 with 15 109 MiB, I am afraid. I reinstalled tensorflow to match your version, but it doesn't seem to fix things. Shall I ask the deeppavlov community whether it is on their side?

nicolay-r commented 2 years ago

No, I don't think you should, since this is not caused by an issue in their code but by something more low-level, i.e. tensorflow in combination with Colab. You're not the only one who has encountered this... I will take a closer look and post my advice here once I find something.

You may also check GPU availability from tensorflow and with nvidia-smi to make sure everything is OK with the GPU on the notebook side.
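A minimal sketch of such a check, meant to be run in the notebook environment. It assumes TensorFlow 1.15, where `tf.test.is_gpu_available()` exists (it is deprecated in TensorFlow 2.x), and that `nvidia-smi` is on the PATH. The helper name `check_gpu` is hypothetical and not part of ARElight.

```python
import shutil
import subprocess


def check_gpu():
    """Collect GPU visibility as seen by TensorFlow and by nvidia-smi.

    Returns a dict with two entries; a value of None means the
    corresponding tool is not available in this environment.
    """
    report = {"tensorflow_gpu": None, "nvidia_smi": None}

    # TensorFlow side: is_gpu_available() is present in TF 1.15
    # (assumption: a TF version that still ships this function).
    try:
        import tensorflow as tf
        report["tensorflow_gpu"] = bool(tf.test.is_gpu_available())
    except (ImportError, AttributeError):
        pass

    # Driver side: run nvidia-smi only if the binary is on PATH.
    smi = shutil.which("nvidia-smi")
    if smi is not None:
        result = subprocess.run([smi], stdout=subprocess.PIPE,
                                stderr=subprocess.STDOUT,
                                universal_newlines=True)
        report["nvidia_smi"] = result.stdout

    return report


if __name__ == "__main__":
    for name, value in check_gpu().items():
        print(name, "->", value)
```

If `tensorflow_gpu` is True and the `nvidia-smi` output lists the device with free memory, the GPU is visible from both sides, and the cause is probably elsewhere.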

edloginova commented 2 years ago

Yes, I checked it with nvidia-smi, it was there and free, and the tf command returns True. Thank you for quick responses! <3

nicolay-r commented 2 years ago

Well, I would love to assist you more, but I have run out of other solutions for now.

This might also be down to newer cuDNN and CUDA drivers, I think. Mine are relatively old: NVIDIA-SMI 455.23.05, Driver Version: 455.23.05, CUDA Version: 11.1, cudnn 7.6.5

nicolay-r commented 2 years ago

@edloginova, as an alternative solution I have reproduced the same pretrained state in a torch/transformer-based form, created with the OpenNRE framework. Once you have time, or if you are already familiar with it, you may go with OpenNRE.

Consider the following rel2id label conversion required by OpenNRE:

{"0": 0, "1": 1, "2": 2}

nicolay-r commented 1 year ago

@edloginova, may I kindly ask whether you finally sorted it out, or is it still a problem?

edloginova commented 1 year ago

I'm afraid I haven't figured it out yet :( Your help would be greatly appreciated, if you have time!

nicolay-r commented 1 year ago

Okay, thanks for letting me know! I will have a look in my spare time. By the way, I've noticed you use Python 3.7, which in my personal experience might be incompatible with tensorflow (the backend for deeppavlov). That was the reason I went with 3.6.9. Here is the routine I use in Colab to switch to 3.6.9 among the available alternatives:

!sudo update-alternatives --config python3
!wget https://bootstrap.pypa.io/pip/3.6/get-pip.py
!python get-pip.py

I will keep you updated once I get a chance to test it.
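To confirm which interpreter actually runs after switching, a small check like the following may help. It is only a sketch: the expected version `(3, 6)` reflects the setup described in this thread, and the helper name `matches_expected_python` is hypothetical.

```python
import sys


def matches_expected_python(expected=(3, 6), current=None):
    """Return True if the major.minor version equals `expected`."""
    if current is None:
        current = sys.version_info
    return tuple(current[:2]) == tuple(expected)


if __name__ == "__main__":
    print("Running Python", sys.version.split()[0])
    if not matches_expected_python():
        print("Warning: not running Python 3.6; the setup in this "
              "thread was tested under 3.6.9.")
```

The interpreter behind the notebook kernel and the `python` invoked with `!` may differ, so it is safer to save this as a script and run it with `!python` than to paste it into a cell.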

edloginova commented 1 year ago

I switched to 3.6, but I'm afraid it's still the same error :(

nicolay-r commented 1 year ago

@edloginova, please try prefixing the command with sudo; this should very likely help you out:

!sudo python infer_bert.py ...

(DeepPavlov and AREkit keep their data at /root/.deeppavlov and /root/.arekit; it appears to be a problem with reading AREkit resources on my side)
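One way to confirm this diagnosis before reaching for sudo is to check whether the current user can read those directories. The sketch below assumes the two paths quoted above; the helper name `check_readable` is hypothetical.

```python
import os


def check_readable(paths):
    """Return {path: bool} for the current user.

    False means the path is missing or not readable; os.access
    cannot tell these two cases apart.
    """
    return {path: os.access(path, os.R_OK) for path in paths}


if __name__ == "__main__":
    # Paths taken from the comment above (assumed resource locations).
    resource_dirs = ["/root/.deeppavlov", "/root/.arekit"]
    print("effective uid:", os.geteuid())
    for path, ok in check_readable(resource_dirs).items():
        print(path, "->", "readable" if ok else "missing or not readable")
```

If both directories show up as not readable for a non-root user but as readable under sudo, that is consistent with the permission explanation.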

PS: wishing you all the best and even greater advances in 2023 🎉🎄

edloginova commented 1 year ago

@nicolay-r IT WORKS! Thank you so much :))) I should have thought of that myself... Thank you for your patience! Best wishes to you, too! You're doing amazing work :)

nicolay-r commented 1 year ago

@edloginova, thanks for your interest, feedback, and kind wishes! Don't hesitate to contact me if you have other questions ✨