ieeta-pt / BioNExt

A Biomedical Novelty Relation Extractor System
MIT License

How to work around a case of multiple GPUs. #1

Closed. yayamamo closed this issue 1 month ago.

yayamamo commented 1 month ago

Hi, thank you for developing a nice tool. I tried it in a multi-GPU environment, and it failed with AssertionError: CRF is not prepared to run as SPMD. Could you tell me how to work around this?

Thanks.

T-Almeida commented 1 month ago

Hi @yayamamo,

Thank you for your interest in our tool.

I recall personally adding that assertion to prevent future complications with the CRF when running as Single Program Multiple Data (SPMD). This is the default behaviour (in PyTorch and Transformers) when a system has more than one GPU available. I don't remember the exact problem, but I believe it was related to a gather operation at inference time. (Training works, although it's also disabled because it would crash during the validation step of the training loop.) Given that our setup only uses one GPU, I decided not to invest effort in solving this issue.

Let's explore some workarounds:

The trick to make it work is to restrict the program to a single GPU. A simple way to accomplish this is to set the CUDA_VISIBLE_DEVICES environment variable, which controls which devices the program is allowed to see and use.

Example:

$ CUDA_VISIBLE_DEVICES=0 python main.py PMID:36516090 -t
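
If it is more convenient, the same restriction can be applied from inside Python, as long as it happens before torch initializes CUDA. A minimal sketch (not code from our repository):

import os

# Restrict this process to GPU 0. This must run before `import torch`
# (or at least before the first CUDA call), otherwise it has no effect.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch

print(torch.cuda.device_count())  # should now report 1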

The downside is that it will only use one GPU (the GPU with ID 0). To use multiple GPUs simultaneously, you need to manually launch multiple instances of the script, each pinned to a different available GPU:

$ CUDA_VISIBLE_DEVICES=0 python main.py PMID:36516090 -t
$ CUDA_VISIBLE_DEVICES=1 python main.py PMID:36516091 -t
$ CUDA_VISIBLE_DEVICES=2 python main.py PMID:36516092 -t 
$ CUDA_VISIBLE_DEVICES=3 python main.py PMID:36516093 -t

Here, you're running four instances of the tool simultaneously. For example, if you have four documents to annotate, each instance can annotate one of them in parallel. In terms of performance, this should be similar to using SPMD. The only drawback is that the user has to distribute the input data and gather the outputs themselves.
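
If you want to script that distribution, something along these lines should work. This is just a sketch: the PMIDs are placeholders and you would still need to merge the outputs yourself.

import os
import subprocess

# Placeholder document IDs, one per available GPU; replace with your own workload.
pmids = ["PMID:36516090", "PMID:36516091", "PMID:36516092", "PMID:36516093"]

procs = []
for gpu_id, pmid in enumerate(pmids):
    # Each instance only sees one GPU and runs the tagger on one document.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))
    procs.append(subprocess.Popen(["python", "main.py", pmid, "-t"], env=env))

for p in procs:
    p.wait()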

Note that the commands above use the -t flag, so they only run the tagger.

I hope this helps. Let me know if you need any further clarification.

EDIT:

I'm actually considering changing the assert to a warning and automatically forcing the program to continue on a single GPU. I believe this would make more sense, since it would simplify the user's life.
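
Roughly what I have in mind (just the idea, not the final implementation):

import warnings
import torch

# Instead of raising AssertionError when several GPUs are visible,
# warn and pin everything to a single device.
if torch.cuda.device_count() > 1:
    warnings.warn(
        "CRF is not prepared to run as SPMD; continuing on a single GPU (cuda:0)."
    )
    device = torch.device("cuda:0")  # skip DataParallel and use one GPU only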

yayamamo commented 1 month ago

Thank you for telling me. It worked, but I'm not sure whether everything ran correctly, since I got the following messages.

$ CUDA_VISIBLE_DEVICES=4 python main.py dataset/bc8_biored_task2_test.json
...
broken
writing to outputs/extractor/bc8_biored_task2_test.json
$ CUDA_VISIBLE_DEVICES=4 python main.py PMID:36516090 -t
...
[Tagger]
Running
Token indices sequence length is longer than the specified maximum sequence length for this model (566 > 512). Running this sequence through the model will result in indexing errors
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.42it/s]
DOCUMENTS: 1
$ CUDA_VISIBLE_DEVICES=4 python main.py PMID:36516093 -t
...
[Tagger]
Running
Traceback (most recent call last):
  File "/home/yayamamo/git/BioNExt/main.py", line 116, in <module>
    input_file = module.run(input_file)
                 ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yayamamo/git/BioNExt/src/tagger/__init__.py", line 86, in run
    test_ds = load_inference_data(testset, tokenizer=self.tokenizer, context_size=self.config.context_size)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yayamamo/git/BioNExt/src/data.py", line 43, in load_inference_data
    test_data = load_data(file_path)
                ^^^^^^^^^^^^^^^^^^^^
  File "/home/yayamamo/git/BioNExt/src/data.py", line 30, in load_data
    entities = i['passages'][0]['annotations']+i['passages'][1]['annotations']
                                               ~~~~~~~~~~~~~^^^
IndexError: list index out of range
T-Almeida commented 1 month ago

Upon initial inspection, everything appears to be in order. However, I'll rerun bc8_biored_task2_test.json to confirm that the "broken" message is expected (I'll post an update here). I don't recall the exact details, but it may be caused by some fake documents that lack annotations. The test set is composed of both gold-standard documents and these fake documents (added to make it impossible to annotate the test set by hand during the challenge).

The message "Token indices sequence length is longer than the specified maximum sequence length for this model (566 > 512). Running this sequence through the model will result in indexing errors" is just a warning from the Hugging Face tokenizer and can be safely ignored. We never run sequences longer than 512 tokens through the model, as we manually chunk the input before feeding it to the model.
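
To illustrate what I mean by chunking (this is not our exact code, just the general idea; the real pipeline is driven by the context_size setting and also keeps track of offsets so predictions can be mapped back to the document):

def chunk_token_ids(token_ids, max_len=512, stride=64):
    # Split a long token sequence into overlapping windows of at most max_len,
    # so nothing longer than the model's limit is ever fed to it.
    chunks = []
    start = 0
    while start < len(token_ids):
        chunks.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break
        start += max_len - stride
    return chunks

# A 566-token document (like the one in the warning) becomes two windows,
# each short enough for the 512-token model.
print([len(c) for c in chunk_token_ids(list(range(566)))])  # [512, 118]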

Regarding the following crash:

Traceback (most recent call last):
  File "/home/yayamamo/git/BioNExt/main.py", line 116, in <module>
    input_file = module.run(input_file)
                 ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yayamamo/git/BioNExt/src/tagger/__init__.py", line 86, in run
    test_ds = load_inference_data(testset, tokenizer=self.tokenizer, context_size=self.config.context_size)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yayamamo/git/BioNExt/src/data.py", line 43, in load_inference_data
    test_data = load_data(file_path)
                ^^^^^^^^^^^^^^^^^^^^
  File "/home/yayamamo/git/BioNExt/src/data.py", line 30, in load_data
    entities = i['passages'][0]['annotations']+i['passages'][1]['annotations']
                                               ~~~~~~~~~~~~~^^^
IndexError: list index out of range

We always expect documents in BioC format containing both a title and an abstract. However, the document with PMID:36516093 (https://pubmed.ncbi.nlm.nih.gov/36516093/) contains only a title and no abstract, which is what causes the crash. I only tested PMID:36516090; the other PMIDs were simply incremented by one to serve as examples, and I did not verify that those documents were valid. It was unfortunate that PMID:36516093 happened to lack an abstract.

When I have time, I will add a check that verifies whether the document contains both a title and an abstract before running.
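
The check itself should be simple, something along these lines (a sketch, not the final code; the field names follow the BioC JSON layout already used in src/data.py):

def has_title_and_abstract(doc):
    # BioC documents are expected to carry the title as passages[0] and the
    # abstract as passages[1]; anything shorter would crash load_data.
    return len(doc.get("passages", [])) >= 2

# Documents without an abstract could then be skipped (or reported)
# instead of raising an IndexError, e.g.:
#   docs = [d for d in collection["documents"] if has_title_and_abstract(d)]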

Hope this helps.

T-Almeida commented 1 month ago

Hi @yayamamo,

The output you received is as expected. If you'd like to compare it with mine, I've included it at the end. Additionally, the "broken" message appears because one of the documents has no relations. It can be safely ignored and has been fixed in commit fe4e833f0cb8c4936827ccef00aaaca8259f003d.

$ CUDA_VISIBLE_DEVICES=0 python main.py testset/bc8_biored_task2_test.json
/data/home/tiagomeloalmeida/BioNExt/virtual-venv/lib/python3.11/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
/data/home/tiagomeloalmeida/BioNExt/virtual-venv/lib/python3.11/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Token indices sequence length is longer than the specified maximum sequence length for this model (547 > 512). Running this sequence through the model will result in indexing errors
Found all of the dataset files
Found all of the kb Cellosaurus files
Found all of the kb CTD-diseases files
Found all of the kb dbSNP (tmVar3) files
Found all of the kb MeSH files
Found all of the kb NCBI-Gene files
Found all of the NCBI-Taxonomy files
Pipeline built
[Tagger, Linker, Extractor]
Running
100%|██████████| 1335/1335 [05:38<00:00,  3.94it/s]
DOCUMENTS: 10000
load training data and kbases for taxonomy
100%|██████████| 10000/10000 [00:00<00:00, 199775.38it/s]
number of predicted species: 28737
load training data and kbases for chemicals
100%|██████████| 10000/10000 [04:48<00:00, 34.63it/s]
number of predicted chemicals: 57419
load training data and kbases for diseases
100%|██████████| 10000/10000 [02:52<00:00, 58.00it/s]
number of predicted diseases: 58692
load training data and kbases for genes
100%|██████████| 48880/48880 [02:01<00:00, 401.20it/s]
Loaded genes emb files dict_keys(['9606', '10116', '11676', '12814', '3702', '10090', '7955'])
100%|██████████| 10000/10000 [02:35<00:00, 64.50it/s]
number of predicted gene: 73424
load training data and kbases for seq_variant
23it [00:01, 14.59it/s]
100%|██████████| 10000/10000 [26:20<00:00,  6.33it/s]
/data/home/tiagomeloalmeida/BioNExt/virtual-venv/lib/python3.11/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
number of predicted seq var: 5961
load training data and kbases for cells
running normalization
100%|██████████| 10000/10000 [00:13<00:00, 752.47it/s]
Token indices sequence length is longer than the specified maximum sequence length for this model (575 > 512). Running this sequence through the model will result in indexing errors
number of predicted cell: 1543 3257
number of total ann: 302406 after clean: 221525
load_data_for_inference 305143
333983 128
2610it [2:04:34,  2.86s/it]
broken
writing to outputs/extractor/bc8_biored_task2_test.json
T-Almeida commented 1 month ago

Closing the issue due to inactivity; feel free to reopen.