labstructbioinf / pLM-BLAST

Detection of remote homology by comparison of protein language model representations
https://toolkit.tuebingen.mpg.de/tools/plmblast
MIT License

UnboundLocalError: local variable 'outfile' referenced before assignment #19

Closed · Citugulia40 closed 1 year ago

Citugulia40 commented 1 year ago

Hi,

I have run pLM-BLAST on a FASTA file containing 1.7 million sequences to create embeddings of my query sequences:

python embeddings.py start uniprot.fasta uniprot.pt

and I am getting the error below. I have tried this on two servers and get the same result each time.

Traceback (most recent call last):
  File "/home/ccitu/miniconda3/envs/plmblast/lib/python3.9/site-packages/torch/serialization.py", line 441, in save
    _save(obj, opened_zipfile, pickle_module, pickle_protocol)
  File "/home/ccitu/miniconda3/envs/plmblast/lib/python3.9/site-packages/torch/serialization.py", line 668, in _save
    zip_file.write_record(name, storage.data_ptr(), num_bytes)
RuntimeError: [enforce fail at inline_container.cc:471] . PytorchStreamWriter failed writing file data/0: file write failed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data2/ccitu/plmblast/../software/pLM-BLAST/embeddings.py", line 47, in <module>
    main_prottrans(df, args, batch_iter)
  File "/data2/ccitu/software/pLM-BLAST/embedders/prottrans.py", line 96, in main_prottrans
    torch.save(embeddings_filt, batch_id_filename)
  File "/home/ccitu/miniconda3/envs/plmblast/lib/python3.9/site-packages/torch/serialization.py", line 442, in save
    return
  File "/home/ccitu/miniconda3/envs/plmblast/lib/python3.9/site-packages/torch/serialization.py", line 291, in __exit__
    self.file_like.write_end_of_file()
RuntimeError: [enforce fail at inline_container.cc:337] . unexpected pos 2048 vs 1965

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data2/ccitu/plmblast/../software/pLM-BLAST/embeddings.py", line 52, in <module>
    capture_checkpoint(args, exception_msg = e)
  File "/data2/ccitu/software/pLM-BLAST/embedders/checkpoint.py", line 131, in capture_checkpoint
    with open(outfile, 'wt') as fp:
UnboundLocalError: local variable 'outfile' referenced before assignment

Please help me solve this.

Thanks in advance
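
For context, the final UnboundLocalError in the traceback above is a secondary failure: the checkpoint handler crashes because outfile is assigned only on a code path that was never reached, which masks the original torch.save write error. A minimal sketch of this general Python pattern (illustrative only; the condition and file name are hypothetical, not the actual checkpoint.py code):

def capture_checkpoint_sketch(args, exception_msg=None):
    # 'outfile' is bound only inside this branch; if the condition is
    # False, the name is never assigned in the local scope.
    if getattr(args, 'asdir', False):  # hypothetical condition
        outfile = 'checkpoint.json'    # hypothetical file name
    # Referencing 'outfile' here raises UnboundLocalError whenever the
    # branch above was skipped -- exactly the error reported.
    with open(outfile, 'wt') as fp:
        fp.write(str(exception_msg))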

MiTRonGTE commented 1 year ago

Hi, try this:

python embeddings.py start uniprot.fasta uniprot.pt -embedder pt --gpu -bs 0

With so many sequences, GPU is a must (--gpu). It is also possible to use multiple GPUs with the -proc X option. Have you considered reducing the redundancy in this set?

If you expect sequences longer than 1000 aa, use -t {max}, where max is the length of the longest sequence in your dataset. Avoid sequences longer than ~1000-1500 residues. In fact, the problem you describe is related to insufficient GPU memory when processing a very long sequence.

-bs 0 turns on adaptive batch size mode and should be used by default. However, it is better to process the long sequences separately with adaptive batch size mode off (-bs {n}, where n is a fixed batch size, e.g. -bs 32 or less); see the sketch below.
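
A minimal way to separate the long sequences, assuming Biopython is available (the 1000-residue threshold and the file names are placeholders, not part of pLM-BLAST):

from Bio import SeqIO  # assumes Biopython is installed

MAX_LEN = 1000  # following the ~1000-1500 aa guidance above

# Stream the input so 1.7M sequences never need to fit in memory at once.
with open('uniprot_short.fasta', 'w') as short_fh, \
     open('uniprot_long.fasta', 'w') as long_fh:
    for record in SeqIO.parse('uniprot.fasta', 'fasta'):
        out = short_fh if len(record.seq) <= MAX_LEN else long_fh
        SeqIO.write(record, out, 'fasta')

The short file can then be embedded with -bs 0, and the long one with a small fixed batch size (e.g. -bs 32 or less).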

Citugulia40 commented 1 year ago

Thank you so much. I was able to run it after splitting the input into smaller files.

staszekdh commented 1 year ago

Also, remember to sort your input sequences by length (in the input FASTA file).
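
A minimal sketch of such a sort with Biopython (file names are placeholders; note that this loads the whole file into memory, so apply it to the already-split chunks):

from Bio import SeqIO

records = sorted(SeqIO.parse('chunk.fasta', 'fasta'), key=lambda r: len(r.seq))
SeqIO.write(records, 'chunk_sorted.fasta', 'fasta')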

Citugulia40 commented 1 year ago

I have another issue when building a database that contains 215 sequences.

I have run

python embeddings.py start database.fasta database -embedder pt -bs 0 --asdir

And then:

python scripts/dbtofile.py database

But it is giving me an error:

Traceback (most recent call last):
  File "/data2/ccitu/plmblast/../software/pLM-BLAST/scripts/dbtofile.py", line 8, in <module>
    dbfile = pd.read_csv(dbpath + '.csv')
  File "/home/ccitu/miniconda3/envs/plmblast/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 912, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/ccitu/miniconda3/envs/plmblast/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 577, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/home/ccitu/miniconda3/envs/plmblast/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1407, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "/home/ccitu/miniconda3/envs/plmblast/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1661, in _make_engine
    self.handles = get_handle(
  File "/home/ccitu/miniconda3/envs/plmblast/lib/python3.9/site-packages/pandas/io/common.py", line 859, in get_handle
    handle = open(
FileNotFoundError: [Errno 2] No such file or directory: 'database.csv'

Am I doing something wrong?

Thanks

staszekdh commented 1 year ago

Could you try now?

Citugulia40 commented 1 year ago

Still getting the same error. Should I clone the package again?

staszekdh commented 1 year ago

Yes, you need to do a git pull to get the latest changes.

Citugulia40 commented 1 year ago

I have successfully created emb.64 in the database directory.

Then I am getting another error when running:

python pLM-BLAST/scripts/plmblast.py database myseq0 output.csv --use_chunks

Here, database is the database that I have just created, and myseq0 is the myseq0.pt which I created with:

python embeddings.py myseq0.fasta myseq.pt

Error:

Traceback (most recent call last):
  File "/data2/ccitu/plmblast/../software/pLM-BLAST/scripts/plmblast.py", line 122, in <module>
    db_df = read_input_file(db_index)
  File "/data2/ccitu/plmblast/../software/pLM-BLAST/embedders/base.py", line 340, in read_input_file
    raise FileNotFoundError(f'''could not find input file for{file}expecting one of the extensions .csv, .p, .pkl, .fas or .fasta''')
FileNotFoundError: could not find input file fordatabaseexpecting one of the extensions .csv, .p, .pkl, .fas or .fasta
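
Judging from the error message alone, the script expects an index file named after the database path plus one of the listed extensions to sit next to the embeddings; here that would be database.csv. A sketch of an equivalent lookup (an assumption based on the message, not the actual read_input_file implementation):

from pathlib import Path

def find_db_index(dbpath: str) -> Path:
    # Try the extensions listed in the error message, in order.
    for ext in ('.csv', '.p', '.pkl', '.fas', '.fasta'):
        candidate = Path(dbpath + ext)
        if candidate.exists():
            return candidate
    raise FileNotFoundError(
        f'could not find input file for {dbpath}; expecting one of the '
        f'extensions .csv, .p, .pkl, .fas or .fasta')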

Argusmocny commented 1 year ago

This will probably fix it: https://github.com/labstructbioinf/pLM-BLAST/issues/23#issuecomment-1777805451