Closed: Citugulia40 closed this issue 1 year ago.
Hi, try this:
python embeddings.py start uniprot.fasta uniprot.pt -embedder pt --gpu -bs 0
With so many sequences, a GPU is a must (`--gpu`). It is also possible to use multiple GPUs with the `-proc X` option. Have you considered reducing the redundancy in this set?
If you expect sequences longer than 1000 aa, use `-t {max}`, where max is the length of the longest sequence in your dataset, but avoid sequences longer than ~1000-1500 aa. In fact, the problem you describe is caused by insufficient GPU memory when processing a very long sequence.
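To find that maximum without loading the whole file, a single pass over the FASTA is enough. A minimal sketch (the filename uniprot.fasta comes from the command above):

```python
# One pass over the FASTA: count the sequences and find the longest,
# i.e. the value to pass via -t {max}.
n_seqs, cur_len, max_len = 0, 0, 0
with open('uniprot.fasta') as fh:
    for line in fh:
        if line.startswith('>'):
            n_seqs += 1
            max_len = max(max_len, cur_len)
            cur_len = 0
        else:
            cur_len += len(line.strip())
max_len = max(max_len, cur_len)
print(f'{n_seqs} sequences, longest: {max_len} aa')
```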
`-bs 0` turns on adaptive batch size mode and should be used by default. However, it is better to process the long sequences separately with adaptive batch size mode off, i.e. `-bs {n}`, where n is a fixed batch size (e.g. `-bs 32` or less).
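For example, the input can be partitioned into short and long sets and each embedded with its own batch-size setting. A minimal sketch, assuming Biopython is installed (the 1000-residue cutoff follows the advice above):

```python
from Bio import SeqIO

# Split the input so that long sequences can be embedded separately
# with a small fixed batch size (-bs), as suggested above.
CUTOFF = 1000  # residues; adjust to your data and GPU memory
short, long_ = [], []
for rec in SeqIO.parse('uniprot.fasta', 'fasta'):
    (long_ if len(rec.seq) > CUTOFF else short).append(rec)
SeqIO.write(short, 'uniprot_short.fasta', 'fasta')
SeqIO.write(long_, 'uniprot_long.fasta', 'fasta')
```

The short set can then be run with `-bs 0` and the long set with a small fixed batch size such as `-bs 32` or less.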
Thank you so much. I am able to run it after splitting the input into smaller files.
Also, remember to sort your input sequences by length (in the input FASTA file).
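A minimal sketch of both steps, sorting by length and then writing fixed-size chunks, again assuming Biopython (the chunk size of 50,000 is an arbitrary illustration; note that this loads all records into memory):

```python
from Bio import SeqIO

# Sort all records by sequence length, then write them out in chunks
# so that each file can be embedded independently.
records = sorted(SeqIO.parse('uniprot.fasta', 'fasta'), key=lambda r: len(r.seq))
CHUNK = 50_000  # sequences per output file; tune to your setup
for i in range(0, len(records), CHUNK):
    SeqIO.write(records[i:i + CHUNK], f'uniprot_part{i // CHUNK:03d}.fasta', 'fasta')
```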
I have another issue building a database that contains 215 sequences.
I have run
python embeddings.py start database.fasta database -embedder pt -bs 0 --asdir
and then
python scripts/dbtofile.py database
but it is giving me an error:
Traceback (most recent call last):
  File "/data2/ccitu/plmblast/../software/pLM-BLAST/scripts/dbtofile.py", line 8, in <module>
    dbfile = pd.read_csv(dbpath + '.csv')
  File "/home/ccitu/miniconda3/envs/plmblast/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 912, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/ccitu/miniconda3/envs/plmblast/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 577, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/home/ccitu/miniconda3/envs/plmblast/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1407, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "/home/ccitu/miniconda3/envs/plmblast/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1661, in _make_engine
    self.handles = get_handle(
  File "/home/ccitu/miniconda3/envs/plmblast/lib/python3.9/site-packages/pandas/io/common.py", line 859, in get_handle
    handle = open(
FileNotFoundError: [Errno 2] No such file or directory: 'database.csv'
Am I doing something wrong?
Thanks
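One quick way to see what the embedding step actually produced is to list the files next to the database prefix; a minimal diagnostic sketch (the filenames mirror the traceback above):

```python
import os

# List everything written for the 'database' prefix and check for the
# index file that dbtofile.py evidently expects.
for name in sorted(os.listdir('.')):
    if name.startswith('database'):
        print(name)
print('index present:', os.path.exists('database.csv'))
```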
Could you try now?
Still getting the same error. Should I clone the package again?
Yes, you need to do a `git pull` to get the latest changes.
I have successfully created emb.64 in the database directory.
Then I am getting another error with
python pLM-BLAST/scripts/plmblast.py database myseq0 output.csv --use_chunks
where database is the database I have just created and myseq0 is the myseq0.pt that I created with
python embeddings.py myseq0.fasta myseq.pt
Error:
Traceback (most recent call last):
  File "/data2/ccitu/plmblast/../software/pLM-BLAST/scripts/plmblast.py", line 122, in <module>
    db_df = read_input_file(db_index)
  File "/data2/ccitu/plmblast/../software/pLM-BLAST/embedders/base.py", line 340, in read_input_file
    raise FileNotFoundError(f'''could not find input file for
{file}expecting one of the extensions .csv, .p, .pkl, .fas or .fasta''')
FileNotFoundError: could not find input file for
databaseexpecting one of the extensions .csv, .p, .pkl, .fas or .fasta
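The message means the database argument is treated as a path prefix, and an index file with one of the listed extensions must sit next to the embeddings. A rough, hypothetical reconstruction of that lookup (not the actual read_input_file from embedders/base.py):

```python
import os

KNOWN_EXTS = ('.csv', '.p', '.pkl', '.fas', '.fasta')

def find_db_index(prefix: str) -> str:
    # The database argument is a prefix: one of the known index files
    # must exist alongside the embeddings.
    for ext in KNOWN_EXTS:
        if os.path.isfile(prefix + ext):
            return prefix + ext
    raise FileNotFoundError(
        f'could not find input file for {prefix}, '
        f'expecting one of the extensions {", ".join(KNOWN_EXTS)}')
```

So keeping database.fasta (or a database.csv index) next to the database embeddings directory should presumably satisfy the check.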
This will probably fix it: https://github.com/labstructbioinf/pLM-BLAST/issues/23#issuecomment-1777805451
Hi,
I have run pLM-BLAST on a FASTA file containing 1.7 million sequences to create embeddings of my query sequences:
python embeddings.py start uniprot.fasta uniprot.pt
and I am getting the error below. I have tried this on two servers, but I get the same response each time.
Please help me solve this.
Thanks in advance