Closed maovshao closed 1 year ago
Thank you for bringing this to our attention. We will look into it.
By the way, the command `embeddings.py database.csv database -embedder pt -cname sequence --gpu -bs -1 --asdir` looks weird.
It should start with `python embeddings.py`, and `embeddings.py` with `-bs -1` seems to cause the following error:

```
Traceback (most recent call last):
  File "embeddings.py", line 14, in
```

After I changed to `-bs 1`, it worked fine. I don't know if I understand it wrong here.

Thanks for the info. Indeed, there seems to be a problem with the adaptive batch size. We will fix it soon.
Hi guys, great to see your replies, thanks again for sharing such awesome work. In fact, I just ran through the pLM-BLAST Pipeline completely, including the pairwise alignment based on "Use in Python---Simple example" and "Searching a database". Next, I will share my steps, some problems I encountered, and my temporary solutions, hoping to help you improve your work.
The simple pairwise-alignment example ran without any problems. The database-search pipeline, however, gave me some difficulties; my steps, the problems I encountered, and my temporary solutions follow.
Error
When making a database:

```
warnings.warn(f"{seq.id} has characters that do not encode amino acids. The Sequence has not been added", UnexpectedCharSeq)
```

which further leads to:

```
ValueError: The length of the embedding file and the sequence df are different:
```

When searching:

```
./scripts/run_plm_blast.py", line 160, in aa_to_group
    assert False
```
Solution: add `X` to the following two pieces of code:

```python
if set(seq.seq) - set('QWERTYIPASDFGHKLCVNMX') != set():
    print(set(seq.seq) - set('QWERTYIPASDFGHKLCVNMX'))
```

```python
def aa_to_group(aa):
    for pos, g in enumerate(['GAVLI', 'FYW', 'CM', 'ST', 'KRH', 'DENQ', 'P', '-X']):
        g = list(g)
        if aa in g: return pos
    assert False
```
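With the `-X` group added, both patches accept unknown residues. A quick, self-contained check (plain strings stand in for the Biopython records pLM-BLAST actually uses):

```python
# The 20 standard amino acids plus 'X' for unknown residues.
VALID_AA = set('QWERTYIPASDFGHKLCVNMX')

def invalid_residues(sequence: str) -> set:
    """Return the characters falling outside the accepted alphabet."""
    return set(sequence.upper()) - VALID_AA

def aa_to_group(aa: str) -> int:
    """Map a residue to its physico-chemical group; 'X' shares the gap group."""
    for pos, g in enumerate(['GAVLI', 'FYW', 'CM', 'ST', 'KRH', 'DENQ', 'P', '-X']):
        if aa in g:
            return pos
    raise ValueError(f"unexpected residue: {aa!r}")

print(invalid_residues('MKTXAYW'))  # set(): 'X' is now accepted
print(aa_to_group('X'))             # 7: the gap/unknown group
```

(Raising `ValueError` instead of `assert False` also gives a clearer message when a truly unexpected character slips through.)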
Error
```
Traceback (most recent call last):
  File "embedders/parser.py", line 170, in make_iterator
    if startbatch[-1] != seqnum:
IndexError: list index out of range
```

Solution: after I changed to `-bs 1`, it worked fine. I don't know if I understand it wrong here.
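I have not dug into `make_iterator`, but the traceback suggests `startbatch` can end up empty when `-bs -1` is passed, so `startbatch[-1]` raises. A defensive pattern (hypothetical helper, not the actual pLM-BLAST code) would guard the empty case explicitly:

```python
def last_batch_start(startbatch: list):
    """Return the last recorded batch start index, or None when no batches
    were created (e.g. an invalid batch size produced an empty list),
    instead of letting startbatch[-1] raise IndexError."""
    return startbatch[-1] if startbatch else None

print(last_batch_start([]))           # None instead of IndexError
print(last_batch_start([0, 32, 64]))  # 64
```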
Error
```
RuntimeError: Calculated padded input size per channel: (25 x 64). Kernel size: (30 x 64). Kernel size can't be greater than actual input size
```

Solution: change the `kernel_size` below to 25:

```python
def chunk_cosine_similarity(query: th.Tensor,
                            targets: List[th.Tensor],
                            quantile, dataset_files: List[str],
                            stride=3, kernel_size=25) -> List[dict]:
```
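Hard-coding 25 works for this dataset but will break again on shorter sequences. A more general approach is to clamp the kernel to the shortest input; the sketch below (a hypothetical helper operating on sequence lengths rather than the embedding tensors themselves) shows the idea:

```python
def safe_kernel_size(target_lengths, requested: int = 30) -> int:
    """Clamp the convolution kernel so it never exceeds the shortest
    target, since a kernel cannot be larger than the input it slides over."""
    return min([requested, *target_lengths])

print(safe_kernel_size([25, 120, 300]))  # 25: limited by the shortest target
print(safe_kernel_size([500, 900]))      # 30: the requested default fits
```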
A few smaller issues: `embeddings.py` in the docs should be `python embeddings.py`; `dbtofile.py` should be `scripts/dbtofile.py`; the name `query_emb` is not defined; and there are exceptions caused by the inability to automatically process FASTA files with a length exceeding 1000, etc.
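On the last point, a possible workaround until long inputs are handled automatically is to window sequences before embedding. A minimal sketch (hypothetical helper, not part of pLM-BLAST; the 1000-residue cap and overlap value are assumptions):

```python
def window_sequence(seq: str, max_len: int = 1000, overlap: int = 100):
    """Split a long sequence into overlapping windows of at most max_len,
    so every residue still appears in at least one window."""
    if len(seq) <= max_len:
        return [seq]
    step = max_len - overlap
    return [seq[i:i + max_len] for i in range(0, len(seq) - overlap, step)]

print([len(c) for c in window_sequence('A' * 2500)])  # [1000, 1000, 700]
```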
Overall, pLM-BLAST is very good work; correcting the above issues will hopefully improve its user-friendliness and make it even better. Thank you all again and have a nice day. ☺️
Hi @maovshao, thanks for your feedback and involvement :). All bugs should be fixed now. Comments: `-1` was misleading; it should be set to `0`, for which `embeddings.py` uses an adaptive batch size to speed up embedding calculations. The description of `embeddings.py` now shows the correct help message. `chunk_cosine_similarity` should now set `kernel_size` on its own; the kernel must be no larger than the shortest sequence. I will leave this issue open for a while; if you have any suggestions or questions related to this, feel free to ask. :)
Thanks again for your quick response as always. I think this work is getting better. ☺️ One note, though: 'X' is common in many protein datasets, so supporting it out of the box would be valuable.
Thanks for your response!