Errors on creating a target dataset and searching against it.

maovshao commented 1 year ago

Fail to compile a FASTA sequence containing the 'X' token.

Error report:

warnings.warn(f"{seq.id} has characters that do not encode amino acids. The Sequence has not been added", UnexpectedCharSeq)
{'X'}

This then caused the following problem when I tried to create a target dataset:

ValueError: The length of the embedding file and the sequence df are different.

However, 'X' is common in many protein datasets.

Mistake in ./scripts/run_plm_blast.py:

Traceback (most recent call last):
File "...", line 214, in
    raise ValueError(f'The length of the embedding file and the sequence df are different: {query_df.shape[0]} != {len(query_emb)}')
NameError: name 'query_emb' is not defined.

Thanks for your response!

staszekdh commented 1 year ago

Thank you for bringing this to our attention. We will look into it.

maovshao commented 1 year ago

By the way, the command

embeddings.py database.csv database -embedder pt -cname sequence --gpu -bs -1 --asdir

looks odd. Why is it embeddings.py rather than python embeddings.py?

In addition, setting -bs -1 seems to cause the following error:

Traceback (most recent call last):
File "embeddings.py", line 14, in
    df, num_batches = prepare_dataframe(df, args.batch_size, args.truncate)
File "embedders/parser.py", line 153, in prepare_dataframe
    batch_iterator = make_iterator(df['seqlens'].tolist(), batch_size)
File "embedders/parser.py", line 170, in make_iterator
    if startbatch[-1] != seqnum:
IndexError: list index out of range

After I changed it to -bs 1, it worked fine. I am not sure whether I have misunderstood the option.

staszekdh commented 1 year ago

Thanks for the info. Indeed, there seems to be a problem with the adaptive batch size. We will fix it soon.

maovshao commented 1 year ago

Hi guys, great to see your replies, and thanks again for sharing such awesome work. I have just run through the complete pLM-BLAST pipeline, including the pairwise alignment from "Use in Python---Simple example" and "Searching a database". Below I share my steps, the problems I encountered, and my temporary solutions, hoping they help you improve your work.

1. Use in Python---Simple example

This ran successfully; I had no problems here.

2. Searching a database

I encountered some difficulties in this pipeline. The problems and my temporary solutions are listed below.

Failed to make database, cannot recognize X

Error

#When making a database:
warnings.warn(f"{seq.id} has characters that do not encode amino acids. The Sequence has not been added", UnexpectedCharSeq)
which in turn leads to
ValueError: The length of the embedding file and the sequence df are different:

#When searching:
./scripts/run_plm_blast.py", line 160, in aa_to_group
     assert False

Solution: Add X to the following two pieces of code

if set(seq.seq) - set('QWERTYIPASDFGHKLCVNMX') != set():
    print(set(seq.seq) - set('QWERTYIPASDFGHKLCVNMX'))

def aa_to_group(aa):
    for pos, g in enumerate(['GAVLI', 'FYW', 'CM', 'ST', 'KRH', 'DENQ', 'P', '-X']):
        g = list(g)
        if aa in g: return pos
    assert False
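The same workaround can be sketched with a shared residue set and a lookup table, so that the FASTA check and the grouping cannot drift apart. This is a minimal sketch, not pLM-BLAST's own code: the names VALID_RESIDUES, GROUPS, AA_TO_GROUP and unexpected_chars are hypothetical, and placing 'X' in the gap group is the assumption made in the workaround above.

```python
# Hypothetical names; not taken from the pLM-BLAST code base.
# The 20 standard amino acids plus 'X' (unknown residue).
VALID_RESIDUES = frozenset("QWERTYIPASDFGHKLCVNMX")

# Same groups as in the workaround above; 'X' shares the gap group (assumption).
GROUPS = ["GAVLI", "FYW", "CM", "ST", "KRH", "DENQ", "P", "-X"]
AA_TO_GROUP = {aa: pos for pos, group in enumerate(GROUPS) for aa in group}


def unexpected_chars(sequence: str) -> set:
    """Return the characters of `sequence` that are not accepted residues."""
    return set(sequence.upper()) - VALID_RESIDUES


def aa_to_group(aa: str) -> int:
    """Map a residue to its group index, with a descriptive error if unknown."""
    try:
        return AA_TO_GROUP[aa]
    except KeyError:
        raise ValueError(f"unexpected residue {aa!r}") from None
```

Raising ValueError with the offending character, rather than a bare assert False, makes it easier to see which residue caused the search to fail.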

batch_size is set to -1

Error

Traceback (most recent call last):
File "embedders/parser.py", line 170, in make_iterator
     if startbatch[-1] != seqnum:
     IndexError: list index out of range

Solution: Changing to -bs 1 worked. I am not sure whether I misunderstood the option.
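An early check on the batch size would turn the IndexError into a clearer message. The following is a minimal sketch of fixed-size batching only; batch_slices is a hypothetical helper, not the make_iterator function from embedders/parser.py, and it does not model any adaptive batching mode.

```python
from typing import Iterator


def batch_slices(num_sequences: int, batch_size: int) -> Iterator[slice]:
    """Yield slices that cover range(num_sequences) in chunks of batch_size."""
    # Reject zero or negative values up front instead of failing later
    # with an index error on an empty list of batch boundaries.
    if batch_size < 1:
        raise ValueError(f"batch_size must be a positive integer, got {batch_size}")
    for start in range(0, num_sequences, batch_size):
        # The last slice may be shorter than batch_size.
        yield slice(start, min(start + batch_size, num_sequences))
```

For example, list(batch_slices(5, 2)) gives three slices, the last one holding a single sequence, while batch_slices(5, -1) raises ValueError as soon as it is iterated.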

Kernel size conflict

Error

RuntimeError: Calculated padded input size per channel: (25 x 64). Kernel size: (30 x 64). Kernel size can't be greater than actual input size

Solution: Change the kernel_size below to 25

def chunk_cosine_similarity(query : th.Tensor,
                             targets : List[th.Tensor],
                             quantile, dataset_files : List[str],
                             stride = 3, kernel_size = 25) -> List[dict]:
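Instead of hard-coding 25, the kernel size could be limited by the shortest input. Below is a minimal sketch of that idea; effective_kernel_size is a hypothetical helper, not part of pLM-BLAST, and it assumes the sequence lengths are known before the convolution and that no extra padding is applied.

```python
from typing import Sequence


def effective_kernel_size(requested: int, seq_lengths: Sequence[int]) -> int:
    """Return a kernel size that is no larger than the shortest sequence."""
    if not seq_lengths:
        raise ValueError("seq_lengths must not be empty")
    shortest = min(seq_lengths)
    if shortest < 1:
        raise ValueError("sequence lengths must be positive")
    # A kernel longer than the input raises a RuntimeError in the
    # convolution, so cap the requested size at the shortest length.
    return min(requested, shortest)
```

With a requested size of 30 and a shortest sequence of length 25, this returns 25, which matches the manual change above.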

Some other small issues that may confuse users

embeddings.py should be invoked as python embeddings.py; dbtofile.py should be scripts/dbtofile.py; the name query_emb is not defined in ./scripts/run_plm_blast.py; exceptions are raised because FASTA entries with a length exceeding 1000 are not processed automatically; etc.

Overall, pLM-BLAST is very good work, and correcting the above issues should make it more user-friendly. Thank you all again and have a nice day. ☺️

Argusmocny commented 1 year ago

Hi @maovshao, thanks for your feedback and involvement :). All bugs should be fixed now. Comments:

The batch size of -1 was misleading; it should be set to 0, in which case embeddings.py uses an adaptive batch size to speed up embedding calculations. The description of embeddings.py now shows the correct help message.
chunk_cosine_similarity should now set kernel_size on its own; the kernel must be no larger than the shortest sequence.

I will leave this issue open for a while. If you have any suggestions or questions related to this, feel free to ask. :)

maovshao commented 1 year ago

Thanks again for your quick response as always. I think this work is getting better. ☺️

labstructbioinf / pLM-BLAST

Errors on creating a target dataset and searching against it. #9
