labstructbioinf / pLM-BLAST

Detection of remote homology by comparison of protein language model representations
https://toolkit.tuebingen.mpg.de/tools/plmblast
MIT License

Sequence names not in plmblast.py output #29

Closed seanrjohnson closed 11 months ago

seanrjohnson commented 1 year ago

When I run plmblast.py (with some patches to make it not crash, see #23), the output qid and sid are numbers. It would be great if, instead of numbers, they were the actual names of the sequences as taken from the original fasta files.

See this screenshot of my output: if I want to know what my top hits are, I have to look through the database fasta file and figure out what the 0th, 3234th, 2172nd, etc. sequences are, which seems inconvenient. [image]

Argusmocny commented 1 year ago

This will be fixed in the upcoming update (this week). Thanks for pointing this out.

Argusmocny commented 11 months ago

@seanrjohnson fixed

papelypluma commented 11 months ago

Hi @Argusmocny. I only realized that the qid and sid in my output were numbers after running the job for almost 2 weeks. I understand that there has been a fix/enhancement for this, but to come up with a quick solution on my end (without re-running the job) I'd like to ask whether the qid and sid numbers are 0-based indices following the order in which the sequences appear in the fasta files I used to build the query and the reference database.

Thanks in advance!

Edit: I am re-running anyway after realizing that there was something wrong with my query input earlier. In retrospect, the default number of workers (=10) was probably one of the reasons why the job took so long to complete. Now running with the most recent git commit.

Argusmocny commented 11 months ago

Yes, sid and qid are just the positions in the fasta/csv input file, starting from zero.
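
For anyone with older output who does not want to re-run, the 0-based mapping described above can be applied after the fact. The following is only a minimal sketch, not part of pLM-BLAST: it assumes the inputs were plain FASTA files, that the hits have been loaded as dictionaries with `qid` and `sid` keys, and that every `>` record in the file corresponds to one index. The helper names `read_fasta_names` and `attach_names`, and the added `qname`/`sname` keys, are hypothetical.

```python
from typing import Dict, Iterable, List


def read_fasta_names(lines: Iterable[str]) -> List[str]:
    """Return sequence names in file order (first token of each '>' header)."""
    names: List[str] = []
    for line in lines:
        if line.startswith(">"):
            header = line[1:].strip()
            # Keep only the identifier before the first whitespace.
            names.append(header.split()[0] if header else "")
    return names


def attach_names(
    hits: Iterable[Dict],
    query_names: List[str],
    db_names: List[str],
) -> List[Dict]:
    """Copy each hit and add qname/sname looked up by 0-based qid/sid."""
    named: List[Dict] = []
    for hit in hits:
        row = dict(hit)  # leave the caller's data untouched
        row["qname"] = query_names[int(row["qid"])]
        row["sname"] = db_names[int(row["sid"])]
        named.append(row)
    return named


if __name__ == "__main__":
    # Tiny in-memory stand-ins for the query and database FASTA files.
    query_fasta = [">query1", "MKV"]
    db_fasta = [">seqA first", "AAA", ">seqB", "CCC", ">seqC", "GGG"]
    hits = [{"qid": 0, "sid": 2}]  # hypothetical hit rows
    for row in attach_names(
        hits, read_fasta_names(query_fasta), read_fasta_names(db_fasta)
    ):
        print(row["qname"], row["sname"])
```

With real files, `read_fasta_names(open("db.fasta"))` would be used in place of the in-memory lists (the path is a placeholder). If the database was built from a csv file instead, the same idea applies with the row position in place of the FASTA record position.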