daphnedemekas opened 1 year ago
Thanks for trying out the code and reporting problems!
I've fixed the problems in the readme you described. Sorry about the missing bio_embeddings; I was working on bio_embeddings while also working on the paper and had my dev install. Regarding `poetry add bio_embeddings`:
this was a bug in poetry 1.4.1 that has just been fixed in 1.4.2. I think `pip install bio_embeddings[transformers]`
is the best way to install it at the moment; I've also added that to the readme.
Finally, I am a little bit confused as to whether you are using poetry or pip - it seems you are using poetry, but you are running directly with python (`python -m cath.embed-all`) rather than `poetry run python -m cath.embed-all`. Is that just because you are running that command from within the poetry virtual environment?
Yes, I normally have the environment activated in the shell I work in. I've added `poetry shell`
to the readme to make this clearer. I generally tried to do everything through poetry, but in some cases `pip install`
makes more sense than fighting with the version resolver.
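Whether a plain `python -m ...` hits the right interpreter depends on whether the shell already has the project's virtual environment activated (e.g. via `poetry shell`). A quick way to check from Python itself; `in_virtualenv` is a hypothetical helper written for this sketch, not part of the repo:

```python
import sys

def in_virtualenv() -> bool:
    """True when running inside an activated virtual environment.

    Inside a venv (including poetry-managed ones), sys.prefix points at
    the environment while sys.base_prefix points at the base interpreter;
    outside an environment the two are identical.
    """
    return sys.prefix != sys.base_prefix

print(in_virtualenv())
```

If this prints `True`, then `python -m cath.embed_all` and `poetry run python -m cath.embed_all` end up using the same interpreter.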
Thanks so much! That's really helpful.
I will say that for me, I had to `pip install bio_embeddings[all]` rather than `bio_embeddings[transformers]` in order to be able to import CPCProt.
I also had one other question - what are your thoughts on being able to run this on UniRef data? I can see that for CATH, for example, your input is FASTA files plus a domain list txt file - would it be possible to instead run it on FASTA files from UniRef along with some labelled alignments, from HMMER for instance? How difficult do you think that would be?
Thanks!
Daphne
The second benchmark I do uses Pfam data, where the sequences are also sourced from UniProt, so other UniProt subsets such as UniRef should work as well.
The main differences will be how much embedding time and disk space this takes, and that you'll likely want to pick other index settings (see the faiss wiki on how to pick an index). What you should consider, though, is that you need some kind of postprocessing step for full-length sequences, or you will be both less sensitive and slower than plain MMseqs2; an embedding-based alignment would be ideal. The other caveats documented for full-length sequences also apply.
If on the other hand you just want to benchmark a new protein language model, I'd try the CATH one first and only do Pfam eventually; the CATH benchmark is much more pleasant to work with (due to its size, and because the Pfam annotations are more complex).
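At its core, the lookup that a faiss index accelerates is just nearest-neighbour search over fixed-size embedding vectors. A toy brute-force sketch of that idea, with hypothetical names and 3-d vectors (real per-sequence embeddings are much larger, e.g. 1024-d):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def knn(query, database, k):
    """Ids of the k database embeddings most similar to the query."""
    ranked = sorted(database, key=lambda name: cosine(query, database[name]),
                    reverse=True)
    return ranked[:k]

# Toy embedding database; a faiss index replaces this linear scan once
# the database grows to millions of entries.
database = {
    "domainA": [1.0, 0.0, 0.0],
    "domainB": [0.9, 0.1, 0.0],
    "domainC": [0.0, 1.0, 0.0],
}
print(knn([1.0, 0.05, 0.0], database, k=2))  # → ['domainA', 'domainB']
```

For full-length sequences, the hit list from such a lookup would then need a postprocessing step (e.g. an alignment) to stay competitive with MMseqs2, per the caveats above.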
Yes I see, thanks that's very helpful.
I did try to run the second benchmark starting with `python -m pfam.prepare_subset10_full_sequences`,
but I get the error
`No such file or directory: 'knn-for-homology/pfam/subset10/test.fasta'`
Is that because I'm meant to run that on my own data, did I miss where to download that dataset, or is that a bug?
One other quick question: do you know how I can change the default cache directory where the bio_embeddings model files are stored? My current cache directory is filling up, but I could save the models elsewhere, where I won't have that problem. No worries if you're not sure about that one.
Thanks very much
Daphne
Hi
I really enjoyed your paper and this work. I was trying to get it running locally and I ran into a few issues.
Firstly, I would point out that I think there may be a few typos in your README. First of all, under Setup, I think
`pip install -U git+https://github.com/konstin/protein-knn`
should be
`pip install git+https://github.com/konstin/knn-for-homology`
Second, under CATH20, I think
`python -m cath.embed-all`
should be
`python -m cath.embed_all`
Also, when trying to run it, it seems that there is a missing requirement, `bio_embeddings`, which is not in the `pyproject.toml` file. When trying to `poetry add` it there were a few dependency issues, namely with `scikit-learn` and `python`, so I had to change the versions to `python = ">3.8,<3.10"` and I had to downgrade `scikit-learn`.
After this, when trying to `poetry add bio_embeddings` I still had an issue with pytorch which I have not yet resolved. I am wondering how you installed `bio_embeddings`?
Finally, I am a little bit confused as to whether you are using poetry or pip - it seems you are using poetry, but you are running directly with python (`python -m cath.embed-all`) rather than `poetry run python -m cath.embed-all`. Is that just because you are running that command from within the poetry virtual environment?
Thanks very much!
Daphne