konstin / knn-for-homology

Repository for "Nearest neighbor search on embeddings rapidly identifies distant protein relations"

Issues when trying to get this running #1

Open daphnedemekas opened 1 year ago

daphnedemekas commented 1 year ago

Hi

I really enjoyed your paper and this work. I was trying to get it running locally and I ran into a few issues.

Firstly, I would point out that there may be a few typos in your README. Under Setup, I think pip install -U git+https://github.com/konstin/protein-knn should be pip install git+https://github.com/konstin/knn-for-homology

Second, under CATH20, I think python -m cath.embed-all should be python -m cath.embed_all

Also, when trying to run it, it seems there is a missing requirement, bio_embeddings, which is not in the pyproject.toml file. When trying to poetry add it there were a few dependency conflicts, namely with scikit-learn and python, so I had to change the python constraint to ">=3.8,<3.10" and downgrade scikit-learn.
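For anyone hitting the same conflicts, the kind of change I mean in pyproject.toml looks roughly like this (the scikit-learn pin is illustrative; use whatever version the resolver accepts):

```toml
[tool.poetry.dependencies]
# Narrowed so the resolver can satisfy the transitive pins;
# the scikit-learn bound below is an example, not the exact version I used.
python = ">=3.8,<3.10"
scikit-learn = "<1.1"
```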

After this, when trying to poetry add bio_embeddings I still had an issue with pytorch which I have not yet resolved:

 _WheelFileValidationError

  ["In /home/u10/daphnedemekas/.cache/pypoetry/artifacts/33/6e/26/873c11e4be52d7e66c1bd11f1e5def02721ba779e8b38968cc23186dc3/torch-1.10.0-cp39-cp39-manylinux1_x86_64.whl, hash / size of torch-1.10.0.dist-info/METADATA didn't match RECORD"]

I am wondering how you installed bio_embeddings?

Finally, I am a little bit confused as to whether you are using poetry or pip - as it seems you are using poetry, but you are running directly with python (python -m cath.embed-all) rather than poetry run python -m cath.embed-all. Is that just because you are running that command from within the poetry virtual environment?

Thanks very much!

Daphne

konstin commented 1 year ago

Thanks for trying out the code and reporting problems!

I've fixed the problems in the readme you described. Sorry about the missing bio_embeddings; I was working on bio_embeddings alongside the paper and had a dev install. Regarding poetry add bio_embeddings, this was a bug in poetry 1.4.1 that was just fixed in 1.4.2. I think pip install bio_embeddings[transformers] is the best way to install it at the moment; I've also added that to the readme.

> Finally, I am a little bit confused as to whether you are using poetry or pip - as it seems you are using poetry, but you are running directly with python (python -m cath.embed-all) rather than poetry run python -m cath.embed-all. Is that just because you are running that command from within the poetry virtual environment?

Yes, I normally have the environment activated in the shell I work in. I've added poetry shell to the readme to make this clearer. I generally tried to do everything through poetry, but in some cases pip install makes more sense than fighting with the version resolver.
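So the workflow the readme assumes is something like the following (commands as in the readme; poetry shell spawns a subshell with the project's virtualenv activated, so plain python resolves to it):

```shell
# one-time setup
pip install git+https://github.com/konstin/knn-for-homology
pip install "bio_embeddings[transformers]"

# activate the poetry-managed virtualenv, then plain `python` works
poetry shell
python -m cath.embed_all

# equivalent without activating the environment first
poetry run python -m cath.embed_all
```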

daphnedemekas commented 1 year ago

Thanks so much! That's really helpful.

I will say that for me, I had to pip install bio_embeddings[all] rather than bio_embeddings[transformers] in order to be able to import CPCProt.

I also did have one other question - what are your thoughts on being able to run this on UniRef data? I can see that for CATH for example your input is FASTA files as well as a domain list txt file - would it be possible to instead run it on fasta files from UniRef along with some labelled alignments from HMMER for instance? How difficult do you think that would be?

Thanks!

Daphne

konstin commented 1 year ago

The second benchmark I do uses Pfam data, where the sequences are also sourced from UniProt, so other UniProt subsets such as UniRef should work as well.

The main differences will be how much embedding time and disk space this takes, and that you'll likely want to pick a different index setting (see the faiss wiki on how to pick an index). What you should consider though is that you need some kind of postprocessing step for full-length sequences, or you will be both less sensitive and slower than plain MMseqs2; an embedding-based alignment would be ideal. The other caveats documented for full-length sequences also apply.
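To make the index question concrete, here is a minimal NumPy sketch of what an exact (flat) inner-product search over normalized embeddings computes; a faiss index gives you the same result (or an approximation of it) at much larger scale. All shapes and variable names here are illustrative, not taken from the repository:

```python
import numpy as np

rng = np.random.default_rng(0)
db = rng.normal(size=(1000, 128)).astype(np.float32)      # database embeddings
queries = rng.normal(size=(5, 128)).astype(np.float32)    # query embeddings

# Normalize so the inner product equals cosine similarity,
# analogous to using an inner-product index over unit vectors.
db /= np.linalg.norm(db, axis=1, keepdims=True)
queries /= np.linalg.norm(queries, axis=1, keepdims=True)

k = 10
scores = queries @ db.T                       # (5, 1000) similarity matrix
idx = np.argsort(-scores, axis=1)[:, :k]      # top-k neighbor indices per query
print(idx.shape)  # (5, 10)
```

At UniRef scale this brute-force matrix product is exactly what becomes too slow, which is why picking an approximate faiss index (and then re-scoring or aligning the candidates) matters.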

If on the other hand you just want to benchmark a new protein language model, I'd try the CATH one first and only do Pfam eventually; the CATH benchmark is much more pleasant to work with (due to its size, and because the Pfam annotations are more complex).

daphnedemekas commented 1 year ago

Yes I see, thanks that's very helpful.

I did try to run the second benchmark starting with python -m pfam.prepare_subset10_full_sequences but I get the error No such file or directory: 'knn-for-homology/pfam/subset10/test.fasta'

Is that because I'm meant to run that on my own data, or did I miss something from where to download that dataset, or is that a bug?

One other quick question I had was whether you're aware of how I can change the default cache directory where the bio_embeddings model files are stored. My current cache directory is filling up, but I could save them elsewhere where I won't have that problem. No worries if you're not sure about that one.

Thanks very much

Daphne

zdk123 commented 7 months ago

@daphnedemekas It looks like this load_files function will download that file.