issue with dependencies

jahnl commented 2 years ago

Hello, I got access to this repository for my master practicum at Rostlab. I just wanted to mention that I had some issues with this project's dependencies. (Occurred in both Windows and Linux.) For example:

Installing gensim was only possible on Python 3.8 (https://stackoverflow.com/questions/66958119/error-when-installing-gensim-using-pip-install).
There were also problems with the bioembeddings extras:

"bio_embeddings.utilities.exceptions.InvalidParameterError: The extra for the protocol prottrans_t5_xl_u50 is missing. See https://docs.bioembeddings.com/#installation on how to install all extras" So I went over to the bioembeddings repo, and followed the instructions. (pip install bio-embeddings[all]) However, I got the warning that "bio-embeddings 0.1.3 does not provide the extra 'all'", and afterwards I still got the InvalidParameterError from before.

Since the provided biotrainer configurations aren't that compatible with my ML task anyway, I decided to implement my own ML model, but for future users maybe consider to provide a Docker image or something similar :)

sacdallago commented 2 years ago

Hi, I'm not aware of any dependencies currently being explicitly mentioned (in a requirements.txt file or setup.py, say), as we first wanted to build out the core functionality. You raise a good point, however. But let's go step by step:

So I went over to the bioembeddings repo, and followed the instructions. (pip install bio-embeddings[all]) However, I got the warning that "bio-embeddings 0.1.3 does not provide the extra 'all'", and afterwards I still got the InvalidParameterError from before.

This is super odd. How did you end up installing v0.1.3?! On PyPI the current version is 0.2.2, and if you install from GH you should get v0.2.3. Like: I don't see how you could have gotten v0.1.3 :D

Since the provided biotrainer configurations aren't that compatible with my ML task anyway, I decided to implement my own ML model

Could you describe your task? :) and/or model. Still building things out, makes sense to include whatever is needed by the lab ;)

but for future users maybe consider to provide a Docker image or something similar :)

For sure :) but as mentioned, we don't even specify requirements explicitly yet -- there's some way to go from current place to Dockerfile

@SebieF Are you busy? ;) I think this issue raises an important point on reproducibility. If you don't have any other tasks you'd like to be working on, I'd suggest fixing the requirements and creating a .toml file with poetry the way it's done in bio_embeddings (useful link: https://python-poetry.org/docs/basic-usage/#initialising-a-pre-existing-project). This will also mean getting rid of the setup.py file, which was a placeholder anyway.

I'd create extras around bio-embeddings (basically, pip install biotrainer[XXXX]):

prottrans --> bio-embeddings[prottrans]
esm --> bio-embeddings[esm]

If someone wants to run other bio-embeddings models, they'll have to manually install the necessary dependencies. I would not create an all option; it causes more headaches than it solves.

Finally, if you get around all of this (sounds like a lot, but it shouldn't be that major), you could consider setting up a Dockerfile, and again, here you can pretty much get inspired by bio-embeddings: https://github.com/sacdallago/bio_embeddings/blob/develop/Dockerfile

I think I need to start thinking about CI...

sacdallago commented 2 years ago

Oh! And we should only support python 3.8 and 3.9!

jahnl commented 2 years ago

Hi,

This is super odd. How did you end up installing v0.1.3?! On PyPI the current version is 0.2.2, and if you install from GH you should get v0.2.3. Like: I don't see how you could have gotten v0.1.3 :D

Apparently the message doesn't work the way it should. I looked up the version in the METADATA file and it's 0.2.2, as it should be :)
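For anyone who hits the same confusing pip warning: a minimal sketch of checking what is actually installed without opening the METADATA file by hand. It uses only the standard library's importlib.metadata (Python 3.8+), which reads that same file. The helper name describe_install is made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, metadata, version


def describe_install(package: str) -> dict:
    """Report the installed version of a package and the extras it declares."""
    try:
        installed_version = version(package)
    except PackageNotFoundError:
        # The distribution is not installed in the current environment.
        return {"installed": False, "version": None, "extras": []}

    # "Provides-Extra" lists every extra the installed distribution declares;
    # get_all() returns None when the field is absent.
    extras = metadata(package).get_all("Provides-Extra") or []
    return {"installed": True, "version": installed_version, "extras": sorted(extras)}


if __name__ == "__main__":
    print(describe_install("bio-embeddings"))
```

If the extra you asked pip for is not in the reported list, the installed release simply does not declare it, whatever version number the warning prints.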

Could you describe your task? :) and/or model. Still building things out, makes sense to include whatever is needed by the lab ;)

Sure. The task is to predict binding residues in disordered regions of proteins, and the binding partners' classes. Since there may be several classes (e.g. both other proteins and nucleic acids), the model should implement multi-label classification or one has to transform the problem. I already have the protein embeddings, so it would be great to have the option to provide your own embeddings to biotrainer as well. Apart from the embeddings themselves I will (probably) input the information, which residues are in disordered regions. So maybe it would be useful to have the option to provide some confounding features, too.
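To make the two options concrete, here is a minimal sketch of both label encodings for per-residue targets: a multi-hot vector per residue (multi-label classification), and the "transform the problem" route where every combination of classes becomes its own single class (a label powerset). The class names and function names are placeholders chosen for illustration, not anything biotrainer defines.

```python
from typing import List, Sequence, Set

# Hypothetical binding partner classes, for illustration only.
CLASSES = ("protein", "nucleic_acid")


def to_multi_hot(labels: Sequence[Set[str]], classes: Sequence[str] = CLASSES) -> List[List[int]]:
    """Multi-label encoding: one 0/1 entry per class for every residue."""
    return [[int(c in residue) for c in classes] for residue in labels]


def to_label_powerset(labels: Sequence[Set[str]], classes: Sequence[str] = CLASSES) -> List[int]:
    """Problem transformation: each class combination maps to one integer class.

    Bit i of the result is set when classes[i] is present, so 0 means
    "no binding" and 2**len(classes) - 1 means "binds every class".
    """
    return [
        sum(1 << i for i, c in enumerate(classes) if c in residue)
        for residue in labels
    ]


# One set of binding partner classes per residue of a short example sequence.
residue_labels = [set(), {"protein"}, {"protein", "nucleic_acid"}, {"nucleic_acid"}]
multi_hot = to_multi_hot(residue_labels)
powerset_ids = to_label_powerset(residue_labels)
```

The multi-hot form needs a model with one independent output per class, while the powerset form works with any ordinary single-label classifier at the cost of a class count that grows exponentially with the number of partner classes.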

For sure :) but as mentioned, we don't even specify requirements explicitly yet -- there's some way to go from current place to Dockerfile

Great 👍

SebieF commented 2 years ago

Oh! And we should only support python 3.8 and 3.9!

That will be problematic because bio_embeddings also supports python 3.7.1 and we depend on bio-embeddings. Anyways, I'm looking into the dependency management this week :)
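Whichever range ends up being supported, a small runtime guard can make an unsupported interpreter fail early instead of deep inside a dependency install. This is only a sketch: the SUPPORTED tuple below encodes the 3.8/3.9 proposal from this thread as an assumption, and is_supported is a made-up helper name.

```python
import sys
from typing import Sequence, Tuple

# Assumption: the (major, minor) versions proposed in this thread.
SUPPORTED: Tuple[Tuple[int, int], ...] = ((3, 8), (3, 9))


def is_supported(version_info: Sequence[int] = sys.version_info,
                 supported: Sequence[Tuple[int, int]] = SUPPORTED) -> bool:
    """Return True if the interpreter's (major, minor) version is supported."""
    return tuple(version_info[:2]) in supported


if __name__ == "__main__":
    if not is_supported():
        wanted = ", ".join(f"{major}.{minor}" for major, minor in SUPPORTED)
        sys.exit(f"Unsupported Python {sys.version.split()[0]}; expected one of: {wanted}")
```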

SebieF commented 2 years ago

I already have the protein embeddings, so it would be great to have the option to provide your own embeddings to biotrainer as well.

@jahnl The option already exists :) Just specify the path to your embeddings in the config like this: embeddings_file_path: /path/to/embeddings.h5 # optional, if defined will use 'embedder_name' to name experiment

Apart from the embeddings themselves I will (probably) input the information, which residues are in disordered regions. So maybe it would be useful to have the option to provide some confounding features, too.

That's a very interesting feature, but we will have to evaluate if we can support it (handling arbitrary input will not be easy). Thanks for your early feedback, please always feel free to create an issue or even write an E-Mail (sebastian.franz@tum.de), if you have any questions or problems!

jahnl commented 2 years ago

The option already exists :) Just specify the path to your embeddings in the config like this: embeddings_file_path: /path/to/embeddings.h5 # optional, if defined will use 'embedder_name' to name experiment

I see, thanks for pointing it out :)

sacdallago commented 2 years ago

Apart from the embeddings themselves I will (probably) input the information, which residues are in disordered regions. So maybe it would be useful to have the option to provide some confounding features, too.

That's a very interesting feature, but we will have to evaluate if we can support it (handling arbitrary input will not be easy). Thanks for your early feedback, please always feel free to create an issue or even write an E-Mail (sebastian.franz@tum.de), if you have any questions or problems!

A note here: you DON'T need to change models or code for that, you only need to change the embeddings file!!!! It's a much easier problem to solve than you'd think.

Effectively, you are adding one feature to your input, so your model will get 1024+1 features. Since the current pipeline automatically infers the feature size from the "embedding dimension" anyway, all you need to do is add the feature to each residue in the embedding file, or to each sequence embedding in the h5 set, depending on whether you have a residueX_to_Y or sequence_to_Z problem.

Concretely:

You have an embeddings h5 file containing residue embeddings of size D for one sequence of length L (file[sequence].shape = (LxD)).
You have some other file collecting 5 additional per-residue features for the same L residues (file[sequence].shape = (Lx5)).
You open both files in Python: for each key [dataset] in the h5 file, find the corresponding key in the additional features dataset and concatenate the two --> new_embedding_file[sequence].shape = (Lx(D+5)).
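The steps above can be sketched roughly as follows, assuming numpy and h5py are installed, that the additional features are also stored in an h5 file, and that both files use the same dataset keys. The function names are made up for illustration, and the paths are whatever your files are called. This covers the per-residue case; for per-sequence embeddings of shape (D,) you would concatenate along the only axis instead.

```python
import numpy as np


def concat_features(embedding: np.ndarray, extra: np.ndarray) -> np.ndarray:
    """Append per-residue features (L x K, or just L) to an embedding (L x D)."""
    embedding = np.asarray(embedding)
    # Cast the extra features to the embedding dtype so the result stays uniform.
    extra = np.asarray(extra, dtype=embedding.dtype)
    if extra.ndim == 1:
        # A single feature per residue: turn shape (L,) into (L, 1).
        extra = extra[:, None]
    if embedding.shape[0] != extra.shape[0]:
        raise ValueError(
            f"Length mismatch: embedding has {embedding.shape[0]} residues, "
            f"extra features have {extra.shape[0]}"
        )
    return np.concatenate([embedding, extra], axis=1)


def merge_h5(embeddings_path: str, features_path: str, output_path: str) -> None:
    """Write a new h5 file in which every dataset has shape (L x (D + K))."""
    # Imported here so concat_features can be used without h5py installed.
    import h5py

    with h5py.File(embeddings_path, "r") as emb, \
            h5py.File(features_path, "r") as feat, \
            h5py.File(output_path, "w") as out:
        for key in emb.keys():
            if key not in feat:
                raise KeyError(f"No additional features found for '{key}'")
            merged = concat_features(emb[key][()], feat[key][()])
            dataset = out.create_dataset(key, data=merged)
            # Carry over any attributes stored on the original dataset.
            for name, value in emb[key].attrs.items():
                dataset.attrs[name] = value
```

The new file can then be passed via embeddings_file_path as before; nothing else in the config has to change, because the feature size is read from the embeddings themselves.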

jahnl commented 2 years ago

Effectively, you are adding one feature to your input, so your model will get 1024+1 features. Since the current pipeline automatically infers the feature size from the "embedding dimension" anyway, all you need to do is add the feature to each residue in the embedding file, or to each sequence embedding in the h5 set, depending on whether you have a residueX_to_Y or sequence_to_Z problem.

Yes, that is exactly what I did when implementing my own model yesterday :) But thanks for your opinion and the detailed explanation.

sacdallago commented 2 years ago

Excellent :) if you are able to get this running with biotrainer, it would be great to have your project as a “complex” example — linked somewhere in the readme or in the examples folder, either way :)

sacdallago / biotrainer

issue with dependencies #9