kuleshov-group / caduceus

Bi-Directional Equivariant Long-Range DNA Sequence Modeling
Apache License 2.0
148 stars 23 forks source link

Refactor/create package #1

Open jorenretel opened 7 months ago

jorenretel commented 7 months ago

Hi Yair,

very nice work. I have a suggestion that I wrapped here in this pull request. By a little bit of restructuring and adding setup.cfg and pyproject.toml files, the code now becomes a package that can be installed like:

pip install -e .

(when located in the repo root). This makes the namespacing a bit easier and prevents needing that manually setting the src dir on the PYTHONPATH that you are doing here:

https://github.com/kuleshov-group/caduceus/blob/5ba5bfdd1e34004c450176f7c7143b47a3ce659b/setup_env.sh#L18

I also put the "caduceus" HuggingFace code inside that package so that there is no name clash between the new "caduceus" python package and the "caduceus" directory.

This is just a suggestion. Hope you will consider it.

Best,

Joren

yair-schiff commented 6 months ago

Hi @jorenretel, thank you very much for this suggestion. I appreciate the pr. Were you able to verify that the code runs as expected with the proposed changes?

jorenretel commented 6 months ago

Hi @yair-schiff, thanks for your answer. Sorry for getting back to you only now. I had a major life event happening last week, so I was kind of distracted ;-). Everything was working I think, but I had some problems installing Mamba as this was not trivial. I will merge you latest changes into my branch, check again and get back to you.

jorenretel commented 6 months ago

I figured out why my mamba-ssm install was not working. Had a bad pip cache laying around. My bad.

What I did to verify whether things are working in this refactored version is to run train.py as you indicated. After fixing a few settings, this is working:

image

The environment that I am using for this, is based on this environment.yml :

name: caduceus-env
channels:
  - conda-forge
  - pytorch
  - nvidia
dependencies:
  - python==3.10
  - pip
  - nvidia::cuda-nvcc=12.1
  - pytorch::pytorch-cuda=12.1
  - pytorch::pytorch
  - pandas
  - pip:
      - -e .

That last line installs the caduceus library in editable form in the conda environment (assuming that environment.yml is located in the root of the repository). The other pip dependencies of caduceus are then also automatically installed, because they are listed in the setup.cfg .

Note: I can also move src/caduceus dir one level up to omit one level of nested directories. It's probably a matter of taste whether to do this or not. I initially didn't because there was already a directory called "caduceus", but in the end had to move that inside the library as well because other python files were importing things from there. I would be happy to make that one more refactor to omit the "src" dir from the tree if you prefer that.