rcsb / py-rcsb_db

RCSB Python Database Utility Classes
Other
11 stars 5 forks source link

Bringing in the mmseqs clustering workflow procedure as a CLI - RO-4226 #69

Closed josemduarte closed 7 months ago

josemduarte commented 8 months ago

@piehld this is my initial attempt. Note this requires changes in rcsb.utils.seqalign that I'll PR separately.

Right now, I'd appreciate some early feedback about this.

For instance some questions:

Questions for later:

piehld commented 8 months ago

Thanks @josemduarte! Here are my immediate responses to your questions:

variable camel case vs snake case: my IDE has PEP8 which requires snake case, but I see camel case already there. Is there strong feelings about this?

Yes, in all of the ExDB code we expect camelCasing (specifically, all classes should start with a capital letter and everything else starts with a lowercase letter). We have pylinting and formatting setup to check for that.

headers and other ancillary things. I copy pasted the headers from another file. Should I follow some more strict guidelines?

The headers are less important, but looks like you followed the standard so I think it's good.

version numbers: are they set manually? or is it part of the CI pipeline

Yes they need to be updated manually. There are two files for this:

You'll need to do the same for the py-rcsb_utils_seqalign package.

how to test this? The issue is it depends on an unreleased rcsb.utils.seqalign package. Any tricks for that?

There are two ways to test your updates—manually (locally) or automatically on Azure. Testing them locally allows you to test multiple development packages at once, without them needing to be released. You can do so by installing each package locally using the editable mode (i.e., go into the package directory and type pip install -e .). Then, you can either go into the test directory and run the test scripts directly, or by running tox (see the note at https://github.com/rcsb/py-rcsb_db/blob/master/README.md#installation).

To be honest, I sometimes have difficulty running the test suite for py-rcsb_db locally, as it requires several fixture scripts to be run first, which kind of set things up (e.g., setting up a fake mongo instance, etc.). It can certainly be done, but just a warning you might have to do a bit of work to get there.

To get Azure to test things automatically, you simply need to create a PR as you've already done here. In fact, you can already see the results of your builds here: https://dev.azure.com/rcsb/RCSB%20PDB%20Python%20Projects/_build

Generally, for your situation, if it's possible to complete the seqalign package work separately first (including adding tests suites to confirm it does what you want it to do), then I'd recommend doing that first, publishing that PR when ready, and then resume work on this package. That way, Azure will pick up the latest version of seqalign to run your tests.

Otherwise, if the two are too tightly coupled, then you'll need to run the tests locally as I described above. I can also show you how to run the weekly update workflow using these code changes too, which would probably be a good idea before finalizing everything.

piehld commented 7 months ago

Hi @josemduarte, @brindakv and I were talking this morning and were considering whether this addition would better belong in py-rcsb_exdb, as you and I had contemplated last Friday. I think we should chat about this tomorrow.

Also, just FYI, it looks like there is still one remaining linting issue:

rcsb/db/cli/MmseqsSequenceClustersExec.py:26:186: E501 line too long (188 > 185 characters)
josemduarte commented 7 months ago

As mentioned in RO-4226, I'm going to close this in favor of https://github.com/rcsb/rcsb-redwood/pull/131 . The branch will still be there in case we want to come back to it.