qiime2 / q2-types

BSD 3-Clause "New" or "Revised" License

ENH: add `ProfileHMM[*]` semantic types #328

Closed by Sann5 5 months ago

Sann5 commented 6 months ago

Closes #327.

Adds new semantic types for profile hidden Markov models as implemented in HMMER, plus tests and test data.

gregcaporaso commented 6 months ago

@misialq, would you be able to review this one?

Update: @colinvwood is going to try to take a pass through this and merge today, so it's in the prepare this week.

colinvwood commented 6 months ago

Hey @misialq, I didn't have time to look at this today. If you have time to look at this tomorrow, just let me know; otherwise I'll plan to look at it tomorrow. Excuse all the pings 🥸

misialq commented 6 months ago

Hey @gregcaporaso, @colinvwood - sure thing, I already had a glance - there are some significant changes which I proposed to @Sann5 so please do not review yet - the contents will likely change. We'll ping you when ready, thanks! 🙏

gregcaporaso commented 6 months ago

Good to know, thanks @misialq. I converted this to a Draft pull request. Since we have the release next week, I'm going to bump this to the project board for the next release - let us know if it'll be an issue to not have this in 2024.5.

misialq commented 6 months ago

Hey @gregcaporaso, thanks! No, I don't think it's an issue if we don't have it in 2024.5. We will probably want to test it out a bit together with our new moshpit action for eggnog so we may need some more time anyway :)

misialq commented 5 months ago

Hey @Sann5, what's up with the two failing tests?

Sann5 commented 5 months ago

> Hey @Sann5, what's up with the two failing tests?

@misialq I opened an issue in pyhmmer about how uninformative the error was when loading a file with mixed profiles (DNA, RNA, protein). They already fixed it and pushed the patch to conda. I will update the error handling here accordingly.

Sann5 commented 5 months ago

> Hey @Sann5, LGTM, thanks! If it's not too much trouble, do you think you could attach here this nice table you presented once in our meeting - it may be helpful in understanding what all the formats do 🙏
>
> @lizgehret do you think you could check this out? :)

Sure thing!

Profile HMMs

How are they used

They are typically used as follows:

  1. You take a group of sequences that are known to be related (e.g. a protein family).
  2. You build a profile HMM from the alignment of these sequences (called a seed alignment).
  3. You use the profile HMM to estimate the probability that a sequence "belongs" to this family (i.e. search for homologs).

One can also use profile HMMs to do sequence annotation or alignment.
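
The three steps above can be sketched with a toy example. This is a deliberate simplification and not how HMMER works internally: it keeps only per-column emission probabilities (match states), assumes an ungapped seed alignment, and scores against a uniform background. Real profile HMMs also model insertions, deletions, and transition probabilities. The names `build_profile` and `log_odds` and the example sequences are made up for illustration.

```python
import math
from collections import Counter

AMINO = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def build_profile(alignment, pseudocount=1.0):
    """Estimate per-column residue probabilities from an ungapped alignment."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    total = len(alignment) + pseudocount * len(AMINO)
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment)
        # The pseudocount keeps unseen residues from getting probability zero.
        profile.append({aa: (counts[aa] + pseudocount) / total for aa in AMINO})
    return profile


def log_odds(profile, sequence):
    """Score a sequence (in bits) against the profile vs. a uniform background."""
    if len(sequence) != len(profile):
        raise ValueError("sequence length must match the profile length")
    background = 1.0 / len(AMINO)
    return sum(
        math.log2(column[aa] / background)
        for column, aa in zip(profile, sequence)
    )


# Step 1: related sequences (a made-up seed alignment).
seed = ["MKVLA", "MKILA", "MRVLA", "MKVLG"]
# Step 2: build the profile from the alignment.
profile = build_profile(seed)
# Step 3: score candidates; higher means "more like the family".
member_score = log_odds(profile, "MKVLA")
outsider_score = log_odds(profile, "WWWWW")
```

A sequence resembling the seed alignment gets a positive score, while an unrelated one scores below zero, which is the intuition behind using a profile to search for homologs.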

How are they stored

Profile HMMs differ by sequence type (e.g. DNA, RNA, and protein). HMMER, the go-to software for biological sequence analysis with profile HMMs, saves profiles as text (or binary) files. One file can contain one or more profiles, each representing a group of sequences. However, no valid file can contain profiles of more than one sequence type. Some HMMER programs take files with multiple profiles as input, while others take a file with a single profile.
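
As a rough illustration of the "one sequence type per file" rule, here is a minimal sketch that inspects HMMER3 text-format content. It assumes, per the HMMER3 ASCII format, that each profile header carries an `ALPH` line (amino, DNA, or RNA, case insensitive) and that each profile ends with a `//` line. The function names are hypothetical and the example records are abbreviated headers, not complete valid profiles. This is not the validation code in this PR.

```python
def summarize_hmm_text(text):
    """Return (number of profiles, set of alphabets) found in HMMER3 text."""
    n_profiles = 0
    alphabets = set()
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "ALPH" and len(fields) > 1:
            # Alphabet names are case insensitive, so normalize them.
            alphabets.add(fields[1].lower())
        elif fields[0] == "//":
            # A line holding only "//" terminates one profile record.
            n_profiles += 1
    return n_profiles, alphabets


def check_single_alphabet(text):
    """Raise ValueError if the content mixes sequence types."""
    n_profiles, alphabets = summarize_hmm_text(text)
    if len(alphabets) > 1:
        raise ValueError(f"mixed alphabets in one file: {sorted(alphabets)}")
    return n_profiles, alphabets


# Abbreviated, made-up records used only to exercise the functions.
protein = "HMMER3/f [example]\nNAME  famA\nLENG  3\nALPH  amino\n//\n"
dna = "HMMER3/f [example]\nNAME  famB\nLENG  3\nALPH  DNA\n//\n"

two_protein_profiles = summarize_hmm_text(protein + protein)
```

Concatenating `protein` and `dna` and passing the result to `check_single_alphabet` raises a `ValueError`, mirroring the rule that a valid file cannot mix sequence types.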

The proposal

To accommodate the different things that these profiles represent, as well as future use cases, this PR proposes the following semantic types.

| | Protein | DNA | RNA |
|---|---|---|---|
| Single Profile | `ProfileHMM[SingleProtein]` | `ProfileHMM[SingleDNA]` | `ProfileHMM[SingleRNA]` |
| Multiple Profiles | `ProfileHMM[MultipleProtein]` | `ProfileHMM[MultipleDNA]` | `ProfileHMM[MultipleRNA]` |
| Multiple Profiles in Binary + Indexed | `ProfileHMM[PressedProtein]` | `ProfileHMM[PressedDNA]` | `ProfileHMM[PressedRNA]` |
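
The naming scheme in the table combines a file layout with a sequence type. A small sketch of that mapping follows; `profile_hmm_type` and the lookup keys are hypothetical helpers for illustration only and are not part of q2-types.

```python
# Layout of the file -> prefix used in the type name.
LAYOUTS = {"single": "Single", "multiple": "Multiple", "pressed": "Pressed"}
# Sequence type (using HMMER's alphabet names as keys) -> suffix.
ALPHABETS = {"amino": "Protein", "dna": "DNA", "rna": "RNA"}


def profile_hmm_type(layout, alphabet):
    """Build the semantic type name for a layout and sequence type."""
    prefix = LAYOUTS[layout.lower()]
    suffix = ALPHABETS[alphabet.lower()]
    return f"ProfileHMM[{prefix}{suffix}]"


pressed_dna = profile_hmm_type("pressed", "DNA")
all_names = [profile_hmm_type(l, a) for l in LAYOUTS for a in ALPHABETS]
```

Iterating over both dictionaries yields the nine type names listed in the table.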