scikit-bio / scikit-bio

scikit-bio: a community-driven Python library for bioinformatics, providing versatile data structures, algorithms and educational resources.
https://scikit.bio
BSD 3-Clause "New" or "Revised" License
892 stars 268 forks

add RNA (secondary) structure related functionalities #456

Open RNAer opened 10 years ago

RNAer commented 10 years ago

I will need these soon. I can check the cogent code and add structure classes/functions to skbio. Do we want this as a module under skbio or something else?

gregcaporaso commented 10 years ago

Which specific classes/functions are you thinking of from cogent?

RNAer commented 10 years ago

I just checked cogent/struct/rna2d.py; it is not exactly what I have in mind.

I am thinking of an RNAStructure class that has an RNASequence and a structure string as member variables, and has functions for dissecting structures, calculating deltaG, and more.

@squirrelo, what do you think?

jairideout commented 10 years ago

What about adding the structure info to RNASequence (making it optional of course)?

gregcaporaso commented 10 years ago

Or making a StructuredRNASequence object?

squirrelo commented 10 years ago

I like the idea of a StructuredRNASequence object, although I'm not sure that dG calculations, or anything else done by an outside tool such as Vienna, should be part of the object itself rather than a separate wrapper. I've already needed to compute tree edit distance and similar measures between RNA structures, so if we do have this object and can implement things like converting the structure to tree form, or other pure-Python manipulations, that could be cool.
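
To make the tree idea concrete, here is a minimal pure-Python sketch of turning a dot-bracket string into a nested tree, which is the kind of form a tree edit distance could work on. The names `StructureNode` and `dot_bracket_to_tree` are hypothetical (they are not existing skbio or cogent APIs), and the sketch assumes plain `(`, `)`, `.` input with no pseudoknots.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class StructureNode:
    """One node of the structure tree (hypothetical, not an skbio class).

    A node with ``pair`` set represents a base pair (i, j). A node with
    ``index`` set represents an unpaired position. The root has neither.
    """
    pair: Optional[Tuple[int, int]] = None
    index: Optional[int] = None
    children: List["StructureNode"] = field(default_factory=list)


def dot_bracket_to_tree(structure: str) -> StructureNode:
    """Convert a pseudoknot-free dot-bracket string into a nested tree."""
    root = StructureNode()
    # Each stack entry is (node, position of the bracket that opened it).
    stack = [(root, -1)]
    for i, symbol in enumerate(structure):
        if symbol == "(":
            node = StructureNode()
            stack[-1][0].children.append(node)
            stack.append((node, i))
        elif symbol == ")":
            if len(stack) == 1:
                raise ValueError(f"unmatched ')' at position {i}")
            node, start = stack.pop()
            node.pair = (start, i)
        elif symbol == ".":
            stack[-1][0].children.append(StructureNode(index=i))
        else:
            raise ValueError(f"unexpected symbol {symbol!r} at position {i}")
    if len(stack) != 1:
        raise ValueError(f"unmatched '(' at position {stack[-1][1]}")
    return root
```

For example, `dot_bracket_to_tree("((..)).")` gives a root with two children: the pair (0, 5), which in turn contains the pair (1, 4) with unpaired positions 2 and 3 beneath it, and the unpaired position 6.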

rob-knight commented 10 years ago

There are cases where you want the structure independent of a sequence, e.g. for counting distinct structures or for designing a sequence to fit a specified structure. Also remember that one sequence can have many structures. In fact, one of the key flaws of RNAML was nesting structure within sequence: you could not have a table of data for x sequences by y shared structures without repeating the structure data for each sequence.

Rob
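
A rough sketch of what "x sequences by y shared structures" could look like when each structure is stored only once. `StructureTable` and its methods are hypothetical names chosen for illustration, not anything from skbio, cogent, or RNAML; structures are kept as plain strings and sequences are referred to by ID only.

```python
from collections import defaultdict


class StructureTable:
    """Hypothetical many-to-many store of sequence IDs and structures."""

    def __init__(self):
        self._structure_ids = {}               # structure string -> integer id
        self._structures = []                  # integer id -> structure string
        self._by_sequence = defaultdict(set)   # sequence id -> structure ids
        self._by_structure = defaultdict(set)  # structure id -> sequence ids

    def add(self, sequence_id, structure):
        """Record that ``sequence_id`` can adopt ``structure``."""
        struct_id = self._structure_ids.get(structure)
        if struct_id is None:
            # First time this structure is seen: store it exactly once.
            struct_id = len(self._structures)
            self._structure_ids[structure] = struct_id
            self._structures.append(structure)
        self._by_sequence[sequence_id].add(struct_id)
        self._by_structure[struct_id].add(sequence_id)

    def structures_for(self, sequence_id):
        """Return the structures recorded for one sequence, sorted."""
        ids = self._by_sequence.get(sequence_id, ())
        return sorted(self._structures[i] for i in ids)

    def sequences_for(self, structure):
        """Return the sequence IDs recorded for one structure, sorted."""
        struct_id = self._structure_ids.get(structure)
        if struct_id is None:
            return []
        return sorted(self._by_structure[struct_id])

    def distinct_structure_count(self):
        """Count distinct structures, independent of any sequence."""
        return len(self._structures)
```

Because the structure lives outside any sequence object, counting distinct structures and looking up every sequence that shares a structure both work without duplicating structure data.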

gregcaporaso commented 10 years ago

Ok, thanks Rob, that's a really good point. So that argues for a separate class that has an optional list/array of RNASequence(s). The assumption you'd make about an instance of this object is that the one or more structures are associated with the one or more sequences described by the instance. Is that right?

rob-knight commented 10 years ago

Yes, see the BayesFold code (which the pycogent code is based on) for examples.

jairideout commented 10 years ago

Thanks @rob-knight @squirrelo for these details! Agree that a separate class makes sense.

> So that argues for a separate class that has an optional list/array of RNASequence(s)

Could this be a SequenceCollection of RNASequence(s)? AFAIK SequenceCollections are allowed to be empty.

gregcaporaso commented 10 years ago

Yes, I think we'd want that to be a SequenceCollection.

RNAer commented 10 years ago

Good point. It is a many-to-many relationship: one sequence can have multiple structures, and a structure can be folded from multiple sequences.

So how about a base structure class (corresponding to the BiologicalSequence class) that contains only one structure, and we start from there?

Another question is what format we should use to represent secondary structure: dot-bracket, ct, etc.? I really want pseudoknot support.
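
As a starting point for discussion, a base class holding exactly one structure might look roughly like the sketch below. `SecondaryStructure` is a hypothetical name, not an existing skbio class, and the choice to store base pairs internally while reading and writing dot-bracket is an assumption made here for illustration only.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class SecondaryStructure:
    """Hypothetical sequence-independent structure (not an skbio class)."""
    length: int
    pairs: FrozenSet[Tuple[int, int]]

    @classmethod
    def from_dot_bracket(cls, text: str) -> "SecondaryStructure":
        """Parse plain dot-bracket notation (no pseudoknot brackets)."""
        stack, pairs = [], set()
        for i, symbol in enumerate(text):
            if symbol == "(":
                stack.append(i)
            elif symbol == ")":
                if not stack:
                    raise ValueError(f"unmatched ')' at position {i}")
                pairs.add((stack.pop(), i))
            elif symbol != ".":
                raise ValueError(f"unexpected symbol {symbol!r} at position {i}")
        if stack:
            raise ValueError(f"unmatched '(' at position {stack[-1]}")
        return cls(len(text), frozenset(pairs))

    def to_dot_bracket(self) -> str:
        """Write the pairs back out; only meaningful for nested pairs."""
        symbols = ["."] * self.length
        for i, j in self.pairs:
            symbols[i], symbols[j] = "(", ")"
        return "".join(symbols)
```

Since the dataclass is frozen, instances are hashable and compare by value, so equal structures collapse to one entry in a set or dict regardless of which sequence they came from.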

rob-knight commented 10 years ago

These decisions were already explored exhaustively in the cogent versions. If you want arbitrary pseudoknot support, you need to do it with a connect list. I have cc'ed Sandra, who may be willing to talk you through some of the design decisions (and who used this code to work on pseudoknots previously).

Rob
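
For readers unfamiliar with the idea, the sketch below shows one reading of a connect list: for each position, store the index of its pairing partner (or `None` if unpaired). This can represent crossing pairs that plain dot-bracket cannot. The function names are hypothetical and this is not the cogent implementation, only an illustration.

```python
from typing import List, Optional, Sequence, Tuple


def pairs_to_partners(
    length: int, pairs: Sequence[Tuple[int, int]]
) -> List[Optional[int]]:
    """Build a connect list: partners[i] is the index paired with i, or None."""
    partners: List[Optional[int]] = [None] * length
    for i, j in pairs:
        if i == j or not (0 <= i < length and 0 <= j < length):
            raise ValueError(f"invalid pair ({i}, {j})")
        if partners[i] is not None or partners[j] is not None:
            raise ValueError(f"a position in ({i}, {j}) is already paired")
        partners[i], partners[j] = j, i
    return partners


def has_pseudoknot(partners: Sequence[Optional[int]]) -> bool:
    """Return True if any two pairs cross, i.e. i < k < j < l."""
    stack = []  # closing positions of the pairs that are currently open
    for i, j in enumerate(partners):
        if j is None:
            continue
        if j > i:
            # Position i opens a pair that will close at j.
            stack.append(j)
        elif stack.pop() != i:
            # Position i closes a pair that is not the innermost open one.
            return True
    return False
```

With pairs (0, 6), (1, 5), (3, 9) and (4, 8) on a length-10 structure, the pairs (1, 5) and (3, 9) cross, so `has_pseudoknot` reports a pseudoknot, while the connect list itself stores the structure without any special notation.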
