timbitz / Whippet.jl

Lightweight and Fast; RNA-seq quantification at the event-level
MIT License

Complexity Nomenclature change and Shannon's Entropy #24

Closed timbitz closed 8 years ago

timbitz commented 8 years ago

I propose to put the character 'V' in front of complexity values, so that we can describe complexity as the log2 of the number of spliced Variants in an event (V1, V2, V3). @weatheritt2 and @kcha, is this OK with you?

Additionally I will add a second complexity column as the Entropy of the event.
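For readers following along, here is a minimal sketch of what an entropy column could measure. It assumes the entropy is Shannon's entropy, in bits, over the relative abundances of the paths through an event; the thread does not show how Whippet actually computes it, and the function name `shannon_entropy_bits` is hypothetical.

```python
import math


def shannon_entropy_bits(weights):
    """Shannon entropy (in bits) of a set of non-negative path weights.

    The weights are normalized to proportions first, so read counts or
    already-normalized abundances both work.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    entropy = 0.0
    for w in weights:
        if w > 0:  # a path with zero weight contributes nothing
            p = w / total
            entropy -= p * math.log2(p)
    return entropy
```

Under this definition, an event whose paths are all equally used reaches the maximum of log2(number of paths) bits, and an event dominated by a single path approaches 0 bits.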

timbitz commented 8 years ago

Alternatively, 'N' makes sense as well, as the number of exons in theory increases linearly.

weatheritt2 commented 8 years ago

I feel both are overly obtuse. Variants make me think of SNPs and "N" makes me think of "NA".

timbitz commented 8 years ago

What about {S1, S2, S3} for Splicing Complexity 1, 2, 3? Alternatively we could change from discrete labels into floating point values as the actual log2( number of isoforms per event ) and just have the column header say Complexity.

kcha commented 8 years ago

I think I would prefer discrete labels for simplicity. But I get the desire for a more specific continuous value.

On Aug 12, 2016 5:35 PM, "Tim Sterne-Weiler" notifications@github.com wrote:

What about {S1, S2, S3} for Splicing Complexity 1, 2, 3? Alternatively we could change from discrete labels into floating point values as the actual log2( number of isoforms per event ) and just have the column header say Complexity.


timbitz commented 8 years ago

OK @kcha discrete it is. @weatheritt2 what about K for minimum number of nodes necessary to express that many splice variants (I actually like this one a lot)... {K1,K2,K3}
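A small sketch of how the discrete label and the continuous alternative discussed above relate. It assumes that each node contributes a binary include/exclude choice, so that k nodes can express at most 2^k splice variants and K is the smallest such k; this reading is inferred from the thread and not checked against the Whippet source. Both function names are hypothetical.

```python
import math


def complexity_label(n_paths, prefix="K"):
    """Discrete label: smallest k such that 2**k >= n_paths (assumed rule)."""
    if n_paths < 1:
        raise ValueError("an event needs at least one path")
    # (n - 1).bit_length() equals ceil(log2(n)) for every integer n >= 1,
    # and avoids floating-point rounding near exact powers of two.
    k = (n_paths - 1).bit_length()
    return f"{prefix}{k}"


def continuous_complexity(n_paths):
    """Continuous alternative: log2 of the number of paths."""
    if n_paths < 1:
        raise ValueError("an event needs at least one path")
    return math.log2(n_paths)
```

With this rule, events with 3 or 4 paths both get the label K2, while the continuous value distinguishes them (about 1.58 versus 2.0).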

timbitz commented 8 years ago

As of 058be3a, K is used as the complexity char, and entropy in bits follows as a Float64. The psi output file is ordered like this:

Gene  Node  Coord       Strand  Type  Complexity  Entropy  Psi     CI_Width  CI_Lo,Hi      Total_Reads  Inc_Paths                            Exc_Paths
one   2     chr0:21-30  +       RI    K2          1.95     0.3077  0.495     0.123,0.618   7.0          IntSet([1, 2])                       IntSet([1, 3]),IntSet([1, 5, 6]),IntSet([1, 7])
one   3     chr0:31-40  +       CE    K2          1.53     0.4444  0.58      0.177,0.757   5.0          IntSet([1, 3])                       IntSet([1, 5, 6]),IntSet([1, 7])
one   4     chr0:51-53  +       AA    K2          1.53     0.2222  0.5715    0.0665,0.638  4.0          IntSet([4, 5, 6])                    IntSet([1, 5, 6]),IntSet([1, 7])
one   5     chr0:54-62  +       CE    K2          1.53     0.6666  0.563     0.317,0.88    5.0          IntSet([1, 5, 6]),IntSet([4, 5, 6])  IntSet([1, 7])
one   6     chr0:63-65  +       AD    K2          1.53     0.6666  0.717     0.199,0.916   2.0          IntSet([1, 5, 6]),IntSet([4, 5, 6])  IntSet([1, 7])
one   7     chr0:76-85  +       TE    NA          NA       NA      NA        NA            NA           NA                                   NA
one   8     chr0:86-90  +       TE    NA          NA       NA      NA        NA            NA           NA                                   NA

timbitz commented 8 years ago

Unless there are any objections to this format, I think I am going to close this and merge/release v0.3.
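To make the proposed format concrete, the following is a rough sketch of reading one row of the psi output shown above. It assumes the file is tab-delimited, that the columns appear in the order listed at this point in the thread, and that paths are printed as `IntSet([...])` literals separated by commas. The names `PSI_COLUMNS` and `parse_psi_row` are hypothetical and not part of Whippet.

```python
import re

# Column order as posted in the thread at this point (an assumption for any real file).
PSI_COLUMNS = [
    "Gene", "Node", "Coord", "Strand", "Type", "Complexity", "Entropy",
    "Psi", "CI_Width", "CI_Lo,Hi", "Total_Reads", "Inc_Paths", "Exc_Paths",
]

_INTSET = re.compile(r"IntSet\(\[([^\]]*)\]\)")


def _parse_float(field):
    return None if field == "NA" else float(field)


def _parse_paths(field):
    """Turn 'IntSet([1, 3]),IntSet([1, 7])' into [[1, 3], [1, 7]]."""
    if field == "NA":
        return None
    return [
        [int(x) for x in body.split(",") if x.strip()]
        for body in _INTSET.findall(field)
    ]


def parse_psi_row(line, columns=PSI_COLUMNS):
    """Parse one tab-delimited data row into a dict with typed values."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(columns):
        raise ValueError(f"expected {len(columns)} fields, got {len(fields)}")
    row = dict(zip(columns, fields))
    row["Node"] = int(row["Node"])
    for name in ("Entropy", "Psi", "CI_Width", "Total_Reads"):
        row[name] = _parse_float(row[name])
    if row["Complexity"] == "NA":
        row["Complexity"] = None
    if row["CI_Lo,Hi"] == "NA":
        row["CI_Lo,Hi"] = None
    else:
        lo, hi = row["CI_Lo,Hi"].split(",")
        row["CI_Lo,Hi"] = (float(lo), float(hi))
    row["Inc_Paths"] = _parse_paths(row["Inc_Paths"])
    row["Exc_Paths"] = _parse_paths(row["Exc_Paths"])
    return row
```

Passing `columns` explicitly keeps the sketch usable if the column order changes, since values are looked up by name rather than by position.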

weatheritt2 commented 8 years ago

Why do you think stating it in terms of the minimum number of nodes would be less esoteric? That said, as long as it is clearly described in the manual/help, it should be fine.

timbitz commented 8 years ago

Yeah, it is still a little esoteric, but so is 'C' for complexity, which could mean a bunch of different things and most likely already comes wrapped in some preconceived (and most likely incorrect) notion of what it is or should be. Also, whenever the 'K' values are introduced they will be matched with those little diagrams, which are very straightforward.

@weatheritt2 is the order of columns acceptable? I would like to make any changes to the format of the .psi files sooner rather than later.

weatheritt2 commented 8 years ago

I'd move the columns related to PSI further to the left, as they will be most familiar to people. Otherwise, looks fine.

timbitz commented 8 years ago

OK, columns moved and diff updated. The order is now:

Gene  Node  Coord  Strand  Type  Psi  CI_Width  CI_Lo,Hi  Total_Reads  Complexity  Entropy  Inc_Paths  Exc_Paths

I am going to add some docs and merge.
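For anyone holding output written in the earlier column order, a hedged sketch of remapping a row to the final order follows. `OLD_ORDER`, `NEW_ORDER`, and `reorder_fields` are hypothetical names for illustration, and the sketch assumes each row has already been split into its 13 fields.

```python
# Column order first posted in this thread.
OLD_ORDER = [
    "Gene", "Node", "Coord", "Strand", "Type", "Complexity", "Entropy",
    "Psi", "CI_Width", "CI_Lo,Hi", "Total_Reads", "Inc_Paths", "Exc_Paths",
]

# Column order after the PSI-related columns were moved to the left.
NEW_ORDER = [
    "Gene", "Node", "Coord", "Strand", "Type", "Psi", "CI_Width",
    "CI_Lo,Hi", "Total_Reads", "Complexity", "Entropy", "Inc_Paths",
    "Exc_Paths",
]


def reorder_fields(fields, old_order=OLD_ORDER, new_order=NEW_ORDER):
    """Rearrange one row's fields from old_order into new_order."""
    if len(fields) != len(old_order):
        raise ValueError(f"expected {len(old_order)} fields, got {len(fields)}")
    if sorted(old_order) != sorted(new_order):
        raise ValueError("old and new orders must name the same columns")
    by_name = dict(zip(old_order, fields))
    return [by_name[name] for name in new_order]
```

Mapping by column name rather than by hard-coded index means the same helper still works if the order is changed again.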