kordk / torch-ecpg

(GPU accelerated) eCpG mapper
BSD 3-Clause "New" or "Revised" License
2 stars 0 forks source link

Update Cis and Distal Region Analysis Logic #37

Open liamgd opened 8 months ago

liamgd commented 8 months ago

The current implementation of all and trans ranges does not need modification.

The logic for cis and distal regions are the same. The analyses utilize the parameters window_base, downstream, and upstream, all measured in bases. The two region modes only differ by the default values for these specifications. The strand information in the gene annotation file is used to determine which direction to orient the methylation region.

Let $G$ be the transcript start site of the gene, $M$ be the starting position of the methylation locus, $b$ be the window base, $u$ be the upstream bases, and $d$ be the downstream bases.

For "+" strands, the regression is included in the analysis if $G > M + b - u$ and $G < M + b + d$ are both true, and $G$ and $M$ lie on the same chromosome. For "-" strands, it is included if $G < M - b + u$ and $G > M - b - d$ are both true and again, they lie on the same chromosome. We can subtract $M$ from each side to obtain a new set of four comparisons. Then, for the "-" strand, we replace the following: $-b-d=-(b+d)$ and $-b+u=-(b-u)$. Now, the "+" and "-" strand comparisons only differ by a factor of $-1$.

Now, we can let $s$ be $1$ for the "+" strand and $-1$ for the minus strand, and we have this logic: the regression is included if $s(b - u) < G - M$ and $G - M < s(b + d)$ and $G$ and $M$ are on the same chromosome.

liamgd commented 8 months ago

The fastest implementation of this logic that I have found so far is:

region_indices = (
    (G_chrom_t[index - 1, None] == M_chrom_t)
    .logical_and(
        G_strand_t[index - 1, None]
        * (window_base - upstream)
        < G_pos_t[index - 1, None] - M_pos_t
    )
    .logical_and(
        G_pos_t[index - 1, None] - M_pos_t
        < (
            G_strand_t[index - 1, None]
            * (window_base + downstream)
        )
    )
)

The tensor indexing [index - 1, None] leads to the tensor's shape being suitable for broadcasting to the shape of the M tensors.

liamgd commented 8 months ago

Implemented in bbdcbee and pending testing.

kordk commented 5 months ago

Thanks for the update. Please proceed with testing.