mortazavilab / lapa

Alternative polyadenylation detection from diverse data sources such as 3'-seq, long-read and short-reads.
https://www.biorxiv.org/content/10.1101/2022.11.08.515683v1
23 stars 13 forks source link

other PolyA motifs #14

Closed MustafaElshani closed 1 year ago

MustafaElshani commented 1 year ago

Dear @MuhammedHasan

Great work on LAPA, planning to start using it very soon. I have a question regarding the poly A motifs, as per preprint you have mentioned that LAPA looks for the canonical AATAAA, is this the only motif it searches for before determining a polyA site usage? Have you considered adding other motifs such as

aataaa
attaaa
agtaaa
tataaa
cataaa
gataaa
aatata
aataca
aataga
aaaaag
actaaa
aagaaa
aatgaa
tttaaa
aaaaca
ggggct

I have done some preliminary analysis with SQANTI3 and the distribution change between cells type from AATAAA being used 70% in one to 46% in another. The second, most used is ATTAAA. The others such as AAAAAG and AGTAAA change the most as percentages between cell types

Kind Regards Mustafa

MuhammedHasan commented 1 year ago

Hi @MustafaElshani,

Yes, LAPA looks for multiple motif and those are :

polyA_signal_seqs = [
    'AATAAA', 'ATTAAA', 'TATAAA', 'AGTAAA',
    'AATACA', 'CATAAA', 'AATATA', 'GATAAA',
    'AATGAA', 'AATAAT', 'AAGAAA', 'ACTAAA',
    'AATAGA', 'ATTACA', 'AACAAA', 'ATTATA',
    'AACAAG', 'AATAAG'
]

See https://github.com/mortazavilab/lapa/blob/master/lapa/utils/common.py

It also reports poly(A)-sites even if there is no poly(A)-signal given enough reads supporting the site.

Happy to help if you have any further questions.

MustafaElshani commented 1 year ago

That is great! thank you for the prompt reply. Couple more question I see the volcano plot reports on Δusage is that difference between distal vs proximal in 3'end usage? How would be able to find intronic polyadenylation? Is there a way to visualise IGV I guess? Does lapa determine 3'UTR length can this information be obtained from lapa analysis?

Mustafa

MuhammedHasan commented 1 year ago

Dear Mustafa,

Δusage is the difference between to conditions (myoblast and myotube in the LAPA paper)

See the output format of LAPA: https://lapa.readthedocs.io/en/latest/output.html

Output contains Feature columns which annotate if site located in UTR, exon or intron.

Also, this output file is .bed file so you can basically load it to IGV as well if you want to visualize it.

LAPA does not report 3' UTR length but report coordinates of each poly(A)-site so you can calculate it 3'UTR is elongated or shortened.

For example, you can calculate the usage change of the most distal poly(A)-site by:

from lapa.result import LapaResult

result = LapaResult('your lapa output directory')

# statistical test
# df contains Δusage between condition
df = result.fisher_exact_test({
    'condition-1': ['samples1, 'samples2'],
    'condition-2': ['samples3', 'samples4']
})

# extract coordinates from the output
chrom, pos, strand = zip(*df['polya_site'].str.split(':'))
df['Chromosome'] = chrom
df['Start'] = pos
df['Start'] = df['Start'].astype(int)
df['End'] = df['Start'] + 1
df['Strand'] = strand

# Calculate most distal poly(A)-site
df.loc[df[df['Strand'] == '+'].groupby('gene_id')['End'].idxmax(), 'distal'] = True
df.loc[df[df['Strand'] == '-'].groupby('gene_id')['Start'].idxmin(), 'distal'] = True
df['distal'] = ~df['distal'].isna()
MustafaElshani commented 1 year ago

Thank you that has been immensely helpful. I would also like to ask at what point would I be able to get gene_symbol/gene_name so that in the volcano plot I can show those rather gene_id. Is this in the output or should I just get this in bioMart or equivalent

Mustafa

MuhammedHasan commented 1 year ago

Please use your favored tool (bioMart .etc) to map gene_id to gene_name. LAPA reports gene_id, not gene_name.

baishengjun commented 10 months ago

Dear @MuhammedHasan

I am confused with the lapa output. How to determine the distal and proximal poly(A) sites ? In my mind, if poly(A) sites located in 3'UTR, it means proximal poly(A) sites, elsewise it is distal poly(A) sites? Another question, how to calculate 3'UTR is elongated or shortened? Does directly compare the poly(A) sites with annotated_site output from lapa?

Thanks a lot, Bai