loosolab / UROPA

Universal RObust Peak Annotator
https://uropa-manual.readthedocs.io/
MIT License

peak annotation towards intergenic and intron #20

Open deep-buddingcoder opened 1 year ago

deep-buddingcoder commented 1 year ago

Hi,

This is not a technical issue but a conceptual question.

Until now, I have been using HOMER for ChIP-seq and ATAC-seq peak annotation. As I intend to perform a TOBIAS DNA footprinting analysis, I have decided to generate the peak annotation file (required for TOBIAS BINDetect) using UROPA. For this, I plan to use the NCBI RefSeq GTF file (which is also used by HOMER). Here is the link: http://hgdownload.soe.ucsc.edu/goldenPath/archive/hg38/ncbiRefSeq/000001405.40-RS_2023_03/

Column 3 of the GTF does not contain any information about promoter, intron, intergenic or intergenic_CNS (conserved non-coding sequence).

Yet HOMER manages to produce detailed annotation for promoters, introns, intergenic regions, different types of RNA, etc.

Will UROPA also produce detailed annotation using NCBI RefSeq, or am I working with an incorrect GTF file?

Thanks in anticipation for your help.

msbentsen commented 1 year ago

Hi @deep-buddingcoder,

The .gtf file you are using looks fine. UROPA does not automatically find the details about promoters etc., but you can set up specific queries for that, for example as seen in the example config file here: https://github.com/loosolab/UROPA/blob/master/sample_config.json. The example shows promoters, forward exons or levels, but you can add introns, intergenic regions etc. as well, depending on the setup.

In that way, the promoter-information is not given in the third column, but you set it yourself in the UROPA run. Hope that makes sense.
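Since each query's "feature" refers to a value in the third column of the GTF, it can help to list which feature types the file actually contains before writing queries. The following is only a minimal sketch, not part of UROPA: the helper names `open_text` and `count_feature_types` are made up for illustration, and it assumes a standard tab-separated GTF (plain or gzip-compressed).

```python
import gzip
from collections import Counter


def open_text(path):
    """Open a plain or gzip-compressed text file for reading (hypothetical helper)."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def count_feature_types(path):
    """Count how often each feature type (GTF column 3) occurs in the file."""
    counts = Counter()
    with open_text(path) as handle:
        for line in handle:
            # Skip header/comment lines and empty lines.
            if line.startswith("#") or not line.strip():
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 3:
                counts[fields[2]] += 1
    return counts


# Example usage (the file name is a placeholder):
# for feature, n in count_feature_types("annotation.gtf.gz").most_common():
#     print(feature, n)
```

Any value that shows up in this count can be used as a "feature" in a query; anything that does not show up (for example a promoter) has to be defined through the query settings instead.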

deep-buddingcoder commented 1 year ago

Thanks for the suggestion. I had definitely missed this information about the config file structure. I will work on it and then update the status of this issue.

samuelruizperez commented 1 year ago

For introns and intergenic regions, you could also first run the following AGAT scripts:

# Add explicit intron features to the annotation
agat_sp_add_introns.pl \
    -f hg38.000001405.40-RS_2023_03.ncbiRefSeq.gtf \
    --out hg38.000001405.40-RS_2023_03.ncbiRefSeq.wIntrons.gff3

# Add explicit intergenic_region features between genes
agat_sp_add_intergenic_regions.pl \
    -f hg38.000001405.40-RS_2023_03.ncbiRefSeq.wIntrons.gff3 \
    --out hg38.000001405.40-RS_2023_03.ncbiRefSeq.wIntrons.wIntergenic.gff3

# Merge main annotation with other features (promoter, enhancer, RNAs annotations, etc.)
agat_sp_merge_annotations.pl \
    -f hg38.000001405.40-RS_2023_03.ncbiRefSeq.wIntrons.wIntergenic.gff3 \
    -f hg38.enhancers.gtf \
    -f hg38.rnas.gff \
    --out hg38.merged.gff3

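# Convert the merged GFF3 back to GTF
agat_convert_sp_gff2gtf.pl \
    --gff hg38.merged.gff3 \
    --gtf_version relax \
    --out hg38.merged.gtf

# Drop comment lines and sort by chromosome, then by start coordinate
grep -v "^#" hg38.merged.gtf | sort -k1,1 -k4,4n \
    > hg38.merged.sorted.gtf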

And then use intron and intergenic_region (or other merged features) as independent features in the uropa_config.json file:

{
    "queries":[
        {"name": "inferred_TSS_promoter", "feature":"gene", "feature.anchor": "start", "distance":[1000,100], "internals":"True", "direction":"upstream"},
        {"name": "inferred_TTS", "feature":"gene", "feature.anchor": "end", "distance":[100,1000], "internals":"True", "direction":"downstream"},
        {"name": "cds", "feature":"CDS", "distance":[1,1], "internals":"True"},
        {"name": "five_prime_utr", "feature":"five_prime_UTR", "distance":[1,1], "internals":"True"},
        {"name": "three_prime_utr", "feature":"three_prime_UTR", "distance":[1,1], "internals":"True"},
        {"name": "exonic", "feature":"exon", "distance":[1,1], "internals":"True"},
        {"name": "intronic", "feature":"intron", "distance":[1,1], "internals":"True"},
        {"name": "intergenic", "feature":"intergenic_region", "distance":[1,1], "internals":"True"}
    ],
    "priority": "True",
    "gtf": "hg38.merged.sorted.gtf",
    "bed": "your.bed"
}
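
Before running UROPA, it may be worth checking that every "feature" named in the config really occurs in column 3 of the sorted GTF, since a name that is missing from the file (for example if an AGAT step was skipped) has nothing to match against. Below is a rough sketch of such a check. The function names and the tiny in-memory GTF and config are invented for illustration and are not part of UROPA or AGAT; queries without a "feature" key are skipped.

```python
import json
from collections import Counter


def gtf_feature_types(gtf_lines):
    """Count the feature types (column 3) in an iterable of GTF lines."""
    counts = Counter()
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:
            counts[fields[2]] += 1
    return counts


def missing_query_features(config, feature_counts):
    """Return the names of queries whose "feature" is absent from the GTF."""
    missing = []
    for query in config.get("queries", []):
        feature = query.get("feature")
        if feature is None:
            continue  # no feature restriction in this query
        # Accept a single name or a list of names (defensive assumption).
        features = feature if isinstance(feature, list) else [feature]
        if not any(f in feature_counts for f in features):
            missing.append(query.get("name", "<unnamed>"))
    return missing


def gtf_line(feature, start, end):
    """Build one made-up GTF record for the demo below."""
    return "\t".join(
        ["chr1", "demo", feature, str(start), str(end), ".", "+", ".", 'gene_id "A";']
    ) + "\n"


# Made-up example data: the GTF lacks intergenic_region on purpose.
EXAMPLE_GTF = [
    "# demo header\n",
    gtf_line("gene", 100, 900),
    gtf_line("exon", 100, 300),
    gtf_line("intron", 301, 599),
    gtf_line("exon", 600, 900),
]

EXAMPLE_CONFIG = json.loads("""
{
  "queries": [
    {"name": "exonic", "feature": "exon", "distance": [1, 1]},
    {"name": "intronic", "feature": "intron", "distance": [1, 1]},
    {"name": "intergenic", "feature": "intergenic_region", "distance": [1, 1]}
  ]
}
""")

print(missing_query_features(EXAMPLE_CONFIG, gtf_feature_types(EXAMPLE_GTF)))
```

For a real run, the GTF lines could come from `open("hg38.merged.sorted.gtf")` and the config from `json.load(open("uropa_config.json"))`; an empty result would suggest that all queried feature names are present in the file.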