simon-anders / htseq

HTSeq is a Python library to facilitate processing and analysis of data from high-throughput sequencing (HTS) experiments.
https://htseq.readthedocs.io/en/release_0.11.1/
GNU General Public License v3.0
122 stars 77 forks source link

how to merge exons parts to actual exons? #47

Closed komalsrathi closed 6 years ago

komalsrathi commented 6 years ago

Hi,

I have a bunch of samples from various tumor types in which I want to get the exon level expression of the gene IDO1. My aim is not to do any differential expression of any sort because there are no control samples. I prepared the gencode v23 gtf to gff using the following command:

python2.7 dexseq_prepare_annotation.py gencode.v23.annotation.gtf gencode.v23.annotation.gff

This gave me 33 exonic parts for the gene IDO1:

chr8    dexseq_prepare_annotation.py    aggregate_gene  39902275    39928444    .   +   .   gene_id "ENSG00000131203.12"
chr8    dexseq_prepare_annotation.py    exonic_part 39902275    39902370    .   +   .   transcripts "ENST00000518804.5"; exonic_part_number "001"; gene_id "ENSG00000131203.12"
chr8    dexseq_prepare_annotation.py    exonic_part 39902371    39902374    .   +   .   transcripts "ENST00000518804.5+ENST00000519154.5"; exonic_part_number "002"; gene_id "ENSG00000131203.12"
chr8    dexseq_prepare_annotation.py    exonic_part 39902375    39902379    .   +   .   transcripts "ENST00000518804.5+ENST00000522495.5+ENST00000519154.5"; exonic_part_number "003"; gene_id "ENSG00000131203.12"
chr8    dexseq_prepare_annotation.py    exonic_part 39902380    39902448    .   +   .   transcripts "ENST00000518804.5+ENST00000522840.1+ENST00000522495.5+ENST00000519154.5"; exonic_part_number "004"; gene_id "ENSG00000131203.12"
chr8    dexseq_prepare_annotation.py    exonic_part 39908989    39909030    .   +   .   transcripts "ENST00000522840.1"; exonic_part_number "005"; gene_id "ENSG00000131203.12"
chr8    dexseq_prepare_annotation.py    exonic_part 39909179    39909348    .   +   .   transcripts "ENST00000518804.5+ENST00000522840.1"; exonic_part_number "006"; gene_id "ENSG00000131203.12"
chr8    dexseq_prepare_annotation.py    exonic_part 39913001    39913148    .   +   .   transcripts "ENST00000522495.5"; exonic_part_number "007"; gene_id "ENSG00000131203.12"
chr8    dexseq_prepare_annotation.py    exonic_part 39913284    39913402    .   +   .   transcripts "ENST00000518237.5"; exonic_part_number "008"; gene_id "ENSG00000131203.12"
chr8    dexseq_prepare_annotation.py    exonic_part 39913403    39913773    .   +   .   transcripts "ENST00000518237.5+ENST00000519154.5"; exonic_part_number "009"; gene_id "ENSG00000131203.12"
chr8    dexseq_prepare_annotation.py    exonic_part 39913774    39913868    .   +   .   transcripts "ENST00000518237.5+ENST00000522840.1+ENST00000519154.5"; exonic_part_number "010"; gene_id "ENSG00000131203.12"
chr8    dexseq_prepare_annotation.py    exonic_part 39913869    39913888    .   +   .   transcripts "ENST00000518237.5+ENST00000522840.1+ENST00000253513.11+ENST00000519154.5"; exonic_part_number "011"; gene_id "ENSG00000131203.12"
chr8    dexseq_prepare_annotation.py    exonic_part 39913889    39914009    .   +   .   transcripts "ENST00000518804.5+ENST00000522495.5+ENST00000519154.5+ENST00000518237.5+ENST00000522840.1+ENST00000253513.11"; exonic_part_number "012"; gene_id "ENSG00000131203.12"
chr8    dexseq_prepare_annotation.py    exonic_part 39917875    39917884    .   +   .   transcripts "ENST00000518804.5+ENST00000522495.5+ENST00000519154.5+ENST00000518237.5+ENST00000522840.1+ENST00000253513.11"; exonic_part_number "013"; gene_id "ENSG00000131203.12"
chr8    dexseq_prepare_annotation.py    exonic_part 39917885    39917929    .   +   .   transcripts "ENST00000518237.5+ENST00000522495.5+ENST00000519154.5+ENST00000518804.5+ENST00000521636.1+ENST00000522840.1+ENST00000253513.11"; exonic_part_number "014"; gene_id "ENSG00000131203.12"
chr8    dexseq_prepare_annotation.py    exonic_part 39917930    39917970    .   +   .   transcripts "ENST00000518804.5+ENST00000522495.5+ENST00000519154.5+ENST00000518237.5+ENST00000521636.1+ENST00000253513.11"; exonic_part_number "015"; gene_id "ENSG00000131203.12"
chr8    dexseq_prepare_annotation.py    exonic_part 39918088    39918138    .   +   .   transcripts "ENST00000518804.5+ENST00000522495.5+ENST00000519154.5+ENST00000518237.5+ENST00000521636.1+ENST00000253513.11"; exonic_part_number "016"; gene_id "ENSG00000131203.12"
chr8    dexseq_prepare_annotation.py    exonic_part 39918139    39918207    .   +   .   transcripts "ENST00000518237.5+ENST00000521636.1+ENST00000522495.5+ENST00000519154.5+ENST00000253513.11"; exonic_part_number "017"; gene_id "ENSG00000131203.12"
chr8    dexseq_prepare_annotation.py    exonic_part 39918815    39918933    .   +   .   transcripts "ENST00000518237.5+ENST00000521636.1+ENST00000522495.5+ENST00000519154.5+ENST00000253513.11"; exonic_part_number "018"; gene_id "ENSG00000131203.12"
chr8    dexseq_prepare_annotation.py    exonic_part 39918934    39918944    .   +   .   transcripts "ENST00000253513.11"; exonic_part_number "019"; gene_id "ENSG00000131203.12"
chr8    dexseq_prepare_annotation.py    exonic_part 39920100    39920114    .   +   .   transcripts "ENST00000518237.5+ENST00000521636.1+ENST00000522495.5+ENST00000519154.5"; exonic_part_number "020"; gene_id "ENSG00000131203.12"
chr8    dexseq_prepare_annotation.py    exonic_part 39922552    39922651    .   +   .   transcripts "ENST00000521480.1+ENST00000522495.5+ENST00000519154.5+ENST00000518237.5+ENST00000521636.1+ENST00000253513.11"; exonic_part_number "021"; gene_id "ENSG00000131203.12"
chr8    dexseq_prepare_annotation.py    exonic_part 39922652    39922654    .   +   .   transcripts "ENST00000521636.1+ENST00000519154.5"; exonic_part_number "022"; gene_id "ENSG00000131203.12"
chr8    dexseq_prepare_annotation.py    exonic_part 39922655    39922911    .   +   .   transcripts "ENST00000521636.1"; exonic_part_number "023"; gene_id "ENSG00000131203.12"
chr8    dexseq_prepare_annotation.py    exonic_part 39923469    39923586    .   +   .   transcripts "ENST00000518237.5+ENST00000521480.1+ENST00000522495.5+ENST00000253513.11"; exonic_part_number "024"; gene_id "ENSG00000131203.12"
chr8    dexseq_prepare_annotation.py    exonic_part 39923587    39923590    .   +   .   transcripts "ENST00000521480.1+ENST00000253513.11"; exonic_part_number "025"; gene_id "ENSG00000131203.12"
chr8    dexseq_prepare_annotation.py    exonic_part 39923591    39923931    .   +   .   transcripts "ENST00000521480.1"; exonic_part_number "026"; gene_id "ENSG00000131203.12"
chr8    dexseq_prepare_annotation.py    exonic_part 39924720    39924720    .   +   .   transcripts "ENST00000523779.1"; exonic_part_number "027"; gene_id "ENSG00000131203.12"
chr8    dexseq_prepare_annotation.py    exonic_part 39924721    39924772    .   +   .   transcripts "ENST00000518237.5+ENST00000523779.1+ENST00000522495.5+ENST00000253513.11"; exonic_part_number "028"; gene_id "ENSG00000131203.12"
chr8    dexseq_prepare_annotation.py    exonic_part 39925223    39925371    .   +   .   transcripts "ENST00000518237.5+ENST00000523779.1+ENST00000522495.5+ENST00000253513.11"; exonic_part_number "029"; gene_id "ENSG00000131203.12"
chr8    dexseq_prepare_annotation.py    exonic_part 39927830    39928138    .   +   .   transcripts "ENST00000518237.5+ENST00000523779.1+ENST00000522495.5+ENST00000253513.11"; exonic_part_number "030"; gene_id "ENSG00000131203.12"
chr8    dexseq_prepare_annotation.py    exonic_part 39928139    39928426    .   +   .   transcripts "ENST00000518237.5+ENST00000522495.5+ENST00000253513.11"; exonic_part_number "031"; gene_id "ENSG00000131203.12"
chr8    dexseq_prepare_annotation.py    exonic_part 39928427    39928431    .   +   .   transcripts "ENST00000518237.5+ENST00000253513.11"; exonic_part_number "032"; gene_id "ENSG00000131203.12"
chr8    dexseq_prepare_annotation.py    exonic_part 39928432    39928444    .   +   .   transcripts "ENST00000253513.11"; exonic_part_number "033"; gene_id "ENSG00000131203.12"

Then, I ran dexseq_count.py like this:

python2.7 dexseq_count.py -f bam -p yes -r pos -s no gencode.v23.annotation.gff sample1_Aligned.out.sorted.bam output.txt

And these are the corresponding counts for all 33 exonic parts of IDO1:

ENSG00000131203.12:001  1
ENSG00000131203.12:002  1
ENSG00000131203.12:003  1
ENSG00000131203.12:004  3
ENSG00000131203.12:005  0
ENSG00000131203.12:006  0
ENSG00000131203.12:007  0
ENSG00000131203.12:008  0
ENSG00000131203.12:009  0
ENSG00000131203.12:010  2
ENSG00000131203.12:011  2
ENSG00000131203.12:012  3
ENSG00000131203.12:013  1
ENSG00000131203.12:014  1
ENSG00000131203.12:015  1
ENSG00000131203.12:016  1
ENSG00000131203.12:017  1
ENSG00000131203.12:018  0
ENSG00000131203.12:019  0
ENSG00000131203.12:020  2
ENSG00000131203.12:021  6
ENSG00000131203.12:022  1
ENSG00000131203.12:023  1
ENSG00000131203.12:024  4
ENSG00000131203.12:025  0
ENSG00000131203.12:026  0
ENSG00000131203.12:027  0
ENSG00000131203.12:028  3
ENSG00000131203.12:029  4
ENSG00000131203.12:030  34
ENSG00000131203.12:031  29
ENSG00000131203.12:032  2
ENSG00000131203.12:033  2

However, when I looked in UCSC there are only 10 IDO1 exons. Because I am only interested in getting the exon level expression and no differential expression I am wondering if there is a way to modify the gff to just keep the actual exon coordinates? Or is there a way to merge these 33 exons to get counts for the 10 exons? Please advise.

Thanks!

iosonofabio commented 6 years ago

Sorry which are the 10 exons?

On December 19, 2017 7:19:21 AM PST, Komal Rathi notifications@github.com wrote:

Hi,

I have a bunch of samples from various tumor types in which I want to get the exon level expression of the gene IDO1. My aim is not to do any differential expression of any sort because there are no control samples. I prepared the gencode v23 gtf to gff using the following command:

python2.7 dexseq_prepare_annotation.py gencode.v23.annotation.gtf
gencode.v23.annotation.gff

This gave me 33 exonic parts for the gene IDO1:

chr8   dexseq_prepare_annotation.py    aggregate_gene  39902275    39928444    .   +   .   gene_id
"ENSG00000131203.12"
chr8   dexseq_prepare_annotation.py    exonic_part 39902275    39902370    .   +   .   transcripts
"ENST00000518804.5"; exonic_part_number "001"; gene_id
"ENSG00000131203.12"
chr8   dexseq_prepare_annotation.py    exonic_part 39902371    39902374    .   +   .   transcripts
"ENST00000518804.5+ENST00000519154.5"; exonic_part_number "002";
gene_id "ENSG00000131203.12"
chr8   dexseq_prepare_annotation.py    exonic_part 39902375    39902379    .   +   .   transcripts
"ENST00000518804.5+ENST00000522495.5+ENST00000519154.5";
exonic_part_number "003"; gene_id "ENSG00000131203.12"
chr8   dexseq_prepare_annotation.py    exonic_part 39902380    39902448    .   +   .   transcripts
"ENST00000518804.5+ENST00000522840.1+ENST00000522495.5+ENST00000519154.5";
exonic_part_number "004"; gene_id "ENSG00000131203.12"
chr8   dexseq_prepare_annotation.py    exonic_part 39908989    39909030    .   +   .   transcripts
"ENST00000522840.1"; exonic_part_number "005"; gene_id
"ENSG00000131203.12"
chr8   dexseq_prepare_annotation.py    exonic_part 39909179    39909348    .   +   .   transcripts
"ENST00000518804.5+ENST00000522840.1"; exonic_part_number "006";
gene_id "ENSG00000131203.12"
chr8   dexseq_prepare_annotation.py    exonic_part 39913001    39913148    .   +   .   transcripts
"ENST00000522495.5"; exonic_part_number "007"; gene_id
"ENSG00000131203.12"
chr8   dexseq_prepare_annotation.py    exonic_part 39913284    39913402    .   +   .   transcripts
"ENST00000518237.5"; exonic_part_number "008"; gene_id
"ENSG00000131203.12"
chr8   dexseq_prepare_annotation.py    exonic_part 39913403    39913773    .   +   .   transcripts
"ENST00000518237.5+ENST00000519154.5"; exonic_part_number "009";
gene_id "ENSG00000131203.12"
chr8   dexseq_prepare_annotation.py    exonic_part 39913774    39913868    .   +   .   transcripts
"ENST00000518237.5+ENST00000522840.1+ENST00000519154.5";
exonic_part_number "010"; gene_id "ENSG00000131203.12"
chr8   dexseq_prepare_annotation.py    exonic_part 39913869    39913888    .   +   .   transcripts
"ENST00000518237.5+ENST00000522840.1+ENST00000253513.11+ENST00000519154.5";
exonic_part_number "011"; gene_id "ENSG00000131203.12"
chr8   dexseq_prepare_annotation.py    exonic_part 39913889    39914009    .   +   .   transcripts
"ENST00000518804.5+ENST00000522495.5+ENST00000519154.5+ENST00000518237.5+ENST00000522840.1+ENST00000253513.11";
exonic_part_number "012"; gene_id "ENSG00000131203.12"
chr8   dexseq_prepare_annotation.py    exonic_part 39917875    39917884    .   +   .   transcripts
"ENST00000518804.5+ENST00000522495.5+ENST00000519154.5+ENST00000518237.5+ENST00000522840.1+ENST00000253513.11";
exonic_part_number "013"; gene_id "ENSG00000131203.12"
chr8   dexseq_prepare_annotation.py    exonic_part 39917885    39917929    .   +   .   transcripts
"ENST00000518237.5+ENST00000522495.5+ENST00000519154.5+ENST00000518804.5+ENST00000521636.1+ENST00000522840.1+ENST00000253513.11";
exonic_part_number "014"; gene_id "ENSG00000131203.12"
chr8   dexseq_prepare_annotation.py    exonic_part 39917930    39917970    .   +   .   transcripts
"ENST00000518804.5+ENST00000522495.5+ENST00000519154.5+ENST00000518237.5+ENST00000521636.1+ENST00000253513.11";
exonic_part_number "015"; gene_id "ENSG00000131203.12"
chr8   dexseq_prepare_annotation.py    exonic_part 39918088    39918138    .   +   .   transcripts
"ENST00000518804.5+ENST00000522495.5+ENST00000519154.5+ENST00000518237.5+ENST00000521636.1+ENST00000253513.11";
exonic_part_number "016"; gene_id "ENSG00000131203.12"
chr8   dexseq_prepare_annotation.py    exonic_part 39918139    39918207    .   +   .   transcripts
"ENST00000518237.5+ENST00000521636.1+ENST00000522495.5+ENST00000519154.5+ENST00000253513.11";
exonic_part_number "017"; gene_id "ENSG00000131203.12"
chr8   dexseq_prepare_annotation.py    exonic_part 39918815    39918933    .   +   .   transcripts
"ENST00000518237.5+ENST00000521636.1+ENST00000522495.5+ENST00000519154.5+ENST00000253513.11";
exonic_part_number "018"; gene_id "ENSG00000131203.12"
chr8   dexseq_prepare_annotation.py    exonic_part 39918934    39918944    .   +   .   transcripts
"ENST00000253513.11"; exonic_part_number "019"; gene_id
"ENSG00000131203.12"
chr8   dexseq_prepare_annotation.py    exonic_part 39920100    39920114    .   +   .   transcripts
"ENST00000518237.5+ENST00000521636.1+ENST00000522495.5+ENST00000519154.5";
exonic_part_number "020"; gene_id "ENSG00000131203.12"
chr8   dexseq_prepare_annotation.py    exonic_part 39922552    39922651    .   +   .   transcripts
"ENST00000521480.1+ENST00000522495.5+ENST00000519154.5+ENST00000518237.5+ENST00000521636.1+ENST00000253513.11";
exonic_part_number "021"; gene_id "ENSG00000131203.12"
chr8   dexseq_prepare_annotation.py    exonic_part 39922652    39922654    .   +   .   transcripts
"ENST00000521636.1+ENST00000519154.5"; exonic_part_number "022";
gene_id "ENSG00000131203.12"
chr8   dexseq_prepare_annotation.py    exonic_part 39922655    39922911    .   +   .   transcripts
"ENST00000521636.1"; exonic_part_number "023"; gene_id
"ENSG00000131203.12"
chr8   dexseq_prepare_annotation.py    exonic_part 39923469    39923586    .   +   .   transcripts
"ENST00000518237.5+ENST00000521480.1+ENST00000522495.5+ENST00000253513.11";
exonic_part_number "024"; gene_id "ENSG00000131203.12"
chr8   dexseq_prepare_annotation.py    exonic_part 39923587    39923590    .   +   .   transcripts
"ENST00000521480.1+ENST00000253513.11"; exonic_part_number "025";
gene_id "ENSG00000131203.12"
chr8   dexseq_prepare_annotation.py    exonic_part 39923591    39923931    .   +   .   transcripts
"ENST00000521480.1"; exonic_part_number "026"; gene_id
"ENSG00000131203.12"
chr8   dexseq_prepare_annotation.py    exonic_part 39924720    39924720    .   +   .   transcripts
"ENST00000523779.1"; exonic_part_number "027"; gene_id
"ENSG00000131203.12"
chr8   dexseq_prepare_annotation.py    exonic_part 39924721    39924772    .   +   .   transcripts
"ENST00000518237.5+ENST00000523779.1+ENST00000522495.5+ENST00000253513.11";
exonic_part_number "028"; gene_id "ENSG00000131203.12"
chr8   dexseq_prepare_annotation.py    exonic_part 39925223    39925371    .   +   .   transcripts
"ENST00000518237.5+ENST00000523779.1+ENST00000522495.5+ENST00000253513.11";
exonic_part_number "029"; gene_id "ENSG00000131203.12"
chr8   dexseq_prepare_annotation.py    exonic_part 39927830    39928138    .   +   .   transcripts
"ENST00000518237.5+ENST00000523779.1+ENST00000522495.5+ENST00000253513.11";
exonic_part_number "030"; gene_id "ENSG00000131203.12"
chr8   dexseq_prepare_annotation.py    exonic_part 39928139    39928426    .   +   .   transcripts
"ENST00000518237.5+ENST00000522495.5+ENST00000253513.11";
exonic_part_number "031"; gene_id "ENSG00000131203.12"
chr8   dexseq_prepare_annotation.py    exonic_part 39928427    39928431    .   +   .   transcripts
"ENST00000518237.5+ENST00000253513.11"; exonic_part_number "032";
gene_id "ENSG00000131203.12"
chr8   dexseq_prepare_annotation.py    exonic_part 39928432    39928444    .   +   .   transcripts
"ENST00000253513.11"; exonic_part_number "033"; gene_id
"ENSG00000131203.12"

Then, I ran dexseq_count.py like this:

python2.7 dexseq_count.py -f bam -p yes -r pos -s no
gencode.v23.annotation.gff sample1_Aligned.out.sorted.bam output.txt

And these are the corresponding counts for all 33 exonic parts of IDO1:

ENSG00000131203.12:001 1
ENSG00000131203.12:002 1
ENSG00000131203.12:003 1
ENSG00000131203.12:004 3
ENSG00000131203.12:005 0
ENSG00000131203.12:006 0
ENSG00000131203.12:007 0
ENSG00000131203.12:008 0
ENSG00000131203.12:009 0
ENSG00000131203.12:010 2
ENSG00000131203.12:011 2
ENSG00000131203.12:012 3
ENSG00000131203.12:013 1
ENSG00000131203.12:014 1
ENSG00000131203.12:015 1
ENSG00000131203.12:016 1
ENSG00000131203.12:017 1
ENSG00000131203.12:018 0
ENSG00000131203.12:019 0
ENSG00000131203.12:020 2
ENSG00000131203.12:021 6
ENSG00000131203.12:022 1
ENSG00000131203.12:023 1
ENSG00000131203.12:024 4
ENSG00000131203.12:025 0
ENSG00000131203.12:026 0
ENSG00000131203.12:027 0
ENSG00000131203.12:028 3
ENSG00000131203.12:029 4
ENSG00000131203.12:030 34
ENSG00000131203.12:031 29
ENSG00000131203.12:032 2
ENSG00000131203.12:033 2

However, when I looked in UCSC there are only 10 IDO1 exons. Because I am only interested in getting the exon level expression and no differential expression I am wondering if there is a way to modify the gff to just keep the actual exon coordinates? Or is there a way to merge these 33 exons to get counts for the 10 exons? Please advise.

Thanks!

-- You are receiving this because you are subscribed to this thread. Reply to this email directly or view it on GitHub: https://github.com/simon-anders/htseq/issues/47

komalsrathi commented 6 years ago

Hi

These are the 10 exons corresponding to NM _002164.5 as seen here: https://www.ncbi.nlm.nih.gov/genome/gdv/browser/?context=gene&acc=3620

These are the coordinates from the supplementary section of this paper:

chr8    39771328    39771528
chr8    39775394    39775489
chr8    39775607    39775726
chr8    39776334    39776463
chr8    39777619    39777633
chr8    39780071    39780170
chr8    39780988    39781109
chr8    39782240    39782291
chr8    39782742    39782890
chr8    39785349    39785945

Could it be because I am looking at various transcripts here instead of just one?

iosonofabio commented 6 years ago

wrong repo, this is HTSeq not dexseq: http://bioconductor.org/packages/release/bioc/html/DEXSeq.html

closing