pengxin2019 closed 3 years ago
How did you generate the gene annotation GRanges object?
What type of information from the annotation is used to plot the genes?
Hi Tim, just a follow-up. Did you have a chance to look into this bug?
Thanks. Best
The annotation information must contain a "type" column (cds, exon, gap, utr). I would suggest creating the annotations as shown in the Signac vignettes, and then re-creating the same format (with all the same columns) with your custom genome annotation
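One way to follow this advice is to compare the metadata column names of a vignette-style annotation with those of the custom annotation. The sketch below uses only base R. The helper name missing_annotation_columns is hypothetical, and the reference column list is an assumption pieced together from the columns mentioned in this thread (type, gene_name, gene_id, gene_biotype, tx_id), so it should be checked against the annotation actually built from the vignette.

```r
# Hypothetical helper: report which expected metadata columns are absent
# from a custom annotation.
missing_annotation_columns <- function(custom_columns, reference_columns) {
  setdiff(reference_columns, custom_columns)
}

# ASSUMPTION: reference column names, taken from the columns discussed in
# this thread. In practice, take them from the vignette-style annotation.
reference_columns <- c("tx_id", "gene_name", "gene_id", "gene_biotype", "type")

# A subset of the columns of an Ensembl GTF import like the one shown in
# this thread.
custom_columns <- c("source", "type", "score", "phase", "gene_id",
                    "gene_name", "gene_biotype", "transcript_id")

missing_columns <- missing_annotation_columns(custom_columns, reference_columns)
print(missing_columns)
```

With a real object, the column names could come from something like colnames(S4Vectors::mcols(Annotation(atac))), where atac stands for the Signac object.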
Hi, I am using the Ensembl annotations and have made sure that I have a "type" column. However, I've noticed that my coverage plots don't show the gene name, even though there is a "gene_name" column in my genome annotation. Any thoughts on why this is? Thanks!
This is what my annotation looks like:
GRanges object with 6 ranges and 20 metadata columns:
seqnames ranges strand | source type score
<Rle> <IRanges> <Rle> | <factor> <factor> <numeric>
[1] 1 396700-409750 + | ensembl_havana gene NA
[2] 1 396700-409676 + | ensembl transcript NA
[3] 1 396700-396905 + | ensembl exon NA
[4] 1 397780-397788 + | ensembl exon NA
[5] 1 399062-399070 + | ensembl exon NA
[6] 1 399557-399827 + | ensembl exon NA
phase gene_id gene_version gene_name gene_source
<integer> <character> <character> <character> <character>
[1] <NA> ENSRNOG00000046319 4 AABR07000046.1 ensembl_havana
[2] <NA> ENSRNOG00000046319 4 AABR07000046.1 ensembl_havana
[3] <NA> ENSRNOG00000046319 4 AABR07000046.1 ensembl_havana
[4] <NA> ENSRNOG00000046319 4 AABR07000046.1 ensembl_havana
[5] <NA> ENSRNOG00000046319 4 AABR07000046.1 ensembl_havana
[6] <NA> ENSRNOG00000046319 4 AABR07000046.1 ensembl_havana
gene_biotype transcript_id transcript_version
<character> <character> <character>
[1] processed_transcript <NA> <NA>
[2] processed_transcript ENSRNOT00000044187 4
[3] processed_transcript ENSRNOT00000044187 4
[4] processed_transcript ENSRNOT00000044187 4
[5] processed_transcript ENSRNOT00000044187 4
[6] processed_transcript ENSRNOT00000044187 4
transcript_name transcript_source transcript_biotype exon_number
<character> <character> <character> <character>
[1] <NA> <NA> <NA> <NA>
[2] AABR07000046.1-202 ensembl processed_transcript <NA>
[3] AABR07000046.1-202 ensembl processed_transcript 1
[4] AABR07000046.1-202 ensembl processed_transcript 2
[5] AABR07000046.1-202 ensembl processed_transcript 3
[6] AABR07000046.1-202 ensembl processed_transcript 4
exon_id exon_version protein_id protein_version tag
<character> <character> <character> <character> <character>
[1] <NA> <NA> <NA> <NA> <NA>
[2] <NA> <NA> <NA> <NA> <NA>
[3] ENSRNOE00000493937 1 <NA> <NA> <NA>
[4] ENSRNOE00000544646 1 <NA> <NA> <NA>
[5] ENSRNOE00000574883 1 <NA> <NA> <NA>
[6] ENSRNOE00000481734 2 <NA> <NA> <NA>
-------
seqinfo: 162 sequences from Rnor_6.0 genome; no seqlengths
UPDATE: I noticed that it does sometimes show the gene name, but not always. I would guess that the name is dropped when the entire gene isn't in the plotting frame. Is there any way to show the gene name even for genes that are only partially located in the plotting frame? Thanks!
Hi Jessica,
I hope this finds you well.
Could you give me a hint on how to get the "type" column of the annotation? How did you get the annotation object? I got mine from AnnotationHub, and my annotation looks like this:
head(Annotation(integrated))
# GRanges object with 6 ranges and 8 metadata columns:
# seqnames ranges strand | gene_id
# <Rle> <IRanges> <Rle> | <character>
# ENSSSCG00000037372 1 3472-18696 - | ENSSSCG00000037372
# ENSSSCG00000027257 1 23368-40113 + | ENSSSCG00000027257
# ENSSSCG00000029697 1 96218-186785 - | ENSSSCG00000029697
# ENSSSCG00000027274 1 112763-113499 - | ENSSSCG00000027274
# ENSSSCG00000027726 1 198992-211342 + | ENSSSCG00000027726
# ENSSSCG00000033475 1 352196-356072 + | ENSSSCG00000033475
# gene_name gene_biotype seq_coord_system
# <character> <character> <character>
# ENSSSCG00000037372 TBP protein_coding chromosome
# ENSSSCG00000027257 PSMB1 protein_coding chromosome
# ENSSSCG00000029697 FAM120B protein_coding chromosome
# ENSSSCG00000027274 protein_coding chromosome
# ENSSSCG00000027726 DLL1 protein_coding chromosome
# ENSSSCG00000033475 lincRNA chromosome
# description gene_id_version symbol
# <character> <character> <character>
# ENSSSCG00000037372 TATA-box binding pro.. ENSSSCG00000037372.1 TBP
# ENSSSCG00000027257 proteasome subunit b.. ENSSSCG00000027257.2 PSMB1
# ENSSSCG00000029697 family with sequence.. ENSSSCG00000029697.2 FAM120B
# ENSSSCG00000027274 NULL ENSSSCG00000027274.2
# ENSSSCG00000027726 delta like canonical.. ENSSSCG00000027726.2 DLL1
# ENSSSCG00000033475 NULL ENSSSCG00000033475.1
# entrezid
# <list>
# ENSSSCG00000037372 110259740
# ENSSSCG00000027257 100621969
# ENSSSCG00000029697 100620583
# ENSSSCG00000027274 <NA>
# ENSSSCG00000027726 100620481
# ENSSSCG00000033475 <NA>
# -------
# seqinfo: 352 sequences
Thanks. Best, Penny
Hi Penny,
I ended up downloading the GTF file from Ensembl corresponding to the assembly that I aligned my reads to, and that GTF file had a gene type column. I loaded the file into R as a GRanges object (you can use the import function from the rtracklayer package to load the GTF file directly as a GRanges object, or you can read it in as a data frame and convert it to a GRanges object yourself). I then set that GRanges object as the annotation for my Signac object. For example, something like:
Annotation(atac) <- gtf.gr
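For the second route mentioned above (reading the GTF as a data frame first), a minimal base R sketch could look like the following. The helper names extract_gtf_attribute and read_gtf_as_data_frame are hypothetical, the tiny GTF written to a temporary file is made up for demonstration, and the set of attribute keys pulled out is an assumption that may need extending for a real file.

```r
# Names for the nine tab-separated GTF fields; "type" matches the column
# name discussed in this thread.
gtf_fields <- c("seqnames", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")

# Hypothetical helper: pull one key (e.g. gene_name) out of the GTF
# attribute strings; returns NA where the key is absent.
extract_gtf_attribute <- function(attributes, key) {
  pattern <- paste0("(^|;[[:space:]]*)", key, " \"([^\"]*)\"")
  matches <- regmatches(attributes, regexec(pattern, attributes, perl = TRUE))
  vapply(matches,
         function(m) if (length(m) == 3) m[3] else NA_character_,
         character(1))
}

# Hypothetical helper: read a GTF file into a data frame with one column
# per requested attribute key.
read_gtf_as_data_frame <- function(path,
                                   keys = c("gene_id", "gene_name", "gene_biotype")) {
  gtf <- read.delim(path, header = FALSE, comment.char = "#", quote = "",
                    col.names = gtf_fields, colClasses = "character",
                    stringsAsFactors = FALSE)
  gtf$start <- as.integer(gtf$start)
  gtf$end <- as.integer(gtf$end)
  for (key in keys) {
    gtf[[key]] <- extract_gtf_attribute(gtf$attributes, key)
  }
  gtf$attributes <- NULL
  gtf
}

# Made-up two-record GTF, written to a temporary file for demonstration.
demo_path <- tempfile(fileext = ".gtf")
writeLines(c(
  "#!genome-build demo",
  paste("1", "ensembl", "gene", "396700", "409750", ".", "+", ".",
        "gene_id \"G1\"; gene_name \"demoA\"; gene_biotype \"protein_coding\";",
        sep = "\t"),
  paste("1", "ensembl", "exon", "396700", "396905", ".", "+", ".",
        "gene_id \"G1\"; gene_name \"demoA\";",
        sep = "\t")
), demo_path)

gtf.df <- read_gtf_as_data_frame(demo_path)
print(gtf.df[, c("seqnames", "start", "end", "type", "gene_name", "gene_biotype")])
```

If GenomicRanges is installed, a data frame like this could then be converted with GenomicRanges::makeGRangesFromDataFrame(gtf.df, keep.extra.columns = TRUE) and assigned as the annotation as shown above. Using rtracklayer's import function remains the more direct route.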
Hope that helps!
Hi Jessica, This is sweet! Thanks for your suggestions!! Good luck with your analysis.
Best Penny
Hi - I'm having the same issue where I am getting the error
Error in annotation[annotation$type == "body", ] :
incorrect number of dimensions
when using CoveragePlot() with annotation = TRUE, and I can't figure out why. I am reading in a GTF file using rtracklayer and there is a "type" column. Has anyone figured out why this is happening?
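As general background on the message itself: in base R, "incorrect number of dimensions" is raised when row/column indexing (x[rows, ]) is applied to an object that is not two-dimensional. The sketch below reproduces the same call shape on a toy data frame and on a plain list. It only illustrates what the message means in R; it is not Signac's code and does not diagnose where the annotation loses its table shape.

```r
# Toy stand-in for an annotation table (made-up values).
annotation <- data.frame(
  type = c("body", "exon", "body"),
  gene_name = c("geneA", "geneA", "geneB"),
  stringsAsFactors = FALSE
)

# On a data frame, row/column indexing works as expected.
body_rows <- annotation[annotation$type == "body", ]
print(nrow(body_rows))

# The same content as a plain list: `$` still works, but the object has
# no rows and columns, so [rows, ] fails.
annotation_list <- as.list(annotation)
error_message <- tryCatch(
  annotation_list[annotation_list$type == "body", ],
  error = function(e) conditionMessage(e)
)
print(error_message)
```

Checking class() and dim() of the intermediate object at the failing line is therefore a reasonable first debugging step.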
@14zac2 have you found a solution to your problem? I am having the exact same issue.
@MoritzTh I did! My GTF file was missing the column gene_biotype, so I simply modified my GTF file to have that column, just putting in "protein_coding" for every row. Not sure if the value in the column makes any difference, but not having it there was causing the bug:
gtf$gene_biotype <- "protein_coding"
Thanks for answering! The gene_biotype column is present in my GTF file, so I'm not sure what the issue might be here.
@MoritzTh no problem! I found that the missing column was the issue with mine by going through the code of the graphing function line by line (you can look at the function by running View(CoveragePlot)) and noticing that the column was used in the function. Maybe this process would help you as well! I would just try to make sure that all of your custom files are formatted exactly as expected.
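The line-by-line reading described above can be partly automated. The following base R sketch scans a function's deparsed body for `$column` accesses. The helper name columns_accessed and the toy function are hypothetical, and the approach is only a rough text search: it misses columns accessed through [[ ]] or through helper functions that the plotting function calls, which would need to be inspected separately.

```r
# Hypothetical helper: list the names that follow `$` in a function body.
columns_accessed <- function(fun) {
  source_text <- paste(deparse(body(fun)), collapse = "\n")
  hits <- regmatches(
    source_text,
    gregexpr("\\$[A-Za-z_.][A-Za-z0-9_.]*", source_text)
  )[[1]]
  unique(sub("^\\$", "", hits))
}

# Toy stand-in for a plotting helper (NOT Signac's code).
toy_plot <- function(annotation) {
  exons <- annotation[annotation$type == "exon", ]
  split(exons, exons$gene_biotype)
}

accessed <- columns_accessed(toy_plot)
print(accessed)
```

Applied to a real function, the result gives a quick list of column names to compare against the custom annotation.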
I'm having the exact same issues as well. I've implemented the fix of putting "protein_coding" in the gene_biotype field, and also made sure a regular type field is present and has CDS, exon, gene, transcript, etc.
The output of my granges annotation object is below. Am I missing anything? Any ideas what could be causing these errors?
> head(gene.coords)
GRanges object with 6 ranges and 22 metadata columns:
seqnames ranges strand | source type score phase gene_id
<Rle> <IRanges> <Rle> | <factor> <factor> <numeric> <integer> <character>
[1] myo1 1-2076 + | NA gene NA <NA> myotoxin1
[2] Z 5240-9278 - | maker gene NA <NA> maker-Z-augustus-gen..
[3] Z 43556-53413 + | maker gene NA <NA> maker-Z-augustus-gen..
[4] Z 74320-75666 + | maker gene NA <NA> augustus_masked-Z-pr..
[5] Z 98879-137683 + | maker gene NA <NA> augustus_masked-Z-pr..
[6] Z 139942-146170 + | maker gene NA <NA> maker-Z-augustus-gen..
transcript_id gene_name Parent gene_biotype Anolis_Blast_Type
<character> <character> <character> <character> <character>
[1] myotoxin_model_1 myotoxin1 <NA> protein_coding <NA>
[2] <NA> maker-Z-augustus-gen.. <NA> protein_coding ONEWAY
[3] <NA> maker-Z-augustus-gen.. <NA> protein_coding RBB
[4] <NA> augustus_masked-Z-pr.. <NA> protein_coding RBB
[5] <NA> augustus_masked-Z-pr.. <NA> protein_coding RBB
[6] <NA> maker-Z-augustus-gen.. <NA> protein_coding ONEWAY
Anolis_Homolog Crovir_Transcript_ID Name Python_Blast_Type
<character> <character> <character> <character>
[1] <NA> <NA> <NA> <NA>
[2] XP_016852035.1_10055.. crovir-transcript-1688 maker-Z-augustus-gen.. ONEWAY
[3] XP_003225965.3_10055.. crovir-transcript-1686 maker-Z-augustus-gen.. RBB
[4] XP_008116758.2_10328.. crovir-transcript-1684 augustus_masked-Z-pr.. RBB
[5] XP_008121683.1_10055.. crovir-transcript-1685 augustus_masked-Z-pr.. RBB
[6] XP_008121683.1_10055.. crovir-transcript-1687 maker-Z-augustus-gen.. ONEWAY
Python_Homolog Thamnophis_Blast_Type Thamnophis_Homolog X_AED X_QI
<character> <character> <character> <character> <character>
[1] <NA> <NA> <NA> <NA> <NA>
[2] XP_015744721.1_10306.. RBB XP_013928481.1_10655.. <NA> <NA>
[3] XP_007435060.1_10306.. RBB XP_013923894.1_10655.. <NA> <NA>
[4] XP_007443267.1_10305.. ONEWAY XP_013927888.1_10655.. <NA> <NA>
[5] XP_007434210.1_10305.. RBB XP_013913927.1_10654.. <NA> <NA>
[6] XP_015744560.1_10305.. ONEWAY XP_013913930.1_10654.. <NA> <NA>
X_eAED Crovir_Protein_ID previous_transcript_id
<character> <character> <character>
[1] <NA> <NA> <NA>
[2] <NA> <NA> <NA>
[3] <NA> <NA> <NA>
[4] <NA> <NA> <NA>
[5] <NA> <NA> <NA>
[6] <NA> <NA> <NA>
-------
seqinfo: 21 sequences from an unspecified genome; no seqlengths
So am I. SidG13 could not have shown it better. I am stuck at the same spot. How should the GRanges object be modified to have CoveragePlot show gene models?
@SidG13 Hi! I was having the same problem. I think CoveragePlot needs the "tx_id" column. It finally worked when I changed the "gene_id" to "tx_id" in the annotation from ensembl. Hope it helps!
I know it's a very belated reply, but maybe it will help somebody else. For me, the run failed if BOTH type and biotype info were present: the script that subsets the object for features ends up subsetting the extra type column and dropping the last of the other needed columns instead, so the table used for plotting the gene or feature becomes invalid.
As you can see from the picture above, the gene track is missing at the bottom of the plot, and I would like it to appear. Is this because the annotation has no information on whether the region in each row corresponds to an exon or an intron? For example, the "seq_coord_system" column of head(Annotation(integrated)) only says "chromosome" rather than "intron" or "exon".
Thanks