thackl / gggenomes

A grammar of graphics for comparative genomics
https://thackl.github.io/gggenomes/
Other
594 stars 64 forks source link

Two tibble gff3 files appear empty when loading in. #194

Open tania-k opened 4 months ago

tania-k commented 4 months ago

Hello gggenomes folks.

I am currently attempting to run your program on my dataset that are subsections of:

  1. A Nucleotide Fasta file- a portion of my chromosome JAEVHH010000002.1 from length 6094000-6107000 bp.
  2. A GFF3 file - Containing a portion of my GFF3 file ranging from 6094000-6107000 bp.
  3. A Transposable element annotation GFF3 file - Containing a portion of my original file ranging from 6094000-6107000 bp.

Looks like:

  1. FASTA (head)

    JAEVHH010000002.1 GCATCTTGATTGAGGACAGCTCTCTTAGGCAAATCTGACAGTTTGGGGATGTTGATTTGGTTAAGTGTCGCCATAGTGGA ACTTCTGATGGGATGCAGTTATAGGCAGTGTAAATAATGTGAAGATGACTGATCTTTCTTTAGATAGTTCAGGGATGAGT TCTGCAACAGCAATGAACAGGCATAGACAAGAAGAAAAGCAAATACAATAAGAGATGAACAGCAGTTCACTGTGTTCCTG TTATATGAGTCAAGAGATCTGACTGACATTAGTTGCTGCTGTAACAACACAACAGTGAATGTCAGTATCTTATACCTCAA GAAACATCCCAACAAAGACTAATAAGAACTAGATCATGGCTGACCACTGCCAAACTCTATGAACTTAACACATGGAAAAC ACTTCAACTTATAAATAGCAAATGTAAATGCCTAGAGTGCATGGTGGTCTGAAGTGTGGAACAGGCGTTTCTTCTTCTAC TGCAGCTAAATAAACCATCTGGTAGTAGCAATTGAAGTTTTCTtcccttctttttcttttagaCCCCGATAACGCTCCGT GTATATGTATGTAACAGTGATTTTGAGGGGCATCATTCCCGTTGGAAATATTGACGTAGAATTGCATATCAGAGAATGAG TAGCTTCCACGTAGCAATAATTAACAGAGAGAATAAGAAACATCAGACTTAGATAGTGAACTACCTAGAAACTGCCTATA

  2. GFF3 (head)

    gff-version 3

    JAEVHH010000002.1 Genbank gene 6094524 6097279 . - . ID=gene-I7I48_07445;Name=I7I48_07445;Note=similar to ucsf_hc.01_1.G217B.02409;gbkey=Gene;gene_biotype=protein_coding;locus_tag=I7I48_07445 JAEVHH010000002.1 Genbank mRNA 6094524 6097279 . - . ID=rna-gnl|WGS:JAEVHH|mrna.I7I48_07445;Parent=gene-I7I48_07445;gbkey=mRNA;locus_tag=I7I48_07445;orig_protein_id=gnl|WGS:JAEVHH|I7I48_07445;orig_transcript_id=gnl|WGS:JAEVHH|mrna.I7I48_07445;product=serum paraoxonase/arylesterase JAEVHH010000002.1 Genbank exon 6096870 6097279 . - . ID=exon-gnl|WGS:JAEVHH|mrna.I7I48_07445-1;Parent=rna-gnl|WGS:JAEVHH|mrna.I7I48_07445;gbkey=mRNA;locus_tag=I7I48_07445;orig_protein_id=gnl|WGS:JAEVHH|I7I48_07445;orig_transcript_id=gnl|WGS:JAEVHH|mrna.I7I48_07445;product=serum paraoxonase/arylesterase JAEVHH010000002.1 Genbank exon 6096553 6096792 . - . ID=exon-gnl|WGS:JAEVHH|mrna.I7I48_07445-2;Parent=rna-gnl|WGS:JAEVHH|mrna.I7I48_07445;gbkey=mRNA;locus_tag=I7I48_07445;orig_protein_id=gnl|WGS:JAEVHH|I7I48_07445;orig_transcript_id=gnl|WGS:JAEVHH|mrna.I7I48_07445;product=serum paraoxonase/arylesterase JAEVHH010000002.1 Genbank exon 6095606 6096476 . - . ID=exon-gnl|WGS:JAEVHH|mrna.I7I48_07445-3;Parent=rna-gnl|WGS:JAEVHH|mrna.I7I48_07445;gbkey=mRNA;locus_tag=I7I48_07445;orig_protein_id=gnl|WGS:JAEVHH|I7I48_07445;orig_transcript_id=gnl|WGS:JAEVHH|mrna.I7I48_07445;product=serum paraoxonase/arylesterase JAEVHH010000002.1 Genbank exon 6094524 6095543 . - . ID=exon-gnl|WGS:JAEVHH|mrna.I7I48_07445-4;Parent=rna-gnl|WGS:JAEVHH|mrna.I7I48_07445;gbkey=mRNA;locus_tag=I7I48_07445;orig_protein_id=gnl|WGS:JAEVHH|I7I48_07445;orig_transcript_id=gnl|WGS:JAEVHH|mrna.I7I48_07445;product=serum paraoxonase/arylesterase JAEVHH010000002.1 Genbank CDS 6096870 6096893 . - 0 ID=cds-KAG5298099.1;Parent=rna-gnl|WGS:JAEVHH|mrna.I7I48_07445;Dbxref=NCBI_GP:KAG5298099.1;Name=KAG5298099.1;gbkey=CDS;locus_tag=I7I48_07445;orig_transcript_id=gnl|WGS:JAEVHH|mrna.I7I48_07445;product=serum paraoxonase/arylesterase;protein_id=KAG5298099.1

  3. TE GFF3 (head)

    gff-version 3

    JAEVHH010000002.1 EDTA Copia_LTR_retrotransposon 6106870 6106957 521 + . ID=TE_homo_3261;Name=TE_00000452_LTR;Classification=LTR/Copia;Sequence_ontology=SO:0002264;Identity=0.92;Method=homology JAEVHH010000002.1 EDTA Copia_LTR_retrotransposon 6106958 6107257 1063 + . ID=TE_homo_3262;Name=TE_00000131_INT;Classification=LTR/Copia;Sequence_ontology=SO:0002264;Identity=0.861;Method=homology JAEVHH010000002.1 EDTA LTR_retrotransposon 6107268 6107505 743 + . ID=TE_homo_3263;Name=TE_00000906_LTR;Classification=LTR/unknown;Sequence_ontology=SO:0000186;Identity=0.836;Method=homology JAEVHH010000002.1 EDTA Gypsy_LTR_retrotransposon 6107921 6108480 3528 + . ID=TE_homo_3265;Name=TE_00000327_INT;Classification=LTR/Gypsy;Sequence_ontology=SO:0002265;Identity=0.881;Method=homology JAEVHH010000002.1 EDTA Gypsy_LTR_retrotransposon 6108475 6108728 1604 + . ID=TE_homo_3266;Name=TE_00000327_INT;Classification=LTR/Gypsy;Sequence_ontology=SO:0002265;Identity=0.888;Method=homology JAEVHH010000002.1 EDTA Copia_LTR_retrotransposon 6108752 6109500 4635 + . ID=TE_homo_3268;Name=TE_00000131_INT;Classification=LTR/Copia;Sequence_ontology=SO:0002264;Identity=0.823;Method=homology JAEVHH010000002.1 EDTA Copia_LTR_retrotransposon 6109501 6109604 668 + . ID=TE_homo_3269;Name=TE_00000452_LTR;Classification=LTR/Copia;Sequence_ontology=SO:0002264;Identity=0.929;Method=homology

Called as:

  1. seq
  2. gff3
  3. repeat_edta

When running:

z <- gggenomes(genes=gff) + geom_gene() + geom_seq() + geom_feat(aes(color=type), data=feats(genes)) + geom_bin_label() + geom_seq_label()

I receive plot_zoom_png It has all my sequences on separate lines, but adding in any of the other two features, seq or repeat_edta throws errors.

z <- gggenomes(genes=gff, seqs=seq) + geom_gene() + geom_seq() + geom_feat(aes(color=type), data=feats(genes)) + geom_bin_label() + geom_seq_label() z Only sawtype=NAin genes and will treat everything astype="CDS".

z <- gggenomes(gff,repeat_edta) + 
  geom_gene() + 
  geom_seq() + 
  geom_feat(aes(color=type), data=feats(genes)) + 
  geom_bin_label() + 
  geom_seq_label()

Error inrequire_vars(): ! Required column(s) missing: • length Runrlang::last_trace()to see where the error occurred.

My first error adding all three files gives me: `> p <- gggenomes(seqs = seq, genes = gff, feats = repeat_edta) + geom_seq() +

I am unsure how to progress.

I am running gggenomes on RStudio run through Linux/HPCC. I've updated ggplots2 to 3.5.0 and other dependencies along with restarting my R session.

Thank you for your time.

thackl commented 4 months ago

Dear Tania,

thanks for reaching out.

The problem with your first error (genes + seqs) is that gggenomes only reads one thing from the fasta file with the sequence: the length, in your case 13000bp. But the coordinates of you genes are then 6107000. So when gggenomes tries to plot that, it cannot find any genes that fit on a 13000bp contig. If you want to zoom in on parts of sequences, you either not provide the sequence at all and gggenomes will just zoom in on the range of genes you provide, or you can explicitly set the length together with start end end for the sequence (which I did in the example below)

In the second case, you are reading your TE gff as seq= (second argument of gggenomes). What should work is z <- gggenomes(gff, feats=repeat_edta) + ....

This works for me (for the parts of the gffs that you provided)

# explicitly specify range of chromosome to plot
s0 <- tibble(
  seq_id = "JAEVHH010000002.1",
  length = 6107000,
  start = 6094000,
  end = 6107000
)

# read genes
g0 <- read_feats("genes.gff")
# read TE
t0 <- read_feats("te.gff")

gggenomes(g0, s0, t0) +
  geom_gene() + 
  geom_seq() + 
  geom_feat(aes(color=type), data=feats(genes)) + 
  geom_bin_label() + 
  geom_seq_label() +
  geom_gene() +
  geom_feat()

image

Hope that helps!

thackl commented 4 months ago

Hi Tania,I be happy to take another look tomorrow. Would you mind sharin your entire gene and TE gff file. Would make it easier to see what the issue is. I will of course treat the data confidential.BestThomasOn 3 Jul 2024 21:06, Tania Kurbessoian @.***> wrote: Hi thackl, Thank you for taking the time to explain that for me. I did get the visuals to work after some time... kinda. I realized my two tracks are on opposite strands. My gene GFF file on the negative strand, while my GFF containing TEs are on the positive strand. I am not sure if that is why my TEs are not appearing on the strand properly? I'm not receiving any errors. I've also adjusted the position option to identity and pile (default) but they just stack in one corner on the right. p <- gggenomes(gff, s0, repeat_edta) + geom_gene(aes(fill = name), data=feats(genes)) + geom_feat(linewidth = 5, position = "jitter", aes(color= type), data=feats(feats)) + geom_seq() + geom_bin_label() + guides(color = guide_legend(title = "Repetitive elements"),fill = guide_legend(title = "GeneID")) + scale_color_manual(values = c("#CC79A7", "#D55E00")) p Screenshot.2024-07-03.at.12.03.53.PM.png (view on web) Any suggestions?

—Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you commented.Message ID: @.***>

tania-k commented 4 months ago

Hi Thomas, I figured it out, required a bit of adjusting but here is my code in entirety (if folks fall into the same mis-step like me)

`s0 <- tibble( seq_id = "H_ohiense_G217B_JAEVHH010000002.1", length = 16000, start = 6094000, end = 6110000 )

have gene gff3 first, then sequence, then other gff3 file(repeat content) and pull from feats instead of genes.

p <- gggenomes(gff, s0, repeat_edta) + geom_gene(aes(fill = name), data=feats(genes)) + geom_feat(linewidth = 5, position = "identity", aes(color= type), data=feats(feats)) + scale_color_manual(values = c("#D55E00", "#CC79A7", "#56B4E9")) + geom_seq() + geom_bin_label() + guides(color = guide_legend(title = "Repetitive elements"),fill = guide_legend(title = "GeneID")) p `

I realized I needed to broaden my visual to capture the TEs as they were past the gene coordinates in my s0 variables. Attached is my result. AGS1_G217B

Thanks again for all the help!! Tania