showteeth / ggcoverage

Visualize and annotate genomic coverage with ggplot2
https://showteeth.github.io/ggcoverage
Other
227 stars 20 forks source link

Non-eukaryotic data and input files? #9

Closed lxsteiner closed 6 months ago

lxsteiner commented 1 year ago

Hi,

This looks very useful but from all the examples it seems the package focuses mostly on eukaryotic omic data? I wanted to use this for bacterial genomes, to generate a similar plot: image

plotting genome coverage, GC, gene annotation, and CNV annotation (without the ideogram). Would this be doable for a bacterial/viral genome and corresponding data?

Looking at the handy tutorials you provided, unfortunately I don't see how to get started from the initial files without loading existing data, or what their format should be. For example, I have:

The way I understand the pipeline:

  1. Prepare GTF file like this:
    # prepare gtf
    gtf.file = system.file("extdata", "annotation.gtf", package = "ggcoverage")
    gtf.gr = rtracklayer::import.gff(con = gtf.file, format = 'gtf')
  2. Sample metadata - not sure how or if this is needed. I only have one sample in this case.
  3. Prepare BAM track data:
    # load bam file
    bam.file = system.file("extdata", "DNA-seq", "aligned.bam", package = "ggcoverage")
    track.df <- LoadTrackFile(
    track.file = bam.file
    )
  4. Load CNV file (is this just a VCF file?):
    cnv.file <- system.file("extdata", "DNA-seq", "variant.vcf", package = "ggcoverage")
    # read CNV
    cnv.df = read.table(file = cnv.file, sep = "\t", header = TRUE)
  5. Plot basic coverage:
    basic.coverage = ggcoverage(data = track.df, color = "grey", mark.region = NULL,
                            range.position = "out")
    basic.coverage
  6. Plot GC, CNV annotations, gene annotations:
    basic.coverage +
    geom_gc(fa.file=genome.fasta) +
    geom_cnv(cnv.df = cnv.df, bin.col = 3, cn.col = 4)
    geom_gene(gtf.gr=gtf.gr)

I guess it should look something like that? Thank you for any feedback!

showteeth commented 1 year ago

Hi, It's sure that you can use ggcoverage to visualize and annotate bacterial/viral genome coverage.

  1. For a basic coverage plot, ggcoverage requires:

  2. Adding GC content annotation requires genome FASTA file to calculate GC content.

  3. Adding gene annotation requires GTF file, and the gtf file should contain "seqnames", "start", "end", "strand", "type", "gene_name" columns and one of "gene_type", "gene_biotype". Example gtf file

  4. Adding CNV annotation requires a file containing copy number results according to Genome-wide copy number analysis of single cells. You can provide a customed file with four columns: chr, position, bin value (point), copy number(line). Example cnv file

  5. Adding SNV annotation requires bam file and genome FASTA file (reference base and aa).

Hope it helps,

YB

AceM1188 commented 11 months ago

This is actually something I'm looking to accomplish for plotting coverage of reads to a bacterial genome reference for one sample. From a very basic effort (as a novice), I've tried the following but am running into an error when I try to create a track dataframe from bam file.

S1_tracks = LoadTrackFile("../BaseSpace/Bacterialgenomes-402632365/FASTQ_Generation_2023-11-16_04_36_48Z-703830129/S1_bam/S1_sort.bam", track.folder = NULL, format = "bam", norm.method = "None", meta.info = sample.meta)

This generates the following error:

Calculate coverage with GenomicAlignments when norm.method is None! Error in .io_bam(.scan_bamfile, file, reverseComplement, yieldSize(file), : seqlevels(param) not in BAM header: seqlevels: 'chr14' file: ../BaseSpace/Bacterialgenomes-402632365/FASTQ_Generation_2023-11-16_04_36_48Z-703830129/S1_bam/S1_sort.bam index: ../BaseSpace/Bacterialgenomes-402632365/FASTQ_Generation_2023-11-16_04_36_48Z-703830129/S1_bam/S1_sort.bam

I don't know where the 'chr14' is coming from particularly as my alignment is to a single bacterial genome.

Of note, here's my meta data file: sample.meta SampleName Type Group
1 S1 BAL Patient

Here's a text file of my session info. session-info.txt And here's the link to my sorted bam file: https://drive.google.com/file/d/1kf0oGu2h35F5wr2OwAse3OkoNWe5IVwb/view?usp=sharing

I'm sure this is something ridiculously simple, but I'm just not figuring it out. Could you help me understand what the error is here?

m-jahn commented 7 months ago

Hi, sorry for late reply, but now I looked into that issue. The problem is that the LoadTrackFile function is currently quite inflexible. Many options are hardcoded towards human data.

To make your exmaple work, you need to specify the "region" argument and your samples in the meta data file need to correspond to the name of the BAM file. This example works for me:

library(ggcoverage)
library(dplyr)

sample.meta <- data.frame(
  SampleName = "S1_sort",
  Type = "BAL",
  Group = "Patient"
)

# look for name of assembly/chromosomes in bam file
data = rtracklayer::import("~/Downloads/S1_sort.bam", format = "bam")
seqlevels(data)

# supply custom region
df_coverage = LoadTrackFile(
  track.file = "~/Downloads/S1_sort.bam",
  track.folder = NULL,
  format = "bam",
  norm.method = "None",
  meta.info = sample.meta,
  region = "NZ_CP065651.1:25,000-30,000"
)

# need to fix the end coordinate because binning is currently not
# working for non-normlaized data
df_coverage$end <- df_coverage$end + 1
ggcoverage(df_coverage, facet.key = "Type", group.key = "Group")

image

It shouldn't be that complicated, will try to fix this.

m-jahn commented 7 months ago

Will be fixed with commit 8c537bbc13c6a95ebfa3f6d6c44c495e0a21af2f