virajbdeshpande / AmpliconArchitect

AmpliconArchitect (AA) is a tool to identify one or more connected genomic regions which have simultaneous copy number amplification and elucidates the architecture of the amplicon. In the current version, AA takes as input next generation sequencing reads (paired-end Illumina reads) mapped to the hg19/GRCh37 reference sequence and one or more regions of interest. Please "watch" this repository for improvements in runtime, accuracy and annotations for GRCh38 human reference genome coming up soon.
Other
135 stars 43 forks source link

coverage.stats #141

Open pawanchk opened 1 year ago

pawanchk commented 1 year ago

Hi,

I have two queries regarding the coverage.stats file that gets updated in the data_repo folder everytime we run AmpliconArchitect for a particular sample -

  1. what are the different columns in this file ?
  2. are any of those data in the coverage.stats file also found in the AmpliconArchitect results folder for a particular sample (bam file) executed using AmpliconArchitect ?

Any advice would be helpful.

Best Regards, Pawan

jluebeck commented 1 year ago

Hi Pawan,

The coverage.stats file reports the following values in each line (list represents each column from left to right)

bam: path of the bam file wc_10000_median: median coverage in sampled windows of size 10000bp wc_10000_avg: average coverage in sampled windows of size 10000bp wc_10000_std: std. dev of coverage in sampled windows if size 10000bp wc_300_median: median coverage in sampled windows of size 300bp wc_300_avg: average coverage in sampled windows of size 300bp wc_300_std: std. dev of coverage in sampled windows if size 300bp read_length: read length in bam file insert_size: average insert size insert_std: insert size standard deviation min_insert: minimum detected insert size max_insert: maximum detected insert size pair_support: number of read pairs of support required to call a discordant edge (SV) percent_proper: estimated fraction of properly-paired reads num_sdevs: number of standard deviations used for insert-size boundaries (this is a user-specifiable AA parameter) bamfile_filesize: filesize of bam file (to help ensure the bam file matches its stats when re-using these statistics for re-runs).

The bam, pair_support and num_sdevs will be reported in the log files in a human-readable way.

It's important to note that these values are quickly computed estimates designed to help guide AA, and they are not exhaustive or exact summaries of those properties in the bam file. For more exact profiling of your bam file you may want to instead use samtools, or similar.

Thanks, Jens

pawanchk commented 1 year ago

Hi Jens,

Thanks for your response with the details, this is very helpful.

These are some follow-up queries that I have -

  1. Can I please know in which log file these info are provided - bam, pair_support and num_sdevs ?
  2. Also, what about the other columns, will they be found in any other log files and how important are these in inferring the AA results for ecDNA ?
  3. In case, we want to regenerate these data, what is the best way ?
jluebeck commented 1 year ago

Hi Pawan,

  1. The main [sample].log file produced by AA will report on various lines a line of text that says pair support 2 (or similar) and the both bam file and insert sdevs will be reported there as well in one of the first lines that shows the command that was run for AA. If insert sdevs is not reported, then it is using the default 3. You may of course find that if you really need this information that the coverage.stats file will be easier to extract it from.

  2. No, they are only reported in coverage.stats. They may also be printed to stdout when amplified_intervals.py is run (assuming you are using AmpliconSuite-pipeline. These values certainly are important for AA to properly estimate bulk copy numbers using coverage, and also to identify SVs. AA would not run without having this information.

  3. The best way to regenerate this data is to run AmpliconSuite-pipleline on your sample again (up to the point of generating seed regions - no need to set --run_AA, --run_AC).

Thanks, Jens