tgen / lumosVar2

Calls somatic SNVs, indels, and allelic copy number jointly across multiple samples from the same patient. These can be standard tumor/normal pair, longitudinal samples, primary/met, etc. Can also be used for tumor only calling, ideally with a high tumor content and a low tumor content sample.
MIT License
10 stars 1 forks source link

What are the columns in the output of runNormalMetrics? #8

Open kotliary opened 3 years ago

kotliary commented 3 years ago

Since there is no header, I'm trying to guess. Tried to read the gvm's code, but it's really not clear.

The first two are easy - chromosome and position. What are the other four?

I'm specifically interested in the 4th column. Got very different results between two runs with the same set of bam files. For the second run the bam files were just reordered with picard's ReorderSam, then re-indexed. Thanks

sid-code commented 3 years ago

I'll do my best to describe the fields that gvm outputs for normal metrics.

The code for normal metrics calculation is here: https://github.com/tgen/gvm/blob/master/src/nmcalc.c

the 4 fields are (in order) "Normalized read depth", "probability of mapping error", "per read pass", and "a/b fraction"

Normalized read depth: read depth normalized based on ploidy (I assume this is needed because haploid read depths will be generally different from diploid read-depths?)

Probability of mapping error: probability that this data point is due to a mapping error, calculated using the mean_mq (mapping quality) for that position and a binomial distribution. (This is the one I'm least sure of, my description here may be totally off)

Per read pass: proportion of reads at this point that meet the quality standard. https://github.com/tgen/gvm/blob/eaa664f584b23864244b33d27797e31d12c4ab07/src/gvm.c#L480 this line shows how this is enforced

a/b fraction (the one you're interested in): ab_frac = (a_read_depth + b_read_depth) / v->read_count_pass where read_count_pass is the number of reads at this position that pass the quality standard mentioned above. a_read_depth and b_read_depth are the respective read depths for the A and B alleles at that position.

Hope this clears some stuff up.

kotliary commented 3 years ago

Thank you so much. This is quite clear.

By the 4th column I meant starting from the 1st column - the probability of mapping error. This is the one that was different between two runs. I wonder how the bimodal distribution is calculated - by individual chromosome or using all the reads?

This problem is related to my previous issue #7.

Since my original bam files have "chr" prefix for chromosomes, and contigs starts from chrM, I tried to modify them (and the reference) before running runNormalMetrics.

So, for the first set I did: (1) reheader with samtools, (2) subsampled by bed-file to have less number of reads for further tests, (3) indexed. It changed the contig names but didn't changed their order and didn't removed chrM and exogenous chromosomes from the header. Also, I had to postprocess the chromosome numbers after runNormalMetrics, since their index was shifted by 1 due to chrM having index 0.

For the second set I added processing with picard: (0) reordered the reference and made the dictionary, (1) reheader with samtools, (2) run picard ReorderSam with reordered sequence dictionary, (3) subsampled, (4) indexed.

For the first set, the probability of mapping error was very small, near 0 for the most of positions. But for the second set it was very high - close to 1 for most of the positions. This is worrisome, and I'm trying to understand why. I'm not excluding that I did something wrong. But I compared the reads in the same bam file from two sets and found that the only differences are in reads with RNEXT and PNEXT on chrM or exogenous chromosomes, which became * in set 2.

Also for the second run, the Y chromosome returned the error (others run okay):

[E::fai_retrieve] Failed to retrieve block: unexpected end of file
gvm.c:1881: Failed to load reference genome from index. (err code -1)
 .../scripts/runNormalMetrics.py:28: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  config = yaml.load(f)
.../scripts/runNormalMetrics.py: gvm failed

Don't know what caused it, will continue testing.

I'll attach the idxstats output for the same bam file from two sets. I'd appreciate any suggestion. Thanks again.

idxstat1.txt idxstat2.txt