tfwillems / HipSTR

Genotype and phase short tandem repeats using Illumina whole-genome sequencing data
GNU General Public License v2.0
94 stars 31 forks source link

Is read group necessary? #68

Closed liuqianhn closed 4 years ago

liuqianhn commented 4 years ago

Hi Developer, I am using your nice tool to detect repeats on short-read data. However, I have the following error. Could you please advise whether read groups/bam-samps is necessary? I only have one sample.

ERROR: Provided BAM/CRAM files don't contain read groups in the header and the --bam-samps flag was not specified

My command is

HipSTR --bams HG001.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.300x.bam --fasta hg38.fa --regions rep4.bed --str-vcf hip_4_NA.vcf.gz > log/hip_4_NA.log
tfwillems commented 4 years ago

Hi @liuqianhn,

In theory, a BAM file can contain reads for many samples. So in default mode, HipSTR requires that each read has RG information, which it then uses to split up reads by samples prior to genotyping.

So unfortunately, even if you only have 1 BAM, it either must have i) RG information for each read or ii) you must use the --bam-samps and --bam-libs options. In your case, using the latter would look something like this: HipSTR --bams HG001.GRCh38_full_plus_hs38d1_analysis_set_minus_alts.300x.bam --fasta hg38.fa --regions rep4.bed --str-vcf hip_4_NA.vcf.gz --log log/hip_4_NA.log --bam-samps HG001 --bam-libs HG001

If you have more than 1 sample, I'd recommend genotyping them all in the same command, as it'll improve the genotypes due to allele discovery and it'll make it easier to compare across samples.

Since you only have 1 sample, I'd also recommend using the default stutter model by specifying the flag --def-stutter-model and you'll likely need to lower the minimum read requirement using something like --min-reads 30

See here for more details on these options

Best, Thomas

liuqianhn commented 4 years ago

Hi @tfwillems , thank you very much for your reply.

May I know a little more about BPDIFFS in the vcf output file? I knew it is "Base pair difference of each alternate allele from the reference allele". I thus guess, if there is no BPDIFFS for some lines, it means that both alleles are same as the reference. Please correct me if I am wrong. Thank you.

tfwillems commented 4 years ago

If there is no BPDIFFS in the INFO field, it's because there are no ALT alleles detected and the ALT column should therefore be ".". This situation occurs if no variation is detected in your samples, which is far more likely to occur if you're only genotyping 1 sample at a time