kcleal / dysgu

Toolkit for calling structural variants using short or long reads
MIT License
88 stars 10 forks source link

clarification needed on RG and samples #77

Closed warthmann closed 7 months ago

warthmann commented 7 months ago

Dear Kez, I am trying to pipe samtools view output to dysgu run, and I am uncertain of the behaviour to expect. It would be nice if you could clarify. I have a multi-sample bam file; 6 samples, each sample with a unique RG (Read Group). Three (3) samples in the bam file are control replicates (3 different RGs) and 3 samples are treatment replicates (another 3 RGs). For SV calling I would like to compare the treatment replicates jointly to the control replicates: 1) calling SVs jointly in all controls 2) calling SVs jointly in all treatment samples, 3) running dysgu filter with treatment vs control.

With the below I am piping all 3 control samples to dysgu, however, I am finding that the resulting svs.vcf contains only one sample labeled with the first RG in the list (sample-names-control). Is this now a joint sample only labeled with the first RG or has dysgu only analysed the first sample. Any help clarifying and/or advice in how to perform the desired operation in another way would be nice.

samtools view -bh -R sample-names-control \
../all_merged.bam Chr04 \
| dysgu run -p8 -v2 -x --clean \
../reference.fa \
./temp \
- \
> svs.vcf
kcleal commented 7 months ago

Hi @warthmann,

I understand what you are trying to do. Looking at the dysgu code, it should be jointly calling the 3 control samples, but the name in the vcf output will be given as the first sample in the the read group list, or the file name. There might also be a warning in the dysgu output stating more than one sample name in the read groups. The relevant code is below:

    if "RG" in infile.header:
        rg = infile.header["RG"]
        if "SM" in rg[0]:
            if len(rg) > 1:
                sms = set([i["SM"] for i in rg])
                if len(sms) > 1:
                    logging.warning("Warning: more than one sample in @RG, using first sample (SM) for output: {}".format(rg[0]["SM"]))
            sample_name = rg[0]["SM"]
        else:
            sample_name = os.path.splitext(os.path.basename(args["sv_aligns"]))[0]
            logging.warning("Warning: no SM tag in @RG (read-group), using input file name as sample name for output: {}".format(sample_name))
    else:
        sample_name = os.path.splitext(os.path.basename(args["sv_aligns"]))[0]
        logging.warning("Warning: no @RG, using input file name as sample name for output: {}".format(sample_name))

Another way to check to see what dysgu has analysed - if you keep the temp files (no --clean flag) and have a look at the .dysgu_reads.bam file (testing a few megabases of your input should be sufficient)

warthmann commented 7 months ago

alright! great! indeed, the warning was there: [WARNING] Warning: more than one sample in @ RG, using first sample (SM) for output: xxxx Also, the .dysgu_reads.bam file in the ./temp directory does contain read mappings from all 3 samples and they kept their respective RGs. samtools samples temp.dysgu_reads.bam A feature request hence will be to introduce an option of specifying a sample name for the joint sample and possibly even edit the RG flags.

warthmann commented 7 months ago

Qnother clarification thing. I am now doing dysgu filter -p8 treatment-svs.vcf control-temp/control-temp.dysgu_reads.bam > treatment-vs-con.vcf I am getting the same warning. with con-2_S1 being one of the control samples in the bam file [WARNING] Warning: more than one sample in @ RG, using first sample (SM) for output: con-2_S1 Is dysgu using all alignments in the control-temp.dysgu_reads.bam file for assessment/filtering or only the ones with RG con-2_S1?

kcleal commented 7 months ago

It will be using all the reads.