No sample & read group field in BAM output with vg map

YannDussert commented 4 months ago

1. What were you trying to do? Map short reads on a graph with vg map --surject-to bam

2. What did you want to happen? I'd like to add the sample & read group IDs in the BAM output

3. What actually happened? The sample & read group IDs were absent from the BAM header (there was no @ RG line) or in the read tags.

4. If you got a line like Stack trace path: /somewhere/on/your/computer/stacktrace.txt, please copy-paste the contents of that file here: NA

5. What data and command can the vg dev team use to make the problem happen?

vg map -t 30 -x graph.xg -g graph.gcsa --fastq AV-BH171-f_1P.fq.gz --fastq AV-BH171-f_2P.fq.gz --surject-to bam --sample AV-BH171-f --read-group AV-BH171-f > aln.bam

6. What does running vg version say?

vg version v1.52.0 "Bozen"

Thanks in advance!

melnel000 commented 3 months ago

I am also having the same problem. I picked this up because I tried to convert the BAMs generated by vg surject to CRAMs which gives the following error using samtools:

[W::cram_encode_aux] Missing '@ RG' header for RG "ID:1 LB:lib1 SM:IC_UTH_00023 PL:illumina PU:unit 1"

This is the command I am running: singularity exec -H $(pwd) docker://quay.io/comparative-genomics-toolkit/cactus:v2.6.11 \ bash -c "vg surject -x /cbio/projects/003/databases/minigraph-cactus_pangenomes/CHM13_graph/hprc-v1.1-mc-chm13.gbz \ -G ./hprc-v1.1-mc-chm13.${2}.new.gaf.gz --interleaved -F /cbio/projects/003/databases/minigraph-cactus_pangenomes/CHM13_graph/grch38.paths.txt -b -N ${2} \ -R 'ID:1 LB:lib1 SM:${2} PL:illumina PU:unit1' > hprc-v1.1-mc-chm13.${2}_surject_hg38.bam"

I also tried the latest docker image (docker://quay.io/comparative-genomics-toolkit/cactus:v2.8.0-gpu) and the problem persists.

How do we add the \@ RG line to the BAM file generated by vg surject? Is it possible for vg surject to output in CRAM format?

Thanks for the help

Kind regards, Melissa

jltsiren commented 3 months ago

If it's only a header issue, you can fix it easily with samtools. Extract the header with samtools view -H, update it as necessary, and apply the new header with samtools reheader.

jltsiren commented 3 months ago

As for the actual issue, what does the header look like? These commands should include sample name and read group in the header, and the relevant code path should include them in the output, at least by superficial inspection.

melnel000 commented 3 months ago

Thanks for the feedback. I also have BAM files generated with BWA. Here the header contains 1 \@ HD line, followed by multiple \@ SQ lines, then 1 \@ RG line followed by multiple \@ PG lines before the records begin:

[lines before]
\@SQ     SN:HLA-DRB1*16:02:01    LN:11005        AH:*    M5:4a972df76bd3ee2857b87bd5be5ea00a     UR:/cbio/projects/003/charlton/nextflow/samrc_wes/align/nextflow-work/d0/651e1eadc049faa2b86abc2a1a3670/Homo_sapiens_assembly38.fasta
\@RG     ID:0.0  LB:LIBA SM:NRG60.1CAN   PL:Illumina
\@PG     ID:bwa  CL:/opt/sentieon-genomics-202112.07/libexec/bwa mem -R \@RG\tID:0.0\tLB:LIBA\tSM:NRG60.1CAN\tPL:Illumina -t 32 -K 100000000 -Y Homo_sapiens_assembly38.fasta V350190975_A_L01_103_1.fq.gz V350190975_A_L01_103_2.fq.gz   PN:bwa  VN:0.7.17-r1188
[lines after]

Here is an example of the BAM header in BAM files output by vg surject which lack the \@ RG line:

[lines before]
\@SQ     SN:GRCh38#0#chrY_KI270740v1_random      LN:37240        M5:69e42252aead509bf56f1ea6fda91405     UR:file:/cbio/projects/003/melissa/h3abionet_refgraph/GRCh38.fa
\@PG     ID:samtools     PN:samtools     VN:1.11 CL:/home/cactus/bin/samtools reheader /cbio/projects/003/databases/minigraph-cactus_pangenomes/CHM13_graph/GRCh38.dict ./hprc-v1.1-mc-chm13.IC_UTH_00023_surject_hg38.bam
\@PG     ID:samtools.1   PN:samtools     PP:samtools     VN:1.11 CL:/home/cactus/bin/samtools sort -O BAM -o ./hprc-v1.1-mc-chm13.IC_UTH_00023_surject_hg38_sort.bam --threads 8 ./hprc-v1.1-mc-chm13.IC_UTH_00023_surject_hg38.bam
[lines after]

I also tried to output in CRAM format directly from vg surject by swopping the -b flag with the -c flag:

singularity exec -H $(pwd) docker://quay.io/comparative-genomics-toolkit/cactus:v2.6.11
bash -c "vg surject -x /cbio/projects/003/databases/minigraph-cactus_pangenomes/CHM13_graph/hprc-v1.1-mc-chm13.gbz
-G ./hprc-v1.1-mc-chm13.${2}.new.gaf.gz --interleaved -F /cbio/projects/003/databases/minigraph-cactus_pangenomes/CHM13_graph/grch38.paths.txt -c -N ${2}
-R 'ID:1 LB:lib1 SM:${2} PL:illumina PU:unit1' > hprc-v1.1-mc-chm13.${2}_surject_hg38.bam"

But then I get the following error:

INFO: Using cached SIF image [E::cram_get_ref] Failed to populate reference for id 0 [vg::HTSWriter] error: failed to write the SAM header

Is there anything else I need to add to the command to output in CRAM format? Thanks for the suggestion to edit the BAM header- that could also be an option.

jltsiren commented 3 months ago

When you use vg surject or map directly to htslib formats with options -N X -R Y, vg creates a SAM header with the following line:

@RG     ID:Y  SM:X

The header is then passed to htslib, which parses it and returns one of its internal data structures. @melnel000, you seem to be providing all tags of the RG line as the read group identifier, and its plausible that the information may not survive the transformations intact.

Another possible reason is that you are mapping initially to GAF format. GAF is a minimalistic format, and there is no standardized way of storing sample names or read groups.

As for CRAM output, CRAM is a reference-based format, and htslib needs reference fasta to output it. htslib apparently uses environment variables REF_PATH and REF_CACHE to locate them, but I'm not familiar with the specifics. Anyway, CRAM output is not included in vg tests, so it's possible that vg cannot actually output CRAM.

vgteam / vg

No sample & read group field in BAM output with vg map #4222