Open YannDussert opened 4 months ago
I am also having the same problem. I picked this up because I tried to convert the BAMs generated by vg surject to CRAMs which gives the following error using samtools:
[W::cram_encode_aux] Missing '@ RG' header for RG "ID:1 LB:lib1 SM:IC_UTH_00023 PL:illumina PU:unit 1"
This is the command I am running: singularity exec -H $(pwd) docker://quay.io/comparative-genomics-toolkit/cactus:v2.6.11 \ bash -c "vg surject -x /cbio/projects/003/databases/minigraph-cactus_pangenomes/CHM13_graph/hprc-v1.1-mc-chm13.gbz \ -G ./hprc-v1.1-mc-chm13.${2}.new.gaf.gz --interleaved -F /cbio/projects/003/databases/minigraph-cactus_pangenomes/CHM13_graph/grch38.paths.txt -b -N ${2} \ -R 'ID:1 LB:lib1 SM:${2} PL:illumina PU:unit1' > hprc-v1.1-mc-chm13.${2}_surject_hg38.bam"
I also tried the latest docker image (docker://quay.io/comparative-genomics-toolkit/cactus:v2.8.0-gpu) and the problem persists.
How do we add the \@ RG line to the BAM file generated by vg surject? Is it possible for vg surject to output in CRAM format?
Thanks for the help
Kind regards, Melissa
If it's only a header issue, you can fix it easily with samtools. Extract the header with samtools view -H
, update it as necessary, and apply the new header with samtools reheader
.
As for the actual issue, what does the header look like? These commands should include sample name and read group in the header, and the relevant code path should include them in the output, at least by superficial inspection.
Thanks for the feedback. I also have BAM files generated with BWA. Here the header contains 1 \@ HD line, followed by multiple \@ SQ lines, then 1 \@ RG line followed by multiple \@ PG lines before the records begin:
[lines before]
\@SQ SN:HLA-DRB1*16:02:01 LN:11005 AH:* M5:4a972df76bd3ee2857b87bd5be5ea00a UR:/cbio/projects/003/charlton/nextflow/samrc_wes/align/nextflow-work/d0/651e1eadc049faa2b86abc2a1a3670/Homo_sapiens_assembly38.fasta
\@RG ID:0.0 LB:LIBA SM:NRG60.1CAN PL:Illumina
\@PG ID:bwa CL:/opt/sentieon-genomics-202112.07/libexec/bwa mem -R \@RG\tID:0.0\tLB:LIBA\tSM:NRG60.1CAN\tPL:Illumina -t 32 -K 100000000 -Y Homo_sapiens_assembly38.fasta V350190975_A_L01_103_1.fq.gz V350190975_A_L01_103_2.fq.gz PN:bwa VN:0.7.17-r1188
[lines after]
Here is an example of the BAM header in BAM files output by vg surject which lack the \@ RG line:
[lines before]
\@SQ SN:GRCh38#0#chrY_KI270740v1_random LN:37240 M5:69e42252aead509bf56f1ea6fda91405 UR:file:/cbio/projects/003/melissa/h3abionet_refgraph/GRCh38.fa
\@PG ID:samtools PN:samtools VN:1.11 CL:/home/cactus/bin/samtools reheader /cbio/projects/003/databases/minigraph-cactus_pangenomes/CHM13_graph/GRCh38.dict ./hprc-v1.1-mc-chm13.IC_UTH_00023_surject_hg38.bam
\@PG ID:samtools.1 PN:samtools PP:samtools VN:1.11 CL:/home/cactus/bin/samtools sort -O BAM -o ./hprc-v1.1-mc-chm13.IC_UTH_00023_surject_hg38_sort.bam --threads 8 ./hprc-v1.1-mc-chm13.IC_UTH_00023_surject_hg38.bam
[lines after]
I also tried to output in CRAM format directly from vg surject by swopping the -b flag with the -c flag:
singularity exec -H $(pwd) docker://quay.io/comparative-genomics-toolkit/cactus:v2.6.11
bash -c "vg surject -x /cbio/projects/003/databases/minigraph-cactus_pangenomes/CHM13_graph/hprc-v1.1-mc-chm13.gbz
-G ./hprc-v1.1-mc-chm13.${2}.new.gaf.gz --interleaved -F /cbio/projects/003/databases/minigraph-cactus_pangenomes/CHM13_graph/grch38.paths.txt -c -N ${2}
-R 'ID:1 LB:lib1 SM:${2} PL:illumina PU:unit1' > hprc-v1.1-mc-chm13.${2}_surject_hg38.bam"
But then I get the following error:
INFO: Using cached SIF image [E::cram_get_ref] Failed to populate reference for id 0 [vg::HTSWriter] error: failed to write the SAM header
Is there anything else I need to add to the command to output in CRAM format? Thanks for the suggestion to edit the BAM header- that could also be an option.
When you use vg surject
or map directly to htslib formats with options -N X -R Y
, vg creates a SAM header with the following line:
@RG ID:Y SM:X
The header is then passed to htslib, which parses it and returns one of its internal data structures. @melnel000, you seem to be providing all tags of the RG line as the read group identifier, and its plausible that the information may not survive the transformations intact.
Another possible reason is that you are mapping initially to GAF format. GAF is a minimalistic format, and there is no standardized way of storing sample names or read groups.
As for CRAM output, CRAM is a reference-based format, and htslib needs reference fasta to output it. htslib apparently uses environment variables REF_PATH
and REF_CACHE
to locate them, but I'm not familiar with the specifics. Anyway, CRAM output is not included in vg tests, so it's possible that vg cannot actually output CRAM.
1. What were you trying to do? Map short reads on a graph with vg map --surject-to bam
2. What did you want to happen? I'd like to add the sample & read group IDs in the BAM output
3. What actually happened? The sample & read group IDs were absent from the BAM header (there was no @ RG line) or in the read tags.
4. If you got a line like
Stack trace path: /somewhere/on/your/computer/stacktrace.txt
, please copy-paste the contents of that file here: NA5. What data and command can the vg dev team use to make the problem happen?
6. What does running
vg version
say?Thanks in advance!