human-pangenomics / hpp_pangenome_resources

98 stars 3 forks source link

"No reference-sense paths available" in minigraph/cactus graphs #4

Open jdidion opened 2 years ago

jdidion commented 2 years ago

I am trying to use the GRCh38 minigraph/cactus graph to align reads and output them in GRCh38 coordinates. However, I get the following warning when running vg giraffe: warning:[vg::get_sequence_dictionary] No reference-sense paths available in the graph; falling back to generic paths. This is a bit cryptic, but I take it to mean that the reference paths cannot be found in the graph files and so the output will not be in GRCh38 coordinates. Is this the case, and if so is there a fix?

For reference, here is my command line:

vg giraffe \
  -m hprc-v1.0-mc-grch38.min \
  -d hprc-v1.0-mc-grch38.dist \
  -g hprc-v1.0-mc-grch38.gg \
  -H hprc-v1.0-mc-grch38.gbwt \
  -x hprc-v1.0-mc-grch38.xg \
  -f my.fastq.gz \
  -P -o SAM -t 14 \
| samblaster --ignoreUnmated \
| samtools sort -@2 -m 4G -M -o my.bam
jdidion commented 2 years ago

It looks like you just have to pass the reference path names using the --ref-paths of vg giraffe.

For anyone else that runs into this issue, the command to generate these path names is:

vg paths -g hprc-v1.0-mc-grch38.gbwt -L | grep GRCh38 > GRCh38_refpaths.txt

It might be useful to provide the reference paths file for both GRCh38 and CHM13 in the table on this repo's readme page.

glennhickey commented 2 years ago

@jdidion Thanks for the feedback. I imagine this is due to an issue of compatibility with these indexes (which were produced with an older version of vg) and recent versions of vg which are beginning to introduce some new notiions of path metadata.

@adamnovak What's the best thing to do here? Can that warning be ignored? I know that I for one need to updatte minigraph-cactus to use the latest vg, but that won't help the published indexes referenced above.

adamnovak commented 2 years ago

The warning just means that the graph doesn't have any paths that are explicitly marked as representing a reference. If all the paths in the graph are from the same reference anyway, you can just ignore it.

If the graph has paths from multiple references, like both GRCh38 and CHM13, and if neither sample is marked as being "the" reference, the SAM output will freely mix paths from GRCh38 and CHM13 to explain read coordinates. If you don't want that, you would need to use --ref-paths like you show to explain which paths you want the SAM in.

The best argument to --ref-paths is probably an HTSLib sequence dictionary file, which captures path order in the assembly and also total path length independent of how much of the path is covered in the graph; it would be a good idea to post .dict files for the two assemblies used here. @glennhickey Do you have those around?

As describe in https://github.com/vgteam/vg/wiki/VG-GBWT-Subcommand#setting-tags, you can also make a copy of the GBWT that marks the sample you want as a reference. Something like:

vg gbwt --set-tag "reference_samples=GRCh38" -o hprc-v1.0-mc-grch38-grch38ref.gbwt hprc-v1.0-mc-grch38.gbwt

vg might need a way to let you tell Giraffe and vg surject at runtime to just consider a sample a reference and override the file tags, and also a succinct way to use just one particular reference when a file has multiple reference samples stored.