Open jfy133 opened 1 year ago
Hi James et al. :)
My suggestion would be that you take the HPRC CHM13 minigraph-cactus pangenome graph and map your reads against it using `vg giraffe`. As far as I know there are pre-built indices available at https://github.com/human-pangenomics/hpp_pangenome_resources#minigraphcactus. However, they may not be compatible with the current `vg` version. I couldn't find any documentation on which index version fits which `vg` version. Ideally, `vg giraffe` would be able to report reads with low mapping quality, or reads that multimap, because we would have to drop these, too. But I lack experience here.
Hopefully @jeizenga can elaborate more. Else I will bug him in the US personally :P
Note that Jordan will also give a tutorial on read mapping with `vg giraffe` in the US, which might be more up-to-date than your link @jfy133. Will keep you posted.
One question from my side: Is there already a data set for benchmarking, or how would you evaluate this @d4straub?
As far as I know, there are up-to-date indexes at that link. If you want to find multimapping reads, you can use the `-M` argument in `vg giraffe`. If you want low MAPQ reads, you can pipe the results through `vg filter -q {max mapq} -U -`.
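The two suggestions above can be combined into a single pipeline. Below is a minimal Python sketch that only assembles the command line; it does not run `vg`. The `-M` and `vg filter -q {max mapq} -U -` parts come from the comment above. The index flags (`-Z`, `-m`, `-d`), the read input flag (`-f`), all file names, and the values 5 and 30 are assumptions for illustration, so please check them against `vg giraffe --help` for your `vg` version.

```python
import shlex


def giraffe_cmd(gbz, minimizer, dist, fastq, max_multimaps=None):
    """Build a `vg giraffe` argument list.

    The -Z/-m/-d/-f flags are assumed here; verify them with `vg giraffe --help`.
    """
    cmd = ["vg", "giraffe", "-Z", gbz, "-m", minimizer, "-d", dist, "-f", fastq]
    if max_multimaps is not None:
        # -M is the multimapping option mentioned in the thread.
        cmd += ["-M", str(max_multimaps)]
    return cmd


def low_mapq_filter_cmd(max_mapq):
    """Build the `vg filter` call quoted in the thread; '-' reads from stdin."""
    return ["vg", "filter", "-q", str(max_mapq), "-U", "-"]


def pipeline_string(giraffe, vg_filter, out_path):
    """Join both commands into one shell pipeline string."""
    return f"{shlex.join(giraffe)} | {shlex.join(vg_filter)} > {shlex.quote(out_path)}"


if __name__ == "__main__":
    # All file names below are hypothetical placeholders.
    giraffe = giraffe_cmd("graph.gbz", "graph.min", "graph.dist", "reads.fq", max_multimaps=5)
    print(pipeline_string(giraffe, low_mapq_filter_cmd(30), "low_mapq.gam"))
```

Printing the pipeline first makes it easy to inspect the exact command before running it on real data.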
Ah, that's good news, thanks @jeizenga!
@jfy133 @d4straub Do you now know a way forward? Happy to discuss this again in person.
I think we would need to experiment first, so it'll be a process! But these are some very useful first pointers, both for taxprofiler and also mag etc. - thanks both!
One thing I really like about taxprofiler is that it supports long reads, too. Is there anything comparable to giraffe for long reads?
The `vg giraffe` developers are working on long read mapping, but it's not mature or stable yet. There's also `GraphAligner`, which is pretty good for noisy long reads (ONT <= R9, PacBio CLR) but not as great for accurate long reads (ONT >= R10, PacBio HiFi). You could also look at `minigraph`, but it's only appropriate for graphs that have primarily long nodes, i.e. ones that don't include point variation. There are also some experimental long read features with `vg mpmap --nt-type DNA --read-length long`.
To complete the short-read mapper list: https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad320/7160913?login=false. However, the tool can only start from VCF and not from GFA, so you would not be able to use the HPRC graphs. It may be worth monitoring how the tool develops. There is already an open issue: https://github.com/thomas-buechler-ulm/gedmap/issues/1.
Very interesting discussion! I think we do need to evaluate this, but this seems more than just an afternoon of work.
> Is there already a data set for benchmarking, or how would you evaluate this
I haven't researched this, but that is an important question. I stumbled across publications that evaluated de-contamination, but as usual it's either synthetic datasets or the ground truth isn't known, as far as I recall.
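For the synthetic-dataset route, one possible way to score a host-removal step is sketched below. It assumes a simulated library where each read's true origin (host or not) is known; the function name, the input shapes, and the two reported metrics are illustrative choices, not an established benchmark from this thread.

```python
def decontamination_metrics(truth_is_host, removed_reads):
    """Score host-read removal against known labels.

    truth_is_host: dict mapping read name -> True if the read is truly host-derived.
    removed_reads: set of read names that the host-removal step discarded.
    """
    tp = fn = fp = tn = 0
    for name, is_host in truth_is_host.items():
        removed = name in removed_reads
        if is_host and removed:
            tp += 1  # host read correctly removed
        elif is_host:
            fn += 1  # host read that leaked through
        elif removed:
            fp += 1  # non-host read wrongly discarded
        else:
            tn += 1  # non-host read correctly kept
    return {
        # Fraction of host reads removed (higher means less host leakage).
        "host_sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        # Fraction of non-host reads lost (lower means less collateral damage).
        "non_host_loss": fp / (fp + tn) if (fp + tn) else float("nan"),
    }


if __name__ == "__main__":
    # Toy labels: four simulated host reads and two simulated microbial reads.
    truth = {"h1": True, "h2": True, "h3": True, "h4": True, "m1": False, "m2": False}
    removed = {"h1", "h2", "h3", "m1"}
    print(decontamination_metrics(truth, removed))
```

Reporting both numbers matters because a mapper could remove every host read simply by being too permissive and discarding microbial reads as well.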
To help you get started, you can check out @jeizenga's MemPanG23 pangenome read mapping tutorials at https://pangenome.github.io/MemPanG23/#_practical_course_central_time_zone.
Description of feature
As originally raised by @d4straub on Slack, a big issue in metagenomics/microbiome research is insufficient removal of host sequences from libraries, with many public data uploads containing individual-identifiable sequences.
In my opinion (shared with others), one of the biggest causes of this is suboptimal reference genomes which do not capture the whole diversity of host genomes.
One solution to this is to map against reference graphs that can contain SNPs from more than one individual/population.
@subwaystation as a pangenome expert suggests using a pre-computed human reference genome in combination with `vg giraffe` to do the mapping.