Consider switching host removal to graph based reference

nf-core / taxprofiler

Highly parallelised multi-taxonomic profiling of shotgun short- and long-read metagenomic data

https://nf-co.re/taxprofiler

MIT License

Consider switching host removal to graph based reference #297

Open jfy133 opened 1 year ago

jfy133 commented 1 year ago

Description of feature

As originally raised by @d4straub on Slack, a big issue in metagenomics/microbiome research is insufficient removal of host sequences from libraries, with many public data uploads containing individually identifiable sequences.

In my opinion (shared with others), one of the biggest causes of this is suboptimal reference genomes that do not capture the full genetic diversity of the host species.

One solution is to map against reference graphs that can contain SNPs from more than one individual or population.

@subwaystation as a pangenome expert suggests using a pre-computed human reference genome in combination with vg giraffe to do the mapping.

jfy133 commented 1 year ago

https://github.com/vgteam/vg/wiki/Mapping-short-reads-with-Giraffe#mapping-with-vg-giraffe

subwaystation commented 1 year ago

Hi James et al. :)

My suggestion would be that you take the HPRC CHM13 minigraph-cactus pangenome graph and map your reads against it using vg giraffe. As far as I know, there are pre-built indices available at https://github.com/human-pangenomics/hpp_pangenome_resources#minigraphcactus. However, they may not be compatible with the current vg version. I couldn't find any documentation on which index version fits which vg version. Ideally, vg giraffe would be able to report reads with low mapping quality, or reads that multimap, because we would have to drop these, too. But I lack experience here.

Hopefully @jeizenga can elaborate more. Else I will bug him in the US personally :P Note that Jordan will also give a tutorial on read mapping with vg giraffe in the US, which might be more up-to-date than your link @jfy133. Will keep you posted.

One question from my side: Is there already a data set for benchmarking, or how would you evaluate this @d4straub?

jeizenga commented 1 year ago

As far as I know, there are up-to-date indexes at that link. If you want to find multimapping reads, you can use the -M argument in vg giraffe. If you want low MAPQ reads, you can pipe the results through vg filter -q {max mapq} -U -.
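
A minimal sketch of how the two commands mentioned above could be chained, written as Python helpers that only build the command line (nothing is executed). Only `-M`, `-q`, `-U` and the trailing `-` come from this thread. The other `vg giraffe` flags (`-Z`, `-m`, `-d`, `-f`, `-t`) are recalled from the vg help text and should be checked against `vg giraffe --help` for the installed version. The file names, the multimap limit of 10 and the MAPQ threshold of 30 are hypothetical placeholders.

```python
import shlex


def giraffe_cmd(gbz, minimizer, dist, fastq, max_multimaps=10, threads=4):
    """Build a `vg giraffe` call that reports up to `max_multimaps` alignments per read.

    Assumption: -Z/-m/-d/-f/-t are the GBZ graph, minimizer index, distance
    index, FASTQ input and thread count; verify with `vg giraffe --help`.
    """
    return [
        "vg", "giraffe",
        "-Z", gbz,
        "-m", minimizer,
        "-d", dist,
        "-f", fastq,
        "-M", str(max_multimaps),  # multimapping option named in the thread
        "-t", str(threads),
    ]


def low_mapq_filter_cmd(mapq_threshold):
    """Build the `vg filter -q {max mapq} -U -` call from the thread.

    -q sets a MAPQ filter and -U inverts it, so the reads kept are the
    low-MAPQ ones. The trailing "-" reads alignments from stdin.
    """
    return ["vg", "filter", "-q", str(mapq_threshold), "-U", "-"]


def as_shell_pipeline(*cmds):
    """Join several argument lists into one shell pipeline string."""
    return " | ".join(shlex.join(cmd) for cmd in cmds)


# Hypothetical index and read file names, for illustration only.
pipeline = as_shell_pipeline(
    giraffe_cmd("hprc.gbz", "hprc.min", "hprc.dist", "reads.fq"),
    low_mapq_filter_cmd(30),
)
print(pipeline)
```

The printed string can be pasted into a shell once the flags have been confirmed. Keeping the command construction separate from execution makes it easy to swap thresholds while experimenting.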

subwaystation commented 1 year ago

Ah, that's good news, thanks @jeizenga!

@jfy133 @d4straub Do you now know a way forward? Happy to discuss this again in person.

jfy133 commented 1 year ago

I think we would need to experiment first, so it'll be a process! But these are some very useful first pointers, both for taxprofiler and also mag etc. - thanks both!

Midnighter commented 1 year ago

One thing I really like about taxprofiler is that it supports long reads, too. Is there anything comparable to giraffe for long reads?

jeizenga commented 1 year ago

The vg giraffe developers are working on long read mapping, but it's not mature or stable yet. There's also GraphAligner, which is pretty good for noisy long reads (ONT <= R9, PacBio CLR) but not as great for accurate long reads (ONT >= R10, PacBio HiFi). You could also look at minigraph, but it's only appropriate for graphs that have primarily long nodes, i.e. ones that don't include point variation. There are also some experimental long read features with vg mpmap --nt-type DNA --read-length long.

subwaystation commented 1 year ago

To complete the short-read mapper list: https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad320/7160913?login=false. However, the tool can only start from VCF and not from GFA, so you would not be able to use the HPRC graphs. It may be worth monitoring how the tool develops. There is already an open issue: https://github.com/thomas-buechler-ulm/gedmap/issues/1.

d4straub commented 1 year ago

Very interesting discussion! I think we do need to evaluate this, but this seems more than just an afternoon of work.

> Is there already a data set for benchmarking, or how would you evaluate this

I haven't researched this, but that is an important question. I stumbled across publications that evaluated de-contamination, but as usual it's either synthetic datasets or the ground truth isn't known, as far as I recall.
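
For the synthetic-dataset case, where the origin of every read is known, one possible way to score a host-removal step is sketched below. This is an assumption about how such a benchmark could be set up, not an established protocol from this thread. The function name, the metric names and the read IDs are hypothetical.

```python
def host_removal_metrics(host_reads, removed_reads, all_reads):
    """Score a host-removal run against known read origins.

    host_reads:    IDs of reads truly derived from the host (ground truth)
    removed_reads: IDs of reads the tool flagged as host and removed
    all_reads:     IDs of every read in the synthetic library
    """
    host = set(host_reads)
    removed = set(removed_reads)
    non_host = set(all_reads) - host

    host_removed = len(host & removed)    # true positives
    host_left = len(host - removed)       # host reads that slipped through
    non_host_removed = len(non_host & removed)  # microbial reads lost

    return {
        # Fraction of host reads that were removed.
        "sensitivity": host_removed / len(host) if host else float("nan"),
        # Host reads remaining in the "cleaned" library (the privacy risk).
        "host_reads_left": host_left,
        # Fraction of non-host reads wrongly discarded.
        "non_host_loss": non_host_removed / len(non_host) if non_host else float("nan"),
    }


# Toy example: 4 host reads, 6 microbial reads, tool removes 3 host + 1 microbial.
all_ids = ["h1", "h2", "h3", "h4", "m1", "m2", "m3", "m4", "m5", "m6"]
metrics = host_removal_metrics(
    host_reads=["h1", "h2", "h3", "h4"],
    removed_reads=["h1", "h2", "h3", "m1"],
    all_reads=all_ids,
)
print(metrics)
```

Reporting the number of host reads left alongside the loss of non-host reads would let a linear reference and a graph reference be compared on the same simulated library.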

subwaystation commented 1 year ago

To help you get started, you can check out @jeizenga's MemPanG23 pangenome read mapping tutorials at https://pangenome.github.io/MemPanG23/#_practical_course_central_time_zone.