vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.08k stars 192 forks source link

Handling repeat regions #2957

Open brettChapman opened 3 years ago

brettChapman commented 3 years ago

Hi

I have a question in regards to how VG handles repeat regions.

My genome is around 80% repeats, based on some papers I've read and my results with Repeat Masker.

I'm working with a 20 genome pangenome as well, which would be roughly the same level of repeats give or take 10%.

I've recently consulted the author of Minimap2, and he recommends I don't align with masked genomes, as it will screw up the alignment.

Cactus on the other hand, handles soft-masked repeats and expects them to be masked this way.

I know from experience that when performing variant calls, its often prudent to hard-mask the genome prior to aligning genomic reads to avoid misalignment and later incorrect variant calls.

If I were to align genomic reads to my 20 pangenome graph, or a simple graph of one genome, should I be masking the genome prior to putting it in my genome graph? If I later intend on aligning short Illumina reads (high and low coverage) say from WGS data, for later variant calls direct from the graph would it be advisable to mask or not to mask? And if I do soft-mask prior to alignment will VG ignore soft-masked regions, or take into consideration the soft-masked regions when aligning?

Thank you for clarification on this.

ekg commented 3 years ago

I wouldn't use soft masking in vg. The alignment algorithms aren't set up to respect that. It's a similar situation to what you found in minimap2.

I honestly wouldn't worry too much about masking repeats. Your reads will get low mapping quality, and you can remove them on that basis. However, some mappings to the repeats will be unique due to particular variants that uniquely identify them, and it'd be a shame to throw out those.

An alternative would be to build the pangenome so that the repeats collapsed. Then basically be aligning your reads against canonical versions of each repeat, with variants embedded. But, the only mapper that should be able to handle this is vg giraffe and I'm not sure it's been tested enough on complex graphs (even though in principle it should work on them). GraphAligner might also be able to, but it won't work well with short reads. vg map will run, but it might be painfully slow.

How are you building the pangenome?

brettChapman commented 3 years ago

Hi Erik

Thanks for your feedback.

I have 20 barley genomes and more to come, all de novo assembled and orientated into pseudomolecules, so no contigs or scaffolds, only chr1...chr7 and chrUn. I'm using Cactus to align all the genomes based on a guide tree (see attached), which will generate a HAL file, and then I'll process with Seqwish and then convert it into VG format. Based on how Cactus runs and on the Cactus paper, it accepts a soft-masked genome, so I've been masking the genomes with Repeat Masker using soft-masking. I assume Cactus still aligns through the soft-masked regions, but treats those regions differently. I'll need to consult with the Cactus team on exactly how it treats soft-masked regions.

We also have many low-coverage and high-coverage reads from other varieties (not assembled) I would like to align to the pangenome graph to identify further variants in those un-assembled varieties. Based on what you said, I could do so without much concern for repeats.

In addition to this we have small separate studies, where I'll likely use minimap2 or GSAlign on a few genomes before pulling them into VG.

Thanks.

pangenome_tree_bootstrap

ekg commented 3 years ago

I'm working on an approach using a fork of mashmap (edyeet) to require long alignments (50-100kb) with a given level of identity between the sequences. The output feeds into seqwish. Then, to compress and reduce local complexity in the graph I apply POA to approximately collinear segments of the graph using smoothxg.

I'll make a repo to document the approach and parameters that seem to work well in my tests. If you get set up for evaluations of the different graphs, I'd be interested in how this works out for you. Happy to talk by email.

brettChapman commented 3 years ago

Thanks. I'm aware of Mashmap. Edyeet looks interesting. Is it possible to specify multiple references and query sequences, like an all to all mapping, or do you need to select 1 reference and align all other query genomes to it, then concat the paf files before feeding into seqwish?

Seeing as it's quite fast, I'd be interested to see how the resulting pangenome graph compares, to say using minimap2 alignments, GSAlign (which is also quite fast and produces a VCF file) or a multiple sequence alignment done using Cactus.

Email works for me, I'd be happy to discuss further once I have some graphs to compare.

ekg commented 3 years ago

The idea is to do all vs all. For 20 genomes it shouldn't take very long to do. Email me, or what's your email?

On Wed, Aug 19, 2020, 09:59 Brett Chapman notifications@github.com wrote:

Thanks. I'm aware of Mashmap. Edyeet looks interesting. Is it possible to specify multiple references and query sequences, like an all to all mapping, or do you need to select 1 reference and align all other query genomes to it, then concat the paf files before feeding into seqwish?

Seeing as it's quite fast, I'd be interested to see how the resulting pangenome graph compares, to say using minimap2 alignments, GSAlign (which is also quite fast and produces a VCF file) or a multiple sequence alignment done using Cactus.

Email works for me, I'd be happy to discuss further once I have some graphs to compare.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/2957#issuecomment-675916989, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABDQEJI7YDCY3Y5WRJAUV3SBOA4ZANCNFSM4QCVXPPA .

brettChapman commented 3 years ago

Would I first concat all the genomes? I assume I'd need to rename all the fasta headers beforehand though to uniquely identify different chromosomes from different varieties in the alignment?

So something like this:

"rename all fasta headers to identify varieties"
cat *.fasta > all.fasta
edyeet -X all.fasta all.fasta > pangenome.paf

Could the same be done with the likes of minimap2 as well for comparison? or GSAlign? GSAlign was specifically developed for same species alignments, so it's something I'd like to test out as well.

My email is brett.chapman@murdoch.edu.au

Thanks.

ekg commented 3 years ago

Yeah, that'd be the idea. The sequences should have unique names, ideally prefixed by some species or individual identifier.

I updated the edyeet documentation to clarify a little the main idea.

When working with minimap2, I found it essential to keep alignments longer than some threshold. For yeast, I found 10kb to be useful. For human genomes, 50kb may be required. In effect, this limit tends to remove short spurious repeats that will cause excessive collapse in the induced variation graph.

With the mashmap algorithm used in edyeet, alignment candidates are found by a scanning minhash approach. The query is broken into fragments of a given length and mappings for these are found above a given approximate threshold of identity given by the jaccard metric over minimizers between the query and target. Here, it's this possible to define a minimum alignment length and identity up front, rather than by filter.

A call like this edyeet -s 10000 -p 85 -a 90 -n 10 -X seq.fa seq.fa >aln.paf will produce base-level alignments for the best 10 mapping candidates with >85% estimated mapping identity and >90% identity during base-level alignment for non-overlapping 10kb windows in the sequence set. Adjusting these four parameters seems to be sufficient to define the scope of the pangenome graph that is constructed.

After construction, I am applying the smoothxg algorithm I mentioned earlier. Other vgteam people are applying this to cactus based graphs to ensure that they are properly collapsed locally. If this POA step works correctly, then we should be able to get MAF out of a sorted GFA using the maffer tool. But that isn't super well tested.

Happy to continue by email for more detail, but I thought this would make a useful public note.

egoltsman commented 3 years ago

Hi, I jumping into this thread because I'm trying to do something similar on my end, and edyeet sounded like too attractive of an option to pass on. I was able so run edyeet on a set of four assemblies, all of chromosome 1 from four individuals of the same species. It produced what seems like a reasonable set of alignments, although the tags in the paf file are different from what I get with minimap2. When I feed this file to seqwish it produces a GFA file that contains no edges, i.e. there's a single segment line for each of my input sequences but no graph as such. Just in case there is an issue with the paf file format I'm including a few lines from it here.

Thanks!

Ref_CHR1        75071545        0       10000   +       Bd1_1_CHR1      76515857        9848    19742   9884    10000   31      id:f:0.999141   ma:i:9884       mm:i:0  ni:i:7  nd:i:10 ns:i:109        ed:i:126        al:i:10010      se:f:0.0125874  cg:Z:18=1I28=1D7=3D34=1D14=1D21=1D35=2D9=1I3=1I28=2I9=1I3=1D6=1I9669=109I
Ref_CHR1        75071545        10000   20000   +       Bd1_1_CHR1      76515857        19851   29757   9905    10000   43      id:f:0.99995    ma:i:9905       mm:i:0  ni:i:0  nd:i:1  ns:i:95 ed:i:96 al:i:10001      se:f:0.00959904 cg:Z:2355=1D7550=95I
Ref_CHR1        75071545        0       10000   +       Bd21_3_CHR1     75545850        3312    13287   9966    10000   27      id:f:0.997847   ma:i:9966       mm:i:4  ni:i:30 nd:i:5  ns:i:0  ed:i:39 al:i:10005      se:f:0.00389805 cg:Z:4=1I7=1D21=1D7=1D7=1D6=1I8=1D12=1I19=1X6=1I6=1I5=1X18=4I1=2I10=1I2=1I7=4I1=1X1=1I2=1I6=1I21=2I19=1I20=1I1=2I1=3I4=1I1060=1X8684=
Ref_CHR1        75071545        20000   30000   +       Bd1_1_CHR1      76515857        29852   39805   9952    10000   40      id:f:0.9999     ma:i:9952       mm:i:1  ni:i:0  nd:i:0  ns:i:47 ed:i:48 al:i:10000      se:f:0.0048     cg:Z:7674=1X2278=47I
Ref_CHR1        75071545        90000   100000  +       Bd1_1_CHR1      76515857        99852   109673  9820    10000   40      id:f:0.999898   ma:i:9820       mm:i:1  ni:i:0  nd:i:0  ns:i:179        ed:i:180        al:i:10000      se:f:0.018      cg:Z:3805=1X6015=179I
Ref_CHR1        75071545        30000   40000   +       Bd1_1_CHR1      76515857        39852   49822   9970    10000   255     id:f:1  ma:i:9970       mm:i:0  ni:i:0  nd:i:0  ns:i:30 ed:i:30 al:i:10000      se:f:0.003      cg:Z:9970=30I
Ref_CHR1        75071545        120000  130000  +       Bd1_1_CHR1      76515857        129849  139848  9999    10000   255     id:f:1  ma:i:9999       mm:i:0  ni:i:0  nd:i:0  ns:i:1  ed:i:1  al:i:10000      se:f:0.0001     cg:Z:9999=1I
Ref_CHR1        75071545        90000   100000  +       Bd21_3_CHR1     75545850        93286   103107  9820    10000   40      id:f:0.999898   ma:i:9820       mm:i:1  ni:i:0  nd:i:0  ns:i:179        ed:i:180        al:i:10000      se:f:0.018      cg:Z:3119=1X6701=179I
Ref_CHR1        75071545        110000  120000  +       Bd1_1_CHR1      76515857        119849  129645  9796    10000   255     id:f:1  ma:i:9796       mm:i:0  ni:i:0  nd:i:0  ns:i:204        ed:i:204        al:i:10000      se:f:0.0204     cg:Z:9796=204I
Ref_CHR1        75071545        130000  140000  +       Bd21_3_CHR1     75545850        133260  143166  9905    10000   43      id:f:0.99995    ma:i:9905       mm:i:0  ni:i:0  nd:i:1  ns:i:95 ed:i:96 al:i:10001      se:f:0.00959904 cg:Z:6551=1D3354=95I
ekg commented 3 years ago

Let's debug this further on the edyeet or seqwish GitHub. Please restart the issue there. I don't immediately see the problem. Are you filtering the alignments by length?

On Fri, Aug 21, 2020, 23:56 Eugene Goltsman notifications@github.com wrote:

Hi, I jumping into this thread because I'm trying to do something similar on my end, and edyeet sounded like too attractive of an option to pass on. I was able so run edyeet on a set of four assemblies, all of chromosome 1 from four individuals of the same species. It produced what seems like a reasonable set of alignments, although the tags in the paf file are different from what I get with minimap2. When I feed this file to seqwish it produces a GFA file that contains no edges, i.e. there's a single segment line for each of my input sequences but no graph as such. Just in case there is an issue with the paf file format I'm including a few lines from it here.

Thanks!

Ref_CHR1 75071545 0 10000 + Bd1_1_CHR1 76515857 9848 19742 9884 10000 31 id:f:0.999141 ma:i:9884 mm:i:0 ni:i:7 nd:i:10 ns:i:109 ed:i:126 al:i:10010 se:f:0.0125874 cg:Z:18=1I28=1D7=3D34=1D14=1D21=1D35=2D9=1I3=1I28=2I9=1I3=1D6=1I9669=109I Ref_CHR1 75071545 10000 20000 + Bd1_1_CHR1 76515857 19851 29757 9905 10000 43 id:f:0.99995 ma:i:9905 mm:i:0 ni:i:0 nd:i:1 ns:i:95 ed:i:96 al:i:10001 se:f:0.00959904 cg:Z:2355=1D7550=95I Ref_CHR1 75071545 0 10000 + Bd21_3_CHR1 75545850 3312 13287 9966 10000 27 id:f:0.997847 ma:i:9966 mm:i:4 ni:i:30 nd:i:5 ns:i:0 ed:i:39 al:i:10005 se:f:0.00389805 cg:Z:4=1I7=1D21=1D7=1D7=1D6=1I8=1D12=1I19=1X6=1I6=1I5=1X18=4I1=2I10=1I2=1I7=4I1=1X1=1I2=1I6=1I21=2I19=1I20=1I1=2I1=3I4=1I1060=1X8684= Ref_CHR1 75071545 20000 30000 + Bd1_1_CHR1 76515857 29852 39805 9952 10000 40 id:f:0.9999 ma:i:9952 mm:i:1 ni:i:0 nd:i:0 ns:i:47 ed:i:48 al:i:10000 se:f:0.0048 cg:Z:7674=1X2278=47I Ref_CHR1 75071545 90000 100000 + Bd1_1_CHR1 76515857 99852 109673 9820 10000 40 id:f:0.999898 ma:i:9820 mm:i:1 ni:i:0 nd:i:0 ns:i:179 ed:i:180 al:i:10000 se:f:0.018 cg:Z:3805=1X6015=179I Ref_CHR1 75071545 30000 40000 + Bd1_1_CHR1 76515857 39852 49822 9970 10000 255 id:f:1 ma:i:9970 mm:i:0 ni:i:0 nd:i:0 ns:i:30 ed:i:30 al:i:10000 se:f:0.003 cg:Z:9970=30I Ref_CHR1 75071545 120000 130000 + Bd1_1_CHR1 76515857 129849 139848 9999 10000 255 id:f:1 ma:i:9999 mm:i:0 ni:i:0 nd:i:0 ns:i:1 ed:i:1 al:i:10000 se:f:0.0001 cg:Z:9999=1I Ref_CHR1 75071545 90000 100000 + Bd21_3_CHR1 75545850 93286 103107 9820 10000 40 id:f:0.999898 ma:i:9820 mm:i:1 ni:i:0 nd:i:0 ns:i:179 ed:i:180 al:i:10000 se:f:0.018 cg:Z:3119=1X6701=179I Ref_CHR1 75071545 110000 120000 + Bd1_1_CHR1 76515857 119849 129645 9796 10000 255 id:f:1 ma:i:9796 mm:i:0 ni:i:0 nd:i:0 ns:i:204 ed:i:204 al:i:10000 se:f:0.0204 cg:Z:9796=204I Ref_CHR1 75071545 130000 140000 + Bd21_3_CHR1 75545850 133260 143166 9905 10000 43 id:f:0.99995 ma:i:9905 mm:i:0 ni:i:0 nd:i:1 ns:i:95 ed:i:96 al:i:10001 se:f:0.00959904 cg:Z:6551=1D3354=95I

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/2957#issuecomment-678533083, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABDQELAOR6XMWN4QWQ3CWLSB3UPNANCNFSM4QCVXPPA .