pangenome / pggb

the pangenome graph builder
https://doi.org/10.1038/s41592-024-02430-3
MIT License

Question about changes with parameters with the latest pggb updates #90

Closed brettChapman closed 3 years ago

brettChapman commented 3 years ago

Hi

I've noticed that recent updates changed the default values of a few parameters: max_path_jump=100 (-j), max_edge_jump=0 (-e), block_id_min=0 (-I), and block_ratio_min=0 (-R).

For the -I and -R parameters, is it no longer necessary to apply -I 0.9 -R 0.05, due to the updates to smoothxg?

Were the path and edge jumps reduced so heavily because of observations from the human pangenome work? I've been using 12000 and as high as 15000. My understanding is that lower values would pull in more variation. I'm currently testing -j 100 and -e 0 on a single gene region in barley, but I may not see much of a difference unless I work at the chromosome level. Chromosome-level pggb runs with -p 95 take weeks to complete on the barley pangenome (even longer on chr 7H, due to its high diversity), so it's not feasible to run multiple quick tests. Thanks.

ekg commented 3 years ago

Yes, these were reduced because they appeared to introduce artifacts and make the POA problems harder.

Do use the most recent versions. A bug in consensus merging has forced us to disable it for the time being. The bug was hurting the smoothed graph, not just the consensus graphs.

brettChapman commented 3 years ago

Hi Erik

I'm using an older version, which I haven't updated since I finished building the barley pangenome graph the other month. Since then I've been trying to get the graph into sequenceTubeMap, and I've been running vg chunk across the graph to generate many smaller graphs to query (as sequenceTubeMap cannot handle such large graphs).

I intend to fork PGGB and add vg rna to the pggb script to include transcripts as a splice graph. I'm currently running RNA-seq analysis on each of the genomes and generating a transcript GTF, which I'll embed in the graph using vg rna. I'll update to the latest version once I'm ready to rerun the pipeline. Hopefully the bugs with the consensus graph are resolved by then; otherwise I'll just leave out the consensus.

subwaystation commented 3 years ago

@brettChapman Were you able to resolve all your issues?

brettChapman commented 3 years ago

Thanks @ekg and @subwaystation, yes my issues are resolved. I'm now testing chromosomes 1H and 7H using a min match length of 79 for seqwish and raising my min block length to 3x the segment length (it was 0 before), with path jump at 100, edge jump at 0, n_secondary at 20 (20 genomes), POA parameters of 1,4,6,2,26,1, and block ratio min and block id min at 0. I've also tested using mashmap and producing dotplots to get a feel for where the percent ID should be. I'm using 95% with a segment length of 1 Mbp. I found this was a good choice, as it gives good coverage without breaking up into too many segments. Running mashmap at 98%, I got far more segment breaks across the length of the chromosomes.
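For reference, the settings listed above can be gathered in one place. This is only a rough sketch: the dictionary keys and the `flag_args` helper are hypothetical names chosen for illustration, and only the flags actually named in this thread (-p, -j, -e, -I, -R) are mapped to command-line options. The flags for the remaining settings depend on the pggb version and are deliberately left out.

```python
# Hypothetical summary of the parameter set described in this comment.
# Key names are illustrative; they are not pggb's own option names.
SEGMENT_LENGTH = 1_000_000  # 1 Mbp segment length
N_GENOMES = 20

params = {
    "segment_length": SEGMENT_LENGTH,
    "percent_identity": 95,                  # -p
    "n_secondary": N_GENOMES,                # one per genome
    "seqwish_min_match_length": 79,
    "min_block_length": 3 * SEGMENT_LENGTH,  # 3x the segment length
    "max_path_jump": 100,                    # -j
    "max_edge_jump": 0,                      # -e
    "block_id_min": 0,                       # -I
    "block_ratio_min": 0,                    # -R
    "poa_params": "1,4,6,2,26,1",
}

# Only flags that are named explicitly in this thread are mapped here.
FLAGS = {
    "percent_identity": "-p",
    "max_path_jump": "-j",
    "max_edge_jump": "-e",
    "block_id_min": "-I",
    "block_ratio_min": "-R",
}


def flag_args(settings):
    """Return the flag/value pairs for the settings with a known flag."""
    args = []
    for name, flag in FLAGS.items():
        args += [flag, str(settings[name])]
    return args


print(params["min_block_length"])
print(" ".join(flag_args(params)))
```

Keeping the derived value (min block length) computed from the segment length makes it easy to change the segment length later without forgetting to update the block length.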

I was able to get through most of the PGGB run with these parameters within a few days. It's now been 10 days, and I'm up to the odgi viz step already. This is a significant improvement in run time (it took several weeks before). I'm going to have a look at how many SNPs/indels I get by comparison. I imagine there is an optimal balance somewhere in there between speed and accuracy of the outputs. Depending on how the output looks, I may lower the seqwish min match length or try a different min block length.

I'm also running with these same parameters on the entire genome (all 7 chromosomes) to test and see how far along it gets.

subwaystation commented 3 years ago

You finally got there, congrats! And thanks for using and testing PGGB!