marschall-lab / panacus

Panacus is a tool for computing statistics for GFA-formatted pangenome graphs
MIT License
77 stars 4 forks source link

Ordered growth for PGGB gfa file #33

Open OZTaekOppa opened 1 month ago

OZTaekOppa commented 1 month ago

Dear Panacus,

Thank you for developing such a great program.

I’m interested in testing the "Ordered Pangenome Growth Statistics" for a PGGB GFA file. However, the manual only provides guidance for minigraph-cactus. I attempted to run the "Ordered Pangenome Growth Statistics" for the PGGB GFA file after making a few modifications, but I wasn’t successful.

Here is the script I executed:

Step 1a: Establish order of samples in the growth statistics

echo 'HG01891 HG02109 HG02145 HG02257 HG02486 HG02559 HG02572 HG02622 HG02630 HG02717 HG02723 HG02818 HG02886 HG03098 HG03453 HG03486 HG03516 HG03540 HG03579 NA18906 NA19240 NA21309 HG00733 HG00735 HG00741 HG01071 HG01106 HG01109 HG01123 HG01175 HG01243 HG01358 HG01361 HG01928 HG01952 HG01978 HG02055 HG02148 NA20129 HG01258 NA24385 HG03492 HG00438 HG00621 HG00673 HG02080 NA24631 N002580 N002639 N003302 N006906 N006942 N008032 N008291 N008292 N008293 N008294' | tr ' ' '\n' > ${OUTPUT_DIR}/pggb_hprc.order.samples.txt

Step 1b: Exclude paths from reference genome CHM13 and GRCH38

grep '^P' ${INPUT_GFA} | cut -f2 | grep -ve 'chm13|grch38' > ${OUTPUT_DIR}/${BASEFILE}.pggb.path.txt

Step 2: panacus histgrowth to calculate coverage and pangenome growth for nodes

RUST_LOG=info panacus ordered-histgrowth -c bp -t6 -l 25,20,5,3,2 -S -e ${OUTPUT_DIR}/${BASEFILE}.pggb.path.txt ${INPUT_GFA} > ${OUTPUT_DIR}/${BASEFILE}.pggb.ordhstgrw.bp.tsv

Step 3: Visualize coverage histogram and pangenome growth curve with estimated growth parameters

panacus-visualize -e ${OUTPUT_DIR}/${BASEFILE}.pggb.ordhstgrw.bp.tsv > ${OUTPUT_DIR}/${BASEFILE}.pggb.ordhstgrw.bp.pdf

conda deactivate

Here is the log:

Is there a way to successfully run the "Ordered Pangenome Growth Statistics" for a PGGB GFA file?

Any feedback would be greatly appreciated!

Cheers,

Taek

heringerp commented 1 month ago

Hi Taek,

what exactly isn't working for you? Is the .tsv outputted by panacus ordered-histgrowth fine?

If you want to exclude the reference sequences from the graph you can create the paths file in the following way:

grep '^P' ${INPUT_GFA} | cut -f2 | grep -iE 'chm13|grch38' > ${OUTPUT_DIR}/${BASEFILE}.pggb.path.txt

(use grep -iE instead of grep -ve, since only paths that should be excluded need to be listed)

The SyntaxWarnings for the python script can be ignored, they are only warnings and do not affect the result (still #34 should fix them).

Best regards Peter

OZTaekOppa commented 1 month ago

Hi Peter,

Thank you for your prompt reply.

I followed your directions as outlined below.

Executed command (for PGGB gfa file): qsub -o /g/data/te53/t2t2024/logs/ -v INPUT_GFA=/community.1.fa.out/community.1.fa.smooth.final.gfa ./panacus_pggb.sh

Panacus_pggb.sh:

Step 1a: Establish the order of samples in the growth statistics

echo 'HG01891 HG02109 HG02145 HG02257 HG02486 HG02559 HG02572 HG02622 HG02630 HG02717 HG02723 HG02818 HG02886 HG03098 HG03453 HG03486 HG03516 HG03540 HG03579 NA18906 NA19240 NA21309 HG00733 HG00735 HG00741 HG01071 HG01106 HG01109 HG01123 HG01175 HG01243 HG01358 HG01361 HG01928 HG01952 HG01978 HG02055 HG02148 NA20129 HG01258 NA24385 HG03492 HG00438 HG00621 HG00673 HG02080 NA24631 N002580 N002639 N003302 N006906 N006942 N008032 N008291 N008292 N008293 N008294' | tr ' ' '\n' > ${OUTPUT_DIR}/pggb_hprc.order.samples.txt

Step 1b: Exclude paths from reference genome CHM13 and GRCH38

grep '^P' ${INPUT_GFA} | cut -f2 | grep -iE 'chm13|grch38' > ${OUTPUT_DIR}/${BASEFILE}.pggb.path.txt

Step 2: panacus histgrowth to calculate coverage and pangenome growth for nodes

RUST_LOG=info panacus ordered-histgrowth -c bp -t6 -l 20,20,10,3,2 -S -e ${OUTPUT_DIR}/${BASEFILE}.pggb.path.txt ${INPUT_GFA} > ${OUTPUT_DIR}/${BASEFILE}.pggb.ordhstgrw.bp.tsv

Step 3: Visualize coverage histogram and pangenome growth curve with estimated growth parameters

panacus-visualize -e ${OUTPUT_DIR}/${BASEFILE}.pggb.ordhstgrw.bp.tsv > ${OUTPUT_DIR}/${BASEFILE}.pggb.ordhstgrw.bp.pdf

While Step 1a generated {OUTPUT_DIR}/pggb_hprc.order.samples.txt, the “ordered-growth plot for #bp” still shows the X-axis as consecutive numbers (0 ~ 57), instead of reflecting the sample order.

Also, I would like to confirm the correct usage of the -l 20,20,10,3,2 option in Step 2. If {OUTPUT_DIR}/pggb_hprc.order.samples.txt is set correctly, I expect to see the growth for the last 10 samples (N002580 ~ N008294). Was my option correct for reflecting this growth?

Looking forward to your feedback!

Kind regards,

Taek

heringerp commented 1 month ago

Hi Taek,

hmm, for me the plot shows the sample names on the x-axis. What version of panacus are you using?

With regards to the sample order: if you want to use the order of the order.samples.txt file you need to specify it to panacus ordered-histgrowth with the -O parameter. The -l option sets the coverage values used for the different bars, i.e. it will show bars for coverage 20 (twice), coverage 10, coverage 3 and coverage 2. This means that in this calculation only basepairs are included that are visited by at least 20 haplotypes (or 10, 3, 2). See also for an explanation. The growth will still be shown for all samples not just the last 10.

Best regards Peter

OZTaekOppa commented 4 weeks ago

Hi Peter,

FYI, I am using the latest conda version (v0.2.3).

I tried following your suggestions, but something seems off.

Executed script:

Ordered Histgrowth

Step 1: Establish the order of samples in the growth statistics (ARF, AMR, EUR, SAS, EAS, NCIG)

echo 'HG01891 HG02109 HG02145 HG02257 HG02486 HG02559 HG02572 HG02622 HG02630 HG02717 HG02723 HG02818 HG02886 HG03098 HG03453 HG03486 HG03516 HG03540 HG03579 NA18906 NA19240 NA21309 HG00733 HG00735 HG00741 HG01071 HG01106 HG01109 HG01123 HG01175 HG01243 HG01358 HG01361 HG01928 HG01952 HG01978 HG02055 HG02148 NA20129 HG01258 NA24385 HG03492 HG00438 HG00621 HG00673 HG02080 NA24631 N002580 N002639 N003302 N006906 N006942 N008032 N008291 N008292 N008293 N008294' | tr ' ' '\n' > ${OUTPUT_DIR}/pggb_hprc.order.samples.txt

Step 2: Establish the exclude of samples in the growth statistics

echo 'chm13 grch38' | tr ' ' '\n' > ${OUTPUT_DIR}/pggb_hprc.exculde.samples.txt

Step 3: Exclude paths from reference genome CHM13 and GRCH38

grep '^P' ${INPUT_GFA} | cut -f2 | grep -iE 'chm13|grch38' > ${OUTPUT_DIR}/${BASEFILE}.pggb.path.txt

Step 4: panacus histgrowth to calculate coverage and pangenome growth for nodes

RUST_LOG=info panacus ordered-histgrowth -c bp -t6 -l 5,10,20,15,5 -S -e ${OUTPUT_DIR}/pggb_hprc.exculde.samples.txt -O ${OUTPUT_DIR}/pggb_hprc.order.samples.txt ${OUTPUT_DIR}/${BASEFILE}.pggb.path.txt ${INPUT_GFA} > ${OUTPUT_DIR}/${BASEFILE}.pggb.ordhstgrw.bp.tsv

Step 5: Visualize coverage histogram and pangenome growth curve with estimated growth parameters

panacus-visualize -e ${OUTPUT_DIR}/${BASEFILE}.pggb.ordhstgrw.bp.tsv > ${OUTPUT_DIR}/${BASEFILE}.pggb.ordhstgrw.bp.pdf

It worked fine up to Step 3, but Step 4 failed. After reading more about the "ordered-histgrowth" function, I tried a few variations.

• Command: -c bp -t6 -l 5,10,20,15,5 -S -e o The pggb.ordhstgrw.bp.tsv was generated, but the plot didn’t reflect the ordered names (X-axis) from Step 1.

• Command: -c bp -t6 -l 5,10,20,15,5 -S -e -O ${OUTPUT_DIR}/pggb_hprc.order.samples.txt o The program complained about the -e option. o Error: error: a value is required for '--exclude ' but none was supplied

• Command: -c bp -t6 -l 5,10,20,15,5 -S -O ${OUTPUT_DIR}/pggb_hprc.order.samples.txt o Error: error: unexpected argument 'community.1.fa.out/community.1.smooth.final.gfa' found

• Command: -c bp -t6 -l 5,10,20,15,5 -S -e ${OUTPUT_DIR}/pggb_hprc.exclude.samples.txt -O ${OUTPUT_DIR}/pggb_hprc.order.samples.txt o Same error as above: o Error: error: unexpected argument 'community.1.fa.out/community.1.smooth.final.gfa' found

Any suggestions?

Cheers,

Taek

heringerp commented 4 weeks ago

Hi Taek,

okay, so the plot showing numbers instead of the sample names is an issue that is fixed in the version on GitHub, however the fix is not yet included in the most recent version on conda.

For the command in step 4 try something like:

RUST_LOG=info panacus ordered-histgrowth -c bp -t6 -l 5,10,20,15,5 -S -e ${OUTPUT_DIR}/${BASEFILE}.pggb.path.txt -O ${OUTPUT_DIR}/pggb_hprc.order.samples.txt ${INPUT_GFA} > ${OUTPUT_DIR}/${BASEFILE}.pggb.ordhstgrw.bp.tsv

Best regards Peter

OZTaekOppa commented 4 weeks ago

Hi Peter,

Thanks so much for your help.

Unfortunately, the issue persists. The script runs fine, but the plot still shows numbers instead of the sample names from pggb_hprc.order.samples.txt.

Should I try reinstalling the program from scratch via GitHub, or could you update the Conda version to fix the issue?

What do you think?

Cheers,

Taek

heringerp commented 4 weeks ago

Hi Taek,

unfortunately, I don't think that I can update the conda version of the package, maybe @lucaparmigiani or @danydoerr can help?

In the meantime installing from scratch via GitHub should fix the issue for you.

Best regards Peter

OZTaekOppa commented 4 weeks ago

Hi Peter,

Thank you for your reply.

I tried these two commands after installing it via the "From binary release" option.

Command 1:

Step 4: panacus histgrowth to calculate coverage and pangenome growth for nodes

RUST_LOG=info ${PANACUS} ordered-histgrowth -c bp -t6 -l 5,10,20,15,5 -S -e ${OUTPUT_DIR}/pggb_hprc.exculde.samples.txt -O ${OUTPUT_DIR}/pggb_hprc.order.samples.txt ${OUTPUT_DIR}/${BASEFILE}.pggb.path.txt ${INPUT_GFA} > ${OUTPUT_DIR}/${BASEFILE}.pggb.ordhstgrw.bp.tsv

Warning: /panacus-0.2.3/bin/panacus-visualize -e /panacus/PGGB/Ordered/community.11.fa.smooth.final.pggb.ordhstgrw.bp.tsv /apps/python3/3.9.2/lib/python3.9/site-packages/scipy/init.py:138: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.26.4) warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion} is required for this version of "

Command 2:

Step 4: panacus histgrowth to calculate coverage and pangenome growth for nodes

RUST_LOG=info ${PANACUS} ordered-histgrowth -c bp -t6 -l 5,10,20,15,5 -S -e ${OUTPUT_DIR}/${BASEFILE}.pggb.path.txt -O ${OUTPUT_DIR}/pggb_hprc.order.samples.txt ${INPUT_GFA} > ${OUTPUT_DIR}/${BASEFILE}.pggb.ordhstgrw.bp.tsv

Error: error: unexpected argument 'community.11.fa.out/community.1.smooth.final.gfa' found

Command 1 generated a plot, but the X-axis names are still numbers. It seems the program is not properly recognising the pggb_hprc.order.samples.txt. The only warning message was about the NumPy version.

Command 2 failed instantly and didn’t even generate the smooth.final.pggb.ordhstgrw.bp.tsv file. The error message was the same as before.

It would be really helpful if the Panacus team could fix this bug for the research community.

Kind regards,

Taek 

heringerp commented 3 weeks ago

Hi Taek,

could you try the "From repository" option? The binary release is unfortunately the same version as the conda version.

Best regards Peter

OZTaekOppa commented 3 weeks ago

Hi Peter,

This time was more productive.

I tried both Command 1 and Command 2, and only Command 2 worked. I used -l 2,4,6,15,10, and the plot displayed different color bar patterns nicely. I hope this is the correct way to visualize the plot—please let me know if it's not.

image

Also, it would be helpful for others if the manual provided more clarity on this visualisation issue.

Once again, thank you for your help!

Cheers,

Taek

heringerp commented 3 weeks ago

Hi Taek,

yes, that looks good.

Yes, there is currently work on a new release done, so that should fix the problem also for all other versions of panancus.

Best regards Peter

OZTaekOppa commented 2 weeks ago

Hi Peter,

I have a quick question.

After generating each community growth plot, I merged all the GFA files from PGGB into a single global GFA file to create a global genome growth plot. However, when I tried to use the same Panacus command for the ordered growth plot, I encountered an error.

[2024-08-30T08:04:01Z INFO panacus::graph] constructing indexes for node/edge IDs, node lengths, and P/W lines.. [2024-08-30T08:04:01Z INFO panacus::io] loading graph from /g/data/chrAll_ncig_hprc.smooth.final.gfa thread 'main' panicked at src/graph.rs:330:21: Segment with ID 1 occurs multiple times in GFA note: run with RUST_BACKTRACE=1 environment variable to display a backtrace

Do you have any suggestions for resolving this?

Thanks,

Taek

danydoerr commented 2 weeks ago

Concatenating GFA files doesn't work, because you will end up with different segments being assigned the same ID, and that leads to an invalid graph. You need to use odgi squeeze or vg combine for that.

OZTaekOppa commented 1 week ago

@danydoerr

Thank you for getting back to me.

Here’s what I’ve done so far:

  1. Merged individual GFA files (generated from PGGB) using vg combine with and without the -p option: vg combine smp1.gfa smp2.gfa > merged_all.gfa

  2. Executed Panacus for the merged merged_all.gfa: Exclude paths from reference genome CHM13 and GRCH38 grep '^P' ${INPUT_GFA} | cut -f2 | grep -iE 'chm13|grch38' > ${OUTPUT_DIR}/${BASEFILE}.pggb.path.txt

Unfortunately, it failed after 1-2 minutes without any specific error message.

  1. Manual check for the merged merged_all.gfa: grep '^P' ./merged_all.gfa | cut -f2 | grep -iE 'chm13|grch38' > ./test1.pggb.path.txt grep '^P' ./merged_all.gfa | cut -f2 | grep -ve 'chm13|grch38' > ./test2.pggb.path.txt

Both attempts generated empty text files.

Do you have any suggestions for resolving this?

Thanks.

Taek

danydoerr commented 1 week ago

Hey Taek, I am not sure if you can actually run vg combine on GFA files. I suspect you need to convert those first into vg's own file format. You can then also check the correctness of the combined graph with vg validate. -dany

OZTaekOppa commented 5 days ago

@danydoerr

Thank you for your reply.

I will test the conversion for all merged chromosome GFA files.

I have a question: For the -l option, I used 2, 6, and 49, with -q set to 0. My understanding is that HG002#1 and HG002#2 are treated as two haplotypes, and the -l 2 option means any two haplotypes (nodes or paths) are used to generate growth. This command produced a TSV file.

RUST_LOG=info ${PANACUS} ordered-histgrowth -c bp -t${PBS_NCPUS} -l 2,6,49 -S -e ${OUTPUT_DIR}/${BASEFILE}.pggb.path.txt -O ${OUTPUT_DIR}/pggb_hprc.order.samples.txt ${INPUT_GFA} > ${OUTPUT_DIR}/${BASEFILE}.pggb.ng1.ordhstgrw.bp.tsv

I want to check the correlation between CHM13 (each chromosome length) and the accumulated base pairs in the TSV file’s last row (-l 2). From what I understand, the last row should reflect the total accumulated length, which should be similar to the length of each CHM13 chromosome. Is that correct?

I encountered an error with the following command: For the -l option, I used 1, 2, and 54, with -q set to 0.

There was a numpy version issue in our default module, so I installed a separate Python environment with these versions: python=3.9.2, numpy=1.22.0, scipy=1.8.0, matplotlib=3.5.1, pandas=1.4.3, seaborn=0.12.0, and scikit-learn=1.1.1. However, I’m still getting the same error.

[2024-09-09T09:35:25Z INFO panacus::abacus] abacus has 57 path groups and 8701826 countables [2024-09-09T09:35:25Z INFO panacus::io] reporting ordered-growth table [2024-09-09T09:35:25Z INFO panacus::io] calculating ordered growth for coverage >= 2A and quorum >= 0R [2024-09-09T09:35:25Z INFO panacus::io] calculating ordered growth for coverage >= 1A and quorum >= 0R [2024-09-09T09:35:25Z INFO panacus::io] calculating ordered growth for coverage >= 54A and quorum >= 0R [2024-09-09T09:35:30Z INFO panacus] done; time elapsed: 70.026912399s

Is there any way to resolve this? Or does the Ordered growth require a minimum of -l 2? If yes, why is that?

Thanks.

Taek

heringerp commented 18 hours ago

Hi Taek,

the -l option describes the coverage needed for a feature to be included in the calculation. In the case of -l 2 it does not mean that only two haplotypes are used, but instead that only nodes (or edges) are used in the calculation (i.e. for the calculation of every single bar/row) that are visited by at least two haplotypes (and possibly more).

Yes, the last line should show the total number of features (nodes/edges/bps) of the whole graph.

No, ordered-growth does not require a minimum -l of 2, it should work with all values. Unfortunately, the issue you encountered is a bug with the -e parameter of panacus-visualize. I have created a PR for this (#36), but until it is merged you can avoid this by not using -e for panacus-visualize (no curve will be fitted).

Best regards Peter