marschall-lab / panacus

Panacus is a tool for computing statistics for GFA-formatted pangenome graphs
MIT License

command is not supported for more than 65534 #13

Closed ld9866 closed 10 months ago

ld9866 commented 10 months ago

Dear developer: We are running a test on our real data, following the example data, but we have encountered some problems and hope to get your help. The errors are as follows. Best day!

Step 1 is OK:

grep '^P' test.giffa2.1.0.gfa | cut -f2 | grep -ve 'reference' > test.giffa2.paths.haplotypes.txt

Step 2 gives an error:

RUST_LOG=info /home/test/Software/panacus-0.2.3_linux_x86_64/bin/panacus histgrowth -t 4 -l 1,2,1,1,1 -q 0,0,1,0.5,0.1 -S -a -s test.giffa2.paths.haplotypes.txt test.giffa2.1.0.gfa > test.giffa2.histgrowth.node.tsv

The error:

[2023-11-30T00:41:26Z INFO panacus::cli] running panacus on 4 threads
[2023-11-30T00:41:26Z INFO panacus::cli] constructing indexes for node/edge IDs, node lengths, and P/W lines..
[2023-11-30T00:43:19Z INFO panacus::cli] ..done; found 383935 paths/walks and 174028496 nodes
[2023-11-30T00:43:19Z INFO panacus::cli] loading data from group / subset / exclude files
[2023-11-30T00:43:19Z INFO panacus::abacus] loading coordinates from pig.giffa2.paths.haplotypes.txt
Error: Custom { kind: Unsupported, error: "data has 383917 path groups, but command is not supported for more than 65534" }

danydoerr commented 10 months ago

Yes, that's right: at the moment the tool is limited to 65534 path groups ("samples" or "taxa", so to speak). I did not think it likely that data sets with more distinct samples/taxa exist right now. How many samples does your data set have?

Typically, you want to group your paths into samples or haplotypes, but this requires that path names adhere to the PanSN naming scheme. Then you can simply group by sample (-S) or haplotype (-H).
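
As an illustration (the path names, graph file, and output file below are placeholders, not your data): PanSN-style path names encode sample, haplotype, and contig separated by '#', so the grouping falls out of the names directly:

    # hypothetical PanSN-style path names as they would appear in P/W lines:
    #   sample1#1#chr1
    #   sample1#2#chr1
    #   sample2#1#chr1
    # group by sample (one group per sample1, sample2, ...):
    panacus histgrowth -S graph.gfa > histgrowth.sample.tsv
    # or group by haplotype (one group per sample1#1, sample1#2, ...):
    panacus histgrowth -H graph.gfa > histgrowth.haplotype.tsv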

danydoerr commented 10 months ago

Oh, and if your paths are not PanSN compatible, you can still do the grouping by hand by specifying a path-to-group mapping with -g.
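
If I remember the format correctly (please double-check against the panacus documentation), the mapping passed via -g is a tab-separated file with the path name in the first column and the group label in the second; the file names below are placeholders:

    # hypothetical groups.tsv (path name <TAB> group label):
    #   pathA_contig1    sample1
    #   pathA_contig2    sample1
    #   pathB_contig1    sample2
    panacus histgrowth -g groups.tsv graph.gfa > histgrowth.grouped.tsv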

ld9866 commented 10 months ago

Thank you for getting back to me so quickly. In fact, we only have 27 samples, and the genome size of each sample is 2.5 Gb, so if the human pangenome can be handled, visualizing our data should not be a problem. We used minigraph-cactus for pangenome construction and then used vg to convert to GFA 1.1 format for visual analysis. I would like to ask how we should conduct quality control or other operations to complete the visualization. Best yours.

danydoerr commented 10 months ago

Ok, then this means that you need to group the paths by sample (-S) or haplotype (-H). Regarding quality control, I think panacus is a good starting point; here are my suggestions:

  1. Generate an HTML page that contains coverage histograms+growth curves for all count types:
    RUST_LOG=info panacus histgrowth -t4 -l 1,2,1,1,1 -q 0,0,1,0.5,0.1 -H -c all -a -o html test.giffa2.1.0.gfa > test.giffa2.histgrowth.all.html
  2. I find the coverage plots very insightful for quality control. Typically, you expect that the two highest bars correspond to coverage by a single sample/haplotype and by all samples/haplotypes, respectively. Anything else indicates that you might want to reconsider your alignment parameters.
  3. I find the node-resolved coverage table extremely helpful for checking some basic properties of pangenome graphs, especially in combination with node length information (see script gfa2nodelen.py.zip; a rough stand-in for extracting node lengths is sketched after this list). The table can be generated with
    RUST_LOG=info panacus table -t4 -H -c node test.giffa2.1.0.gfa > test.giffa2.coverage.node.tsv
  4. I am a bit surprised that you have ~170 million nodes in your graph, given a genome size of 2.5 Gbp per sample. For comparison, the HPRC+Chinese human pangenome graph (also generated with minigraph-cactus) contains 211 haplotypes of ~2.7 Gbp each, yet has only about 119 million nodes. Now, this does not necessarily mean that your graph has poor quality; the number of nodes also depends very much on the diversity of the genomes. The large number of nodes might make the analysis I propose (see 3.) a bit more resource-demanding, but typical HPCs nowadays should be able to deal with tables of this size.
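
I have not inlined gfa2nodelen.py here; as a rough stand-in (and assuming your GFA stores sequences on its S-lines rather than only LN tags; the output file name is just an example), node lengths can be pulled out with a one-liner:

    # print node ID and sequence length for every S-line of the GFA
    awk -F'\t' '$1 == "S" { print $2 "\t" length($3) }' test.giffa2.1.0.gfa > test.giffa2.node_lengths.tsv
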
danydoerr commented 10 months ago

If you have further questions on QC of your pangenome graph, please email me at daniel.doerr@hhu.de

ld9866 commented 10 months ago

OK! I will send the detailed information to your email for consultation. With best wishes!