marschall-lab / panacus

Panacus is a tool for computing statistics for GFA-formatted pangenome graphs
MIT License
73 stars 4 forks source link

How to Visualize the results of the minigraph-cactus? #14

Closed ld9866 closed 9 months ago

ld9866 commented 9 months ago

Dear developer: Now we have found a problem, according to the example you provided, we can not extract the results of MC for visual analysis, including https://github.com/marschall-lab/panacus/issues/4. Our problem is that the path file obtained in the first step is empty, but we can extract the path file after using vg to convert to version 1.0, I would like to ask why? code: grep '^P' test.full.gfa | cut -f2 > test.paths.txt

ld9866 commented 9 months ago

Besides, we found the example file "hprc-v1.0-mc-grch38.gfa" is also the version gfa1.0. could you give me a example to visualize the MC result with version gfa 1.1?

ld9866 commented 9 months ago

l found the first line of "hprc-v1.0-mc-grch38.gfa" is H VN:Z:1.0, but our MC result is H VN:Z:1.1 RS:Z:Reference1. I don't know if that's the reason for the problem

danydoerr commented 9 months ago

The difference between v1.0 and v1.1 is that the latter also supports W-lines. However, panacus does not alter its behavior depending on different GFA versions. So, no, the GFA header line does not matter. In fact, panacus will still process GFA files even if they do not contain that header line.

Our problem is that the path file obtained in the first step is empty, but we can extract the path file after using vg to convert to version 1.0, I would like to ask why?

If the original GFA file is v1.1, I'd guess paths are encoded in W-lines:

grep -e '^W' hprc-v1.0-mc-grch38.gfa | cut -f2-6 | awk '{ print $1 "#" $2 "#" $3 ":" $4 "-" $5 }' >> hprc-v1.0-mc-grch38.paths.txt

Now we have found a problem, according to the example you provided, we can not extract the results of MC for visual analysis, including #4.

The description in #4 for hprc-v1.0-mc-grch38.gfa is obsolete, because panacus now provides shortcuts for grouping (-H, '-S'), so you do not need to do that by hand. The following code replaces all of #4:


grep -ive 'grch38\|chm13' hprc-v1.0-mc-grch38.paths.txt > hprc-v1.0-mc-grch38.paths.haplotypes.txt

RUST_LOG=info panacus histgrowth -t16 -q 0,1,0.5,0.1 -S -s hprc-v1.0-mc-grch38.paths.haplotypes.txt hprc-v1.0-mc-grch38.gfa > hprc-v1.0-mc-grch38.histgrowth.node.txt

panacus-visualize hprc-v1.0-mc-grch38.histgrowth.node.txt >  hprc-v1.0-mc-grch38.histgrowth.node.pdf

... and to produce a plot quantifying the non-reference sequence of the graph:


grep -ie 'grch38' hprc-v1.0-mc-grch38.paths.txt > hprc-v1.0-mc-grch38.paths.grch38.txt

RUST_LOG=info panacus histgrowth -t16 -c bp -q 0,1,0.5,0.1 -e hprc-v1.0-mc-grch38.paths.grch38.txt -S -s hprc-v1.0-mc-grch38.paths.haplotypes.txt hprc-v1.0-mc-grch38.gfa > hprc-v1.0-mc-grch38.histgrowth.nogrch38.bp.txt

panacus-visualize hprc-v1.0-mc-grch38.histgrowth.nogrch38.bp.txt >  hprc-v1.0-mc-grch38.histgrowth.nogrch38.bp.pdf
ld9866 commented 9 months ago

Thank you for your positive reply. We will get back to you after the test. Best

ld9866 commented 9 months ago

Dear developer: The code "RUST_LOG=info panacus histgrowth -t16 -c bp -q 0,1,0.5,0.1 -e hprc-v1.0-mc-grch38.paths.grch38.txt -S -s hprc-v1.0-mc-grch38.paths.haplotypes.txt hprc-v1.0-mc-grch38.gfa > hprc-v1.0-mc-grch38.histgrowth.nogrch38.bp.txt" this step takes a long time, but nothing is generated in the code, what is the reason for this? log file: [2023-12-04T00:17:57Z INFO panacus::cli] running panacus on 16 threads [2023-12-04T00:17:57Z INFO panacus::cli] constructing indexes for node/edge IDs, node lengths, and P/W lines.. [2023-12-04T00:20:55Z INFO panacus::cli] ..done; found 12817 paths/walks and 188902154 nodes [2023-12-04T00:20:55Z INFO panacus::cli] loading data from group / subset / exclude files [2023-12-04T00:20:55Z INFO panacus::abacus] loading coordinates from pig.full.paths.haplotypes.txt [2023-12-04T00:20:56Z INFO panacus::abacus] loading coordinates from pig.full.paths.Sscrofa.txt [2023-12-04T00:20:56Z INFO panacus::cli] loading graph from pig.full.gfa [2023-12-04T00:20:56Z INFO panacus::abacus] parsing path + walk sequences [2023-12-04T04:36:40Z INFO panacus::abacus] counting abacus entries.. [2023-12-04T04:36:52Z INFO panacus::cli] abacus has 26 path groups and 188902155 countables [2023-12-04T04:36:52Z INFO panacus::cli] constructing histogram.. [2023-12-04T04:36:53Z INFO panacus::hist] calculating growth for coverage >= 1A and quorum >= 0R [2023-12-04T04:36:53Z INFO panacus::hist] calculating growth for coverage >= 1A and quorum >= 1R [2023-12-04T04:36:53Z INFO panacus::hist] calculating growth for coverage >= 1A and quorum >= 0.5R [2023-12-04T04:36:53Z INFO panacus::hist] calculating growth for coverage >= 1A and quorum >= 0.1R [2023-12-04T04:36:53Z INFO panacus::cli] reporting (hist-)growth table [2023-12-04T04:37:43Z INFO panacus] done; time elapsed: 15586.143364131s

danydoerr commented 9 months ago

Hi, do you mind sharing the output file hprc-v1.0-mc-grch38.histgrowth.nogrch38.bp.txt in this thread?

ld9866 commented 9 months ago

The code is similar to your suggestion grep -e '^W' pig.full.gfa | cut -f2-6 | awk '{ print $1 "#" $2 "#" $3 ":" $4 "-" $5 }' >> pig.full.paths.txt

# grep -ive 'Sscrofa' pig.full.paths.txt > pig.full.paths.haplotypes.txt

RUST_LOG=info panacus histgrowth -t 16 -q 0,1,0.5,0.1 -S -s pig.full.paths.haplotypes.txt pig.full.gfa > pig.full.histgrowth.node.txt

panacus-visualize pig.full.histgrowth.node.txt > pig.full.histgrowth.node.pdf

# grep -ie 'Sscrofa' pig.full.paths.txt > pig.full.paths.Sscrofa.txt

RUST_LOG=info panacus histgrowth -t 16 -c bp -q 0,1,0.5,0.1 -e pig.full.paths.Sscrofa.txt -S -s pig.full.paths.haplotypes.txt pig.full.gfa > pig.full.histgrowth.noref.bp.txt

panacus-visualize pig.full.histgrowth.noref.bp.txt > pig.full.histgrowth.noref.bp.pdf

danydoerr commented 9 months ago

Thanks, I'll have a look at it now.

danydoerr commented 9 months ago

Regarding the analysis on the hprc-v1.0-mc-grch38.gfa graph:

The plot looks fine, except that it seems the GRCh38 paths are still included. If you followed the above code, I suppose that's my mistake, because in this graph, the GRCh38 paths are encoded in P-lines, which I did not extract into the paths file. My guess is that your current file hprc-v1.0-mc-grch38.paths.txt is empty.

Here is the code to correct for that. If you re-run panacus with the updated file, your graph should look more what you expected:

grep '^P' hprc-v1.0-mc-grch38.gfa | cut -f2 > hprc-v1.0-mc-grch38.paths.txt
grep -e '^W' hprc-v1.0-mc-grch38.gfa | cut -f2-6 | awk '{ print $1 "#" $2 "#" $3 ":" $4 "-" $5 }' >> hprc-v1.0-mc-grch38.paths.txt
grep -ie 'grch38' hprc-v1.0-mc-grch38.paths.txt > hprc-v1.0-mc-grch38.paths.grch38.txt

Now, regarding the analysis on your pig data set:

Again, the plots and tables look just fine, except for pig.full.histgrowth.noref.bp.pdf where I expected less amount of sequence, since it shouldn't include the reference. I suspect that here the issue is the same as with the HPRC data set, i.e., the file with the reference paths is empty. Is this also your concern, or is there something else that you do not find plausible with the results?

danydoerr commented 9 months ago

Btw, I'd recommend adding -a to the panacus histrowth calls, so that you also get the coverage histograms.

danydoerr commented 9 months ago

Again, the plots and tables look just fine, except for pig.full.histgrowth.noref.bp.pdf where I expected less amount of sequence, since it shouldn't include the reference. I suspect that here the issue is the same as with the HPRC data set, i.e., the file with the reference paths is empty. Is this also your concern, or is there something else that you do not find plausible with the results?

Ah scratch that. I also can't find any fault in pig.full.histgrowth.noref.bp.pdf, the amount of sequence shown seems to be much smaller than the total genome length, so it looks good to me. If you want to discuss the plot in more detail, let's do this via mail (daniel.doerr@hhu.de)

ld9866 commented 9 months ago

Dear developer: The problem is "produce a plot quantifying the non-reference sequence of the graph", The problem we have now is that we are having some problems drawing non-reference sequences, and the results don't look very good, how do we solve this?

danydoerr commented 9 months ago

What makes you believe that "the results don't look very good"? I'm looking at the plot pig.full.histgrowth.noref.bp.pdf and I see that the sequence of the first genome is at about 100Mbp. If you expect the genome to have an average length of 2.7Gb, then this indicates to me that a lot of sequence is excluded, so (for all I can see, given the data that you provided) the filter for non-reference sequence has worked.

ld9866 commented 9 months ago

However, I found that the non-reference sequence would reach 1.5Gb, which I think is unlikely, logically at 300Mb

danydoerr commented 9 months ago

Ok, that's a good starting point. Can you quantify the total length (in bp) of your graph?

danydoerr commented 9 months ago

I uploaded our pig data, could you please help us see what is wrong with it.

Yes, that's what I am trying to do, but the data that you provide is not sufficient to see where the issue is. Can you please also provide the file paths file to me?

grep '^P' pig.full.gfa | cut -f2 > pig.full.paths.txt
grep -e '^W' pig.full.gfa | cut -f2-6 | awk '{ print $1 "#" $2 "#" $3 ":" $4 "-" $5 }' >> pig.full.paths.txt

code: grep -ie 'Sscrofa' pig.full.paths.txt > pig.full.paths.Sscrofa.txt RUST_LOG=info panacus histgrowth -t 16 -c bp -q 0,1,0.5,0.1 -e pig.full.paths.Sscrofa.txt -S -s pig.full.paths.haplotypes.txt pig.full.gfa > pig.full.histgrowth.noref.bp.txt panacus-visualize pig.full.histgrowth.noref.bp.txt > pig.full.histgrowth.noref.bp.pdf

pig.full.histgrowth.ref.bp.pdf pig.full.histgrowth.ref.bp.txt

The file names of your upload do not agree with the command that you typed.

danydoerr commented 9 months ago

Thanks. The paths names are all fine, panacus recognizes that they are given in PanSN format. Again, there is no evidence that your analysis with panacus has failed. If you quantify the total length (bp) of your graph, and then subtract the length of your reference sequence (~2.5Gb?) you should be able to validate this independently.

ld9866 commented 9 months ago

We were delighted to hear that the data was fine, and actually what we wanted to do was to make a graph of generic sequence growth, so that it would be a better illustration of our pan-genome being more representative. Now we are using the GFA result of Minigraph-cactus as input, how can we do that? As you said above, the node count is correct. The question now is how to get a visual result right.

danydoerr commented 9 months ago

The question now is how to get a visual result right.

What's wrong with the visualization?

ld9866 commented 9 months ago

Dear developer: I think I understand what you mean, now we are on the right track of visualization may not be a problem, thank you for your reply, I re-think our results, because our results involve content publishing issues, so I delete part of the results hope you can understand. Thank you very much for your help.

danydoerr commented 9 months ago

I perfectly understand. Again, we can discuss things in private via email, if that helps.