Closed ld9866 closed 9 months ago
Besides, we found the example file "hprc-v1.0-mc-grch38.gfa" is also the version gfa1.0. could you give me a example to visualize the MC result with version gfa 1.1?
l found the first line of "hprc-v1.0-mc-grch38.gfa" is H VN:Z:1.0, but our MC result is H VN:Z:1.1 RS:Z:Reference1. I don't know if that's the reason for the problem
The difference between v1.0 and v1.1 is that the latter also supports W-lines. However, panacus does not alter its behavior depending on different GFA versions. So, no, the GFA header line does not matter. In fact, panacus will still process GFA files even if they do not contain that header line.
Our problem is that the path file obtained in the first step is empty, but we can extract the path file after using vg to convert to version 1.0, I would like to ask why?
If the original GFA file is v1.1, I'd guess paths are encoded in W-lines:
grep -e '^W' hprc-v1.0-mc-grch38.gfa | cut -f2-6 | awk '{ print $1 "#" $2 "#" $3 ":" $4 "-" $5 }' >> hprc-v1.0-mc-grch38.paths.txt
Now we have found a problem, according to the example you provided, we can not extract the results of MC for visual analysis, including #4.
The description in #4 for hprc-v1.0-mc-grch38.gfa is obsolete, because panacus now provides shortcuts for grouping (-H
, '-S'), so you do not need to do that by hand. The following code replaces all of #4:
grep -ive 'grch38\|chm13' hprc-v1.0-mc-grch38.paths.txt > hprc-v1.0-mc-grch38.paths.haplotypes.txt
RUST_LOG=info panacus histgrowth -t16 -q 0,1,0.5,0.1 -S -s hprc-v1.0-mc-grch38.paths.haplotypes.txt hprc-v1.0-mc-grch38.gfa > hprc-v1.0-mc-grch38.histgrowth.node.txt
panacus-visualize hprc-v1.0-mc-grch38.histgrowth.node.txt > hprc-v1.0-mc-grch38.histgrowth.node.pdf
... and to produce a plot quantifying the non-reference sequence of the graph:
grep -ie 'grch38' hprc-v1.0-mc-grch38.paths.txt > hprc-v1.0-mc-grch38.paths.grch38.txt
RUST_LOG=info panacus histgrowth -t16 -c bp -q 0,1,0.5,0.1 -e hprc-v1.0-mc-grch38.paths.grch38.txt -S -s hprc-v1.0-mc-grch38.paths.haplotypes.txt hprc-v1.0-mc-grch38.gfa > hprc-v1.0-mc-grch38.histgrowth.nogrch38.bp.txt
panacus-visualize hprc-v1.0-mc-grch38.histgrowth.nogrch38.bp.txt > hprc-v1.0-mc-grch38.histgrowth.nogrch38.bp.pdf
Thank you for your positive reply. We will get back to you after the test. Best
Dear developer: The code "RUST_LOG=info panacus histgrowth -t16 -c bp -q 0,1,0.5,0.1 -e hprc-v1.0-mc-grch38.paths.grch38.txt -S -s hprc-v1.0-mc-grch38.paths.haplotypes.txt hprc-v1.0-mc-grch38.gfa > hprc-v1.0-mc-grch38.histgrowth.nogrch38.bp.txt" this step takes a long time, but nothing is generated in the code, what is the reason for this? log file: [2023-12-04T00:17:57Z INFO panacus::cli] running panacus on 16 threads [2023-12-04T00:17:57Z INFO panacus::cli] constructing indexes for node/edge IDs, node lengths, and P/W lines.. [2023-12-04T00:20:55Z INFO panacus::cli] ..done; found 12817 paths/walks and 188902154 nodes [2023-12-04T00:20:55Z INFO panacus::cli] loading data from group / subset / exclude files [2023-12-04T00:20:55Z INFO panacus::abacus] loading coordinates from pig.full.paths.haplotypes.txt [2023-12-04T00:20:56Z INFO panacus::abacus] loading coordinates from pig.full.paths.Sscrofa.txt [2023-12-04T00:20:56Z INFO panacus::cli] loading graph from pig.full.gfa [2023-12-04T00:20:56Z INFO panacus::abacus] parsing path + walk sequences [2023-12-04T04:36:40Z INFO panacus::abacus] counting abacus entries.. [2023-12-04T04:36:52Z INFO panacus::cli] abacus has 26 path groups and 188902155 countables [2023-12-04T04:36:52Z INFO panacus::cli] constructing histogram.. [2023-12-04T04:36:53Z INFO panacus::hist] calculating growth for coverage >= 1A and quorum >= 0R [2023-12-04T04:36:53Z INFO panacus::hist] calculating growth for coverage >= 1A and quorum >= 1R [2023-12-04T04:36:53Z INFO panacus::hist] calculating growth for coverage >= 1A and quorum >= 0.5R [2023-12-04T04:36:53Z INFO panacus::hist] calculating growth for coverage >= 1A and quorum >= 0.1R [2023-12-04T04:36:53Z INFO panacus::cli] reporting (hist-)growth table [2023-12-04T04:37:43Z INFO panacus] done; time elapsed: 15586.143364131s
Hi, do you mind sharing the output file hprc-v1.0-mc-grch38.histgrowth.nogrch38.bp.txt
in this thread?
The code is similar to your suggestion grep -e '^W' pig.full.gfa | cut -f2-6 | awk '{ print $1 "#" $2 "#" $3 ":" $4 "-" $5 }' >> pig.full.paths.txt
# grep -ive 'Sscrofa' pig.full.paths.txt > pig.full.paths.haplotypes.txt
RUST_LOG=info panacus histgrowth -t 16 -q 0,1,0.5,0.1 -S -s pig.full.paths.haplotypes.txt pig.full.gfa > pig.full.histgrowth.node.txt
panacus-visualize pig.full.histgrowth.node.txt > pig.full.histgrowth.node.pdf
# grep -ie 'Sscrofa' pig.full.paths.txt > pig.full.paths.Sscrofa.txt
RUST_LOG=info panacus histgrowth -t 16 -c bp -q 0,1,0.5,0.1 -e pig.full.paths.Sscrofa.txt -S -s pig.full.paths.haplotypes.txt pig.full.gfa > pig.full.histgrowth.noref.bp.txt
panacus-visualize pig.full.histgrowth.noref.bp.txt > pig.full.histgrowth.noref.bp.pdf
Thanks, I'll have a look at it now.
Regarding the analysis on the hprc-v1.0-mc-grch38.gfa
graph:
The plot looks fine, except that it seems the GRCh38 paths are still included. If you followed the above code, I suppose that's my mistake, because in this graph, the GRCh38 paths are encoded in P-lines, which I did not extract into the paths file. My guess is that your current file hprc-v1.0-mc-grch38.paths.txt
is empty.
Here is the code to correct for that. If you re-run panacus with the updated file, your graph should look more what you expected:
grep '^P' hprc-v1.0-mc-grch38.gfa | cut -f2 > hprc-v1.0-mc-grch38.paths.txt
grep -e '^W' hprc-v1.0-mc-grch38.gfa | cut -f2-6 | awk '{ print $1 "#" $2 "#" $3 ":" $4 "-" $5 }' >> hprc-v1.0-mc-grch38.paths.txt
grep -ie 'grch38' hprc-v1.0-mc-grch38.paths.txt > hprc-v1.0-mc-grch38.paths.grch38.txt
Now, regarding the analysis on your pig data set:
Again, the plots and tables look just fine, except for pig.full.histgrowth.noref.bp.pdf
where I expected less amount of sequence, since it shouldn't include the reference. I suspect that here the issue is the same as with the HPRC data set, i.e., the file with the reference paths is empty. Is this also your concern, or is there something else that you do not find plausible with the results?
Btw, I'd recommend adding -a
to the panacus histrowth
calls, so that you also get the coverage histograms.
Again, the plots and tables look just fine, except for
pig.full.histgrowth.noref.bp.pdf
where I expected less amount of sequence, since it shouldn't include the reference. I suspect that here the issue is the same as with the HPRC data set, i.e., the file with the reference paths is empty. Is this also your concern, or is there something else that you do not find plausible with the results?
Ah scratch that. I also can't find any fault in pig.full.histgrowth.noref.bp.pdf
, the amount of sequence shown seems to be much smaller than the total genome length, so it looks good to me. If you want to discuss the plot in more detail, let's do this via mail (daniel.doerr@hhu.de)
Dear developer: The problem is "produce a plot quantifying the non-reference sequence of the graph", The problem we have now is that we are having some problems drawing non-reference sequences, and the results don't look very good, how do we solve this?
What makes you believe that "the results don't look very good"? I'm looking at the plot pig.full.histgrowth.noref.bp.pdf
and I see that the sequence of the first genome is at about 100Mbp. If you expect the genome to have an average length of 2.7Gb, then this indicates to me that a lot of sequence is excluded, so (for all I can see, given the data that you provided) the filter for non-reference sequence has worked.
However, I found that the non-reference sequence would reach 1.5Gb, which I think is unlikely, logically at 300Mb
Ok, that's a good starting point. Can you quantify the total length (in bp) of your graph?
I uploaded our pig data, could you please help us see what is wrong with it.
Yes, that's what I am trying to do, but the data that you provide is not sufficient to see where the issue is. Can you please also provide the file paths file to me?
grep '^P' pig.full.gfa | cut -f2 > pig.full.paths.txt
grep -e '^W' pig.full.gfa | cut -f2-6 | awk '{ print $1 "#" $2 "#" $3 ":" $4 "-" $5 }' >> pig.full.paths.txt
code: grep -ie 'Sscrofa' pig.full.paths.txt > pig.full.paths.Sscrofa.txt RUST_LOG=info panacus histgrowth -t 16 -c bp -q 0,1,0.5,0.1 -e pig.full.paths.Sscrofa.txt -S -s pig.full.paths.haplotypes.txt pig.full.gfa > pig.full.histgrowth.noref.bp.txt panacus-visualize pig.full.histgrowth.noref.bp.txt > pig.full.histgrowth.noref.bp.pdf
pig.full.histgrowth.ref.bp.pdf pig.full.histgrowth.ref.bp.txt
The file names of your upload do not agree with the command that you typed.
Thanks. The paths names are all fine, panacus
recognizes that they are given in PanSN format. Again, there is no evidence that your analysis with panacus has failed. If you quantify the total length (bp) of your graph, and then subtract the length of your reference sequence (~2.5Gb?) you should be able to validate this independently.
We were delighted to hear that the data was fine, and actually what we wanted to do was to make a graph of generic sequence growth, so that it would be a better illustration of our pan-genome being more representative. Now we are using the GFA result of Minigraph-cactus as input, how can we do that? As you said above, the node count is correct. The question now is how to get a visual result right.
The question now is how to get a visual result right.
What's wrong with the visualization?
Dear developer: I think I understand what you mean, now we are on the right track of visualization may not be a problem, thank you for your reply, I re-think our results, because our results involve content publishing issues, so I delete part of the results hope you can understand. Thank you very much for your help.
I perfectly understand. Again, we can discuss things in private via email, if that helps.
Dear developer: Now we have found a problem, according to the example you provided, we can not extract the results of MC for visual analysis, including https://github.com/marschall-lab/panacus/issues/4. Our problem is that the path file obtained in the first step is empty, but we can extract the path file after using vg to convert to version 1.0, I would like to ask why? code: grep '^P' test.full.gfa | cut -f2 > test.paths.txt