Open xujialupaoli opened 5 months ago
Hi @xujialupaoli
Unfortunately, PECAT currently does not support outputting the gfa format. XXX_tiles (primary_tiles and alternate_tiles, or haplotype_1_tiles and haplotype_2_tiles) records how each contig is composed of reads. It can be restored to the simplified string graph. For example, ctg44 edge=1357298:E~1091400:B means an edge of the contig ctg44. Its corresponding overlap 1091400 1357298 79.3962 -59447 1 34649 94106 94106 0 44139 103576 103576 can be found in the file filter.m4a. 1357298 and 1091400 are the reads. Their original names can be found in 0-prepare/id2name.gz.
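As a rough illustration of the tiles format described above, here is a minimal Python sketch that splits a line such as ctg44 edge=1357298:E~1091400:B into the contig name and the two read ends, then groups the edges per contig. The names TileEdge, parse_tile_line and edges_by_contig are hypothetical helpers, not part of PECAT, and the regular expression assumes every tiles line follows the single example shown in this thread.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class TileEdge(NamedTuple):
    """One edge of a contig, as read from a XXX_tiles line (hypothetical type)."""
    contig: str
    src_read: str
    src_end: str  # "B" or "E", copied as-is from the file
    dst_read: str
    dst_end: str


# Assumed layout, based only on the example "ctg44 edge=1357298:E~1091400:B".
_TILE_RE = re.compile(r"^(\S+)\s+edge=([^:~\s]+):([BE])~([^:~\s]+):([BE])")


def parse_tile_line(line: str) -> TileEdge:
    """Parse a single tiles line; raise ValueError if it does not match."""
    match = _TILE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized tiles line: {line!r}")
    return TileEdge(*match.groups())


def edges_by_contig(lines: Iterable[str]) -> Dict[str, List[TileEdge]]:
    """Group the parsed edges by contig name, keeping file order."""
    grouped: Dict[str, List[TileEdge]] = defaultdict(list)
    for line in lines:
        if line.strip():  # skip blank lines
            edge = parse_tile_line(line)
            grouped[edge.contig].append(edge)
    return dict(grouped)
```

The read ids returned here are PECAT's internal ids, so they would still need to be mapped back to the original names through 0-prepare/id2name.gz before being written as GFA segment names.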
Thank you so much for your quick and detailed response!
Your answer is very inspiring to me.
I want to know the relationship between /5-assemble/graph_paths.gz and /5-assemble/graph_edges.gz. Is graph_edges.gz the simplified path of the string graph after removing redundant paths such as loops and low-quality edges?
I want to generate a gfa file with rich information and paths by myself. Would it be better to build the gfa file through string_graph and path_graph in the 5-assemble process?
I understand that in the graph_edges.gz file, such as 946:B 2825:E 18035 19291 18020 74.164 transitive, the first and second columns should correspond to the initial node and the end node. What is the information in the third to fifth columns?
I understand the information in the graph_paths.gz file, such as
2~0_91589~0_359066~0_359066~0_40403 148462 233898 0_91589->0_359066->0_40403
2~0_541259~0_759759~0_535113~0_632200 ptransitive 169198 690082 0_541259->0_759759->0_535113->0_632200
I really want to know what 2~ stands for.
I also want to know whether ->0_759759-> represents the connection of node 759759 in a positive order (0_)?
In addition, what information does ptransitive 169198 690082 represent? Does it mean the path length?
I look forward to your reply, thank you very much!
graph_edges.gz records the edges in the string graph. The edges in the simplified graph are marked as active. graph_paths.gz records another form of the graph, in which multiple edges are combined into a path for simplification. It records some intermediate states for debugging. I suggest generating the gfa file from XXX_tiles or the active edges in graph_edges.gz. The former is the graph composed of the contigs, while the latter is slightly more complex.
18035 19291 18020 74.164 is overlap information, such as start position, identity, etc.
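For anyone who wants to try the second route, the following Python sketch keeps only the active edges from graph_edges.gz. It assumes the file is whitespace-delimited with the two nodes in the first two columns and the status label (for example transitive or active) in the last column, as in the line quoted earlier in this thread. The names GraphEdge, parse_edge_line, active_edges and read_active_edges are hypothetical, and the overlap columns are kept as raw strings because their exact meaning is not spelled out here.

```python
import gzip
from typing import Iterable, Iterator, NamedTuple, Tuple


class GraphEdge(NamedTuple):
    """One line of graph_edges.gz (hypothetical type, assumed column layout)."""
    src: str                         # e.g. "946:B"
    dst: str                         # e.g. "2825:E"
    overlap_fields: Tuple[str, ...]  # middle columns, left uninterpreted
    status: str                      # last column, e.g. "transitive" or "active"


def parse_edge_line(line: str) -> GraphEdge:
    """Split one edge line; raise ValueError if it has too few columns."""
    cols = line.split()
    if len(cols) < 3:
        raise ValueError(f"unrecognized edge line: {line!r}")
    return GraphEdge(cols[0], cols[1], tuple(cols[2:-1]), cols[-1])


def active_edges(lines: Iterable[str]) -> Iterator[GraphEdge]:
    """Yield only the edges whose status column is exactly "active"."""
    for line in lines:
        if not line.strip():
            continue
        edge = parse_edge_line(line)
        if edge.status == "active":
            yield edge


def read_active_edges(path: str) -> Iterator[GraphEdge]:
    """Stream active edges from a gzip-compressed edges file."""
    with gzip.open(path, "rt") as handle:
        yield from active_edges(handle)
```

A call such as read_active_edges("5-assemble/graph_edges.gz") would then give the edge set to convert into GFA link records. How the B and E ends map onto GFA orientations is not covered by this sketch.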
Thank you very much for providing such a useful tool!
I successfully used it to assemble the genome of a diploid orchid. I want to get the gfa file of the diploid assembly, similar to the hifiasm_r_utg.fa output by hifiasm. I checked your source code and tried to output graph files, but I did not find any intermediate files containing the base sequences of the nodes. I only found contig_paths.gz, graph_edges.gz, graph_paths.gz and id2name.txt.gz in the folder 5-assemble. How can I get the gfa file, or other similar graph files, for my subsequent analysis and path visualization?
I look forward to your reply, thank you very much!