How to generate a gfa file for diploid assembly based on an intermediate process file?

xujialupaoli commented 5 months ago

Thank you very much for providing such a useful tool!

I successfully used it to assemble the genome of a diploid orchid. I would like to get a GFA file of the diploid assembly, similar to the hifiasm_r_utg.fa output by hifiasm. I checked your source code and tried to output graph files, but I did not find any intermediate files containing the node base sequences. In the folder 5-assemble I only found contig_paths.gz, graph_edges.gz, graph_paths.gz, and id2name.txt.gz. How can I get a GFA file, or a similar graph file, for my subsequent analysis and path visualization?

I look forward to your reply, thank you very much!

lemene commented 5 months ago

Hi @xujialupaoli. Unfortunately, PECAT currently does not support GFA output. The XXX_tiles files (primary_tiles and alternate_tiles, or haplotype_1_tiles and haplotype_2_tiles) record how each contig is composed of reads, and the simplified string graph can be restored from them. For example, ctg44 edge=1357298:E~1091400:B describes an edge of the contig ctg44. Its corresponding overlap, 1091400 1357298 79.3962 -59447 1 34649 94106 94106 0 44139 103576 103576, can be found in the file filter.m4a. 1357298 and 1091400 are the read IDs. Their original names can be found in 0-prepare/id2name.gz.
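A minimal Python sketch of that restoration, for anyone who wants to try it. It only assumes what the example above shows: a tiles line starts with the contig name and contains a field of the form edge=READ:END~READ:END. The mapping from the B/E read ends to GFA orientations (`END_TO_ORIENT`) is a guess that is not confirmed anywhere in this thread, so check it against your data. The function names are hypothetical, and sequences and overlaps are left as `*`.

```python
import re

# Matches the edge field shown in the thread, e.g. "edge=1357298:E~1091400:B".
EDGE_RE = re.compile(r"edge=(\d+):([BE])~(\d+):([BE])")

# ASSUMPTION: the thread does not say how read ends map to GFA strands.
# This mapping is a placeholder and may need to be flipped.
END_TO_ORIENT = {"E": "+", "B": "-"}


def parse_tile(line):
    """Return (contig, src_read, src_end, dst_read, dst_end), or None."""
    fields = line.split()
    match = EDGE_RE.search(line)
    if not fields or match is None:
        return None
    src, src_end, dst, dst_end = match.groups()
    return fields[0], src, src_end, dst, dst_end


def tiles_to_gfa(lines):
    """Turn tiles lines into GFA1 lines: one S line per read, one L line per edge."""
    segments = []  # reads in first-seen order
    seen = set()
    links = []
    for line in lines:
        tile = parse_tile(line)
        if tile is None:
            continue  # skip lines without a recognizable edge field
        _, src, src_end, dst, dst_end = tile
        for read in (src, dst):
            if read not in seen:
                seen.add(read)
                segments.append(read)
        links.append(
            "L\t{}\t{}\t{}\t{}\t*".format(
                src, END_TO_ORIENT[src_end], dst, END_TO_ORIENT[dst_end]
            )
        )
    header = ["H\tVN:Z:1.0"]
    return header + ["S\t{}\t*".format(read) for read in segments] + links


if __name__ == "__main__":
    for gfa_line in tiles_to_gfa(["ctg44 edge=1357298:E~1091400:B"]):
        print(gfa_line)
```

The read IDs are used directly as segment names. Replacing them with the original read names, or filling in real overlap lengths from filter.m4a, would need the column layout of those files, which is not described here.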

xujialupaoli commented 5 months ago

Thank you so much for your quick and detailed response! Your answer is very inspiring to me.

I want to know the relationship between /5-assemble/graph_paths.gz and /5-assemble/graph_edges.gz. Is graph_edges.gz the simplified path of the string graph after removing redundant paths such as loops and low-quality edges? I want to generate a GFA file with rich information and paths by myself. Would it be better to build the GFA file through string_graph and path_graph in the 5-assemble process?

I understand that in a graph_edges.gz line such as 946:B 2825:E 18035 19291 18020 74.164 transitive, the first and second columns should correspond to the initial node and the end node. What is the information in the third to fifth columns?

For the graph_paths.gz file, I have lines such as 2~0_91589~0_359066~0_359066~0_40403 148462 233898 0_91589->0_359066->0_40403 and 2~0_541259~0_759759~0_535113~0_632200 ptransitive 169198 690082 0_541259->0_759759->0_535113->0_632200. I really want to know what 2~ stands for. I also want to know whether ->0_759759-> represents a connection through node 759759 in forward orientation (0_). In addition, what information does ptransitive 169198 690082 represent? Does it mean the path length?

I look forward to your reply, thank you very much!

lemene commented 5 months ago

graph_edges.gz records the edges in the string graph. The edges in the simplified graph are marked as active. graph_paths.gz records another form of the graph, in which multiple edges are combined into a path for simplification. It records some intermediate states for debugging. I suggest generating the GFA file from XXX_tiles or from the active edges in graph_edges.gz. The former is the graph composed of the contigs, while the latter is slightly more complex. 18035 19291 18020 74.164 is overlap information, such as start position, identity, etc.
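As a rough Python sketch, the active edges could be pulled out of graph_edges.gz like this. It assumes the layout of the example line quoted in this thread (whitespace-separated columns, the two nodes first, the status last) and that the status of a kept edge is the literal word active. Both points are assumptions that should be checked against a real file. The function names are hypothetical.

```python
import gzip


def active_edges(lines):
    """Yield (from_node, to_node) for every line whose last column is 'active'.

    ASSUMPTION: columns are whitespace-separated, the first two are the nodes
    (e.g. '946:B' and '2825:E'), and the last one is the edge status.
    """
    for line in lines:
        cols = line.split()
        if len(cols) < 3 or cols[-1] != "active":
            continue  # skip blank, malformed, and non-active lines
        yield cols[0], cols[1]


def read_active_edges(path):
    """Read a gzip-compressed edge file and return its active edges as a list."""
    with gzip.open(path, "rt") as handle:
        return list(active_edges(handle))


if __name__ == "__main__":
    # The line quoted in the thread is marked 'transitive', so it is filtered out.
    sample = ["946:B 2825:E 18035 19291 18020 74.164 transitive"]
    print(list(active_edges(sample)))
```

Each returned pair names the two read ends of one edge, so the pairs can be written out as GFA link lines once the read-end orientation convention has been settled.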

lemene / PECAT

How to generate a gfa file for diploid assembly based on an intermediate process file? #29