Obtain path sequences as specified in P-lines

dirkjanvw commented 1 year ago

Hi, thanks for providing this great toolkit!

I found this toolkit to be the only one I could find that is able to provide the fasta sequence for a path. The only thing is, I have to manually specify this. Would it be possible to supply the PathName of a path (or by default obtain the sequence for all paths)? My current workaround for one path is:

pathname="G1S1" #name of the path
pathsegments=$(awk -vpathname=${pathname} 'BEGIN{FS = OFS = "\t";} $1=="P" && $2==pathname {print $3;}' test.gfa) #obtaining the order of segments for the path
gfatk path test.gfa "${pathsegments}" | awk -vpathname=${pathname} '/^>/{$1=">"pathname} {print;}' #use gfatk for obtaining the sequence and change fasta header back to name of the path

However, ideally I think it should be possible to say something like gfatk path test.gfa --all > all_paths.fa, if P-lines are specified in the GFA format. (I have not found any tool that can do this but yours seems closest.) What are your thoughts on this?

Euphrasiologist commented 1 year ago

Hello! Glad you find this tool useful :)

Yes, that is a bit of an annoying workaround I admit. I never got around to implementing P segments in the toolkit, as I wasn't using them at the time. That's the only reason they are not in there.

I think you are right, this should be easy enough to implement it. I'll take a look and update the code. A nice API might be:

// for all paths like you say
gfatk path test.gfa --all > all_paths.fa 
// for a specific named path
gfatk path test.gfa --name G1S1 > G1S1.fa

What do you think? Then a user can get a specific named path.

dirkjanvw commented 1 year ago

That looks perfect! It would be great to have it work like that :)

My specific usecase is that I would like to see whether GFA files constructed from specific sequences still contain the exact same sequences I used as input.

Euphrasiologist commented 1 year ago

I've added an implementation of this (not the --name bit yet). You'll have to compile until I get it updated on conda etc, hope that's okay!

dirkjanvw commented 1 year ago

Thanks a lot, this works for me!

Maybe another thing to consider is that the fasta header of large GFA files becomes really large as --all now puts all segments and their orientation in the fasta header. I'd say it is acceptable to have the PathName to be the only thing in the fasta header when obtaining all paths (they are likely very long paths, otherwise one could just simply write the segments by hand).

Euphrasiologist commented 1 year ago

Great - can change that. Out of interest how long are your paths on average?

dirkjanvw commented 1 year ago

I am working with compressed de Bruijn graphs, so I recently created a bacterial cDBG with one genome for which I built a GFA file. The path for this bacterial sequence had >900,000 segments, so already using less on the P-line in the GFA file was super slow. In the end, I want to create cDBGs for higher plants and animals and I expect the graphs to easily contain tens of millions of segments per sequence (in this case, per chromosome).

Euphrasiologist commented 1 year ago

Woah, that's waaay(!!) bigger than I had thought. Do let me know what the performance is like, that would be interesting! I've really barely made any optimisations, so lots of room for improvement.

dirkjanvw commented 1 year ago

I already saw that in the code but I was surprised by the speed! Probably in a couple of weeks, I'll try it out on properly large sets such as the plant and animal sets, but this is the speeds on my two "small" test GFA files:

% cut -f 1 tmp.gfa | sort | uniq -c   #to get some stats on the GFA file
   1 H
 161 L
   4 P
 121 S
% awk '$1=="P"{n=split($3,a,","); print $2,n;}' tmp.gfa   #to get some stats on the number of segments per path
G1S1 172
G2S1 148
G3S1 160
G4S1 149
% time gfatk path -a tmp.gfa > tmp.gfa.fa   #time the actual command
gfatk path -a tmp.gfa > tmp.gfa.fa  0.00s user 0.00s system 76% cpu 0.010 total

% cut -f 1 prokka.gfa | sort | uniq -c   #to get some stats on the GFA file
   1 H
877962 L
   1 P
574074 S
% awk '$1=="P"{n=split($3,a,","); print $2,n;}' prokka.gfa   #to get some stats on the number of segments per path
G1S1 916827
% time gfatk path -a prokka.gfa > prokka.gfa.fa   #time the actual command
gfatk path -a prokka.gfa > prokka.gfa.fa  2.13s user 0.21s system 99% cpu 2.347 total

I'll update you later on the large sets!

Euphrasiologist commented 1 year ago

Thanks for the info! Super helpful - faster than I thought too, so that's something!

dirkjanvw commented 1 year ago

I just ran it on an Arabidopsis thaliana cDBG built from 8 genomes. These are some stats:

$ cut -f 1 arabidopsis.gfa | sort | uniq -c    #to get some stats on the GFA file
      1 H
18763264 L
    844 P
11726287 S
$ time gfatk path -a arabidopsis.gfa | fold -w60 > arabidopsis.fa    #time the actual command
real    348m27.145s
user    332m5.310s
sys     16m27.271s
$ seqkit stats arabidopsis.fa    #to get some stats on the resulting FASTA file
file            format  type  num_seqs      sum_len  min_len      avg_len     max_len
arabidopsis.fa  FASTA   DNA        844  962,397,596       86  1,140,281.5  30,427,671

This is clearly starting to push the limits, but it's still surpassing my expectations!

Euphrasiologist commented 1 year ago

That's a huge GFA! I'm glad to see it still has some utility on those. So it's actually not so slow on single paths (<30s) but you have over 800 there!

Thanks for these benchmarks, I'll probably put them in the README!

tolkit / gfatk

Obtain path sequences as specified in P-lines #14