prophyle / prophasm

ProphAsm – a rapid computation of simplitigs directly from k-mer sets
MIT License
26 stars 6 forks source link

Request: presence/absence matrix #10

Open SionBayliss opened 3 years ago

SionBayliss commented 3 years ago

Great tool! I have a quick feature request. An option to output a presence/absence matrix or similar output would be very useful.

On a similar note, do you know of any software that would achieve this in a scalable manner for application to large prokaryotic datasets?

karel-brinda commented 3 years ago

Hi Sion, thanks for using ProphAsm! How exactly do you envision this presence/absence matrix to look like? Would it be N x N matrix of files with simplitigs from shared k-mers across pairs of genomes?

SionBayliss commented 3 years ago

Hi Karel, thanks for the quick reply. I was thinking of something similar but with simpltigs x sample showing simplitig presence/absence in each sample. If this info is not calculated in prophasm it might be better achieved using external programs. I have used untig-caller to achieve this (which relies on suffix arrays (FM-index) provided by SeqAn3). Maybe a description in the README of this process would suffice?

anwarMZ commented 2 years ago

@karel-brinda any thoughts on adding this feature? something like unitig-caller would be great. @SionBayliss did you find another tool that can generate the matrix?

SionBayliss commented 2 years ago

@anwarMZ It was a long time ago that I worked on this so I will summarise what my (incomplete) notes say I did at the time. Please note that this was most likely insufficient for a complete analysis but might be useful to you to expand upon in the future.

I used unitig-caller to convert the workflow to one I was familiar with:

# starting with a unitig file generated from my complete bacterial genome collection using bcalm (I think) 
mkdir simplitig_pop
prophasm -i ./unitigs_unitigs.fasta -k 31 -o simplitig_pop/simplitigs.fa -s simplitig_pop/k-mer_stats.txt

# make into single line fasta - expected format
awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' < simplitig_pop/simplitigs.fa > simplitig_pop/simplitigs.fa1 && mv simplitig_pop/simplitigs.fa1 simplitig_pop/simplitigs.fa

# call unitigs using unitig caller - input if a list of two columns (col1 = name, col2 = path)
# this --simple flag uses string matching
unitig-caller --simple --refs unitig.named.list --unitigs simplitig_pop/simplitigs.fa --output simplitig_pop/counts

# homebrew script to convert to a PA matrix
../../scripts/convert_kmer_to_tab -i ./simplitig_pop/counts_pyseer.txt -o ./simplitig_pop/counts_pyseer.rtab -s sample.list # output in pyseer format

I believe that the limitation here was that simplitigs are effectively concatenations of unitigs (a stochastic choice of a single path in a De Brujin graph constructed from k-mers) and as such might not actually exist in any samples in your collection (or might exist but be interrupted by another genetic element). Simple string matching may not be sufficient to identify the presence of a simplitig in a sample unless it includes partial matching as it might be comprised of two or more unitigs.@karel-brinda might be able to correct me or expand upon this.

Hope that helped!