marschall-lab / project-male-assembly

HGSVC SIG: targeted chromsome Y assembly
MIT License
8 stars 1 forks source link

Comparing fasta sequences #11

Closed pilleh closed 2 years ago

pilleh commented 2 years ago

@ptrebert Btw, were you serious about a one-liner for comparing 2 fasta sequences? Can you write this (can be multiple lines as well :P) in like few minutes? It would be even better if it could take as input multiple fasta sequences (an alignment essentially) and then output the site across all sequences + position in the sequence in case there is a difference between the sequences. Maybe in a similar format to vcf - one site per row: position, gt/sample1, gr/sample1 etc. I can kind of think of a way of writing it but in reality it would take me more time than I would like.

ptrebert commented 2 years ago

can you prep a (mock) example file(s)? A vcf-like format only makes sense if all sequences share the same coordinate system (i.e., all have the same length, or, by convention, all start at base 1/position 0 at least). What sequences do you want to compare?

pilleh commented 2 years ago

@ptrebert Okay, so one place where I would use it is to compare the chrYs of completely assembled samples (well, the main, PAR1-PAR2 contig really). For example HG00358 is complete and one such sample. I can share with you a file where I've just kept the PAR1-PAR2 sequences? These contigs have identical length so I kind of assume them to be identical but it would be good to check.

I also want to align palindrome arms to see if I can roughly identify the inversion breakpoints. I'll prepare a small mock one now, so essentially it would be a multiple seq alignment. Ideally they will all have the same length. But let me make a mock file.

pilleh commented 2 years ago

7samples_P8_arms_names_mafft_fsta.txt

This is an alignment in fasta format, as an example. I also wonder - there must be existing tools that would just keep the variable sites from an alignment, with the location info in the sequence... It's just that I don't know which.

ptrebert commented 2 years ago

Because I couldn't sleep, I cobbled something together (see workflow/scripts/compfa/compare_fa_aln.py). The script only requires Python3 to run, no third-party libraries. It prints a tabular summary, by default omitting all positions where all sequences are identical. Sequences can be in a single or multiple files, unequal length should also work, but only makes sense if the joint enumeration starting at the first base is meaningful. The script includes a few parameters to adjust behavior, see --help

For your example file, the script called like this

./compare_fa_aln.py -i 7samples_P8_arms_names_mafft_fsta.txt --first-index 0 --quiet --uppercase

produces output like that:

image

Let me know if you need any changes.

pilleh commented 2 years ago

I didn't realise you not sleeping well means that you get up and do stuff. It must be quite annoying! But thanks a lot for this, this is awesome, it works beautifully! :)

ptrebert commented 2 years ago

please reopen if changes are needed