harvardinformatics / degenotate

MIT License
40 stars 4 forks source link

Extract Conserved 4D sites from alignment #48

Closed simone-says closed 4 months ago

simone-says commented 4 months ago

This isn't an issue, but a question on functionality- is there a way to extract conserved 4D sites only from an MSA FASTA file of aligned CDS? The sequences of 4D sites I'm getting with degenotate are different lengths (so no longer aligned), and re-aligning would likely just create more gaps.

gwct commented 4 months ago

So you are running degenotate separately for each sequence in your alignments? Then yes, the 4D sites will be different based on the different annotations. We hadn't thought of using alignments as input since this is more of a tool for single genomes (outside of the MK tests). Currently, the only way around that would be to re-align them, though I'm not sure how well that would work since you've artificially removed sequence.

What would really be better, and may be something we could implement in the future, would be to let users input MSAs and have them assign a "reference" sequence. We could then use its annotations to extract columns of the alignment in which the reference site is 4D, though they may not be 4D in all species as your example demonstrates. But this would at least preserve the alignment and get sites homologous to the 4D ones in the reference species.

If that idea fits with your needs, a thing you could do in the meantime would be to separately extract 4D sites from whatever your reference species is in the alignment, remove that species from the alignment, and then add the 4D sites only back to the alignment using something like the --add option with MAFFT. Then you could trim your alignment to be only those where the 4D sites align.

Hopefully this helps, and I'm happy to discuss more.

simone-says commented 4 months ago

That's exactly what I was hoping for. Right now I'm using CDS alignments from genes extracted from WGS, but ideally I would like to extract homologous 4D sites from a MAF whole genome alignment using a reference annotation file. I could convert to FASTA, that part is straightforward, but the ability to extract homologous 4d sites for tree building and calculating phylogenetic distance is what I was looking for. I'll try extracting sites from the reference separately and adding in with MAFFT for now, thank you!

gwct commented 4 months ago

Yea I think that's the way to go for now. The more I think about implementing this in degenotate the more complicated it seems, but I may do it anyways sometime.

If your starting point is a MAF you might also want to check out msa_view within the PHAST package. It can extract 4d sites from the reference sequence of a MAF, but still may not be convenient for your purposes. For one, I don't think it outputs to FASTA format, so you'd have to convert from their .ss format somehow. Second, they just classify a 4-fold site as any 3rd positions of codons, which isn't exactly right. But maybe there's a way to make it work for your needs.

simone-says commented 4 months ago

Oh wow I didn't know PHAST was just extracting 3rd codon positions, I was trying to avoid using it because you have to chop up the MAF and GFF file to be per chromosome, and converting from their weird SS format is annoying. I don't think most people know that seeing how much PHAST is cited in the literature for extracting 4D sites, thank you! I'll go ahead and close this issue.

gwct commented 4 months ago

I stand corrected, msa_view does seem to take into account the genetic code.

Just wanted to clear that up in case someone reads this issue in the future.