Closed by hjarnek 1 month ago
Hi @hjarnek, thank you for your interest in PROTAX-GPU and for reaching out!
Documentation Updates
We acknowledge that our current documentation is lacking, and we're actively working on improving it. The lead developer is in the process of transitioning dependencies from the Chex library to Flax to better integrate with the JAX ecosystem. This change aims to streamline future development and enhance performance. We delayed updates to the documentation during this transition but expect to release comprehensive instructions soon, irrespective of our progress on Flax.
Translation and Alignment Steps
Regarding your question about the translation and alignment steps mentioned in our paper:
"[...] before computing sequence distances, following [13], the query and reference sequences are first translated into amino acid sequences, then aligned using a hidden Markov model (HMM), and finally back-translated into nucleotide sequences."
Purpose of Translation and Back-Translation
This approach is commonly used for protein-coding genes like the cytochrome c oxidase I (COI) gene. Here's why:
You can find more details on this method on pages 42 and 44 of this guide.
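For illustration only, the back-translation step from the quoted passage can be sketched in a few lines of Python. This is not PROTAX-GPU's code: `back_translate` is a hypothetical helper, and it assumes the nucleotide sequence is already in frame, that its length is exactly three times the number of residues, and that the HMM alignment marks gaps with `-`.

```python
def back_translate(aligned_protein: str, nucleotides: str) -> str:
    """Thread the original codons onto one gapped row of an amino acid alignment.

    Hypothetical helper: each residue is replaced by the codon it was
    translated from, and each alignment gap becomes a three-base gap.
    """
    n_residues = sum(1 for aa in aligned_protein if aa != "-")
    if len(nucleotides) != 3 * n_residues:
        raise ValueError("nucleotide length must be 3x the number of residues")

    out = []
    pos = 0  # index of the next unused codon in the original sequence
    for aa in aligned_protein:
        if aa == "-":
            out.append("---")  # keep gaps codon-sized so the frame is preserved
        else:
            out.append(nucleotides[pos:pos + 3])
            pos += 3
    return "".join(out)


print(back_translate("M-K", "ATGAAA"))  # ATG---AAA
```

Because the original codons are copied back rather than re-derived from the amino acids, the nucleotide sequence itself is unchanged; only the gap positions come from the protein alignment.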
Applicability to Non-Protein-Coding Genes
For non-protein-coding genes like rRNA genes or the Internal Transcribed Spacer (ITS) regions:
Genetic Translation Tables
Regarding different genetic translation tables:
Alignment and PROTAX-GPU
It's important to note:
Closing as duplicate of #13
Ok, thanks for the quick answer @gwtaylor. Looking forward to being able to test this program out.
About the genetic codes: when working with the environmental/metagenomic samples that PROTAX targets, and with mitochondrial protein-coding marker genes like COI, you almost always have a wide mix of genetic codes present in the sample, the most common being vertebrate (code 2), invertebrate (code 5), and protozoan (code 4). Not accounting for that seems a bit rough, but I'm not sure how it could be done either. The translational differences between these codes are small, so it might not make much of a difference for the alignment, and the translation and back-translation step might still help with the nucleotide alignment. But that's not a given, and leaving it uncorrected introduces some bias, which makes me uncomfortable.
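To make the size of these differences concrete, here is a small Python sketch that lists the codons read differently by the three codes mentioned above. The table contents are hard-coded from the NCBI genetic code tables as I understand them (only sense/stop assignments, ignoring alternative start codons) and are worth checking against the NCBI listing; `codon_meaning` and `differing_codons` are hypothetical helper names, not part of PROTAX.

```python
# Codons whose meaning differs from the standard code (NCBI table 1)
# in at least one of tables 2, 4 and 5. "*" marks a stop codon.
STANDARD = {"AGA": "R", "AGG": "R", "ATA": "I", "TGA": "*"}

# Assumed per-table reassignments relative to the standard code.
OVERRIDES = {
    2: {"AGA": "*", "AGG": "*", "ATA": "M", "TGA": "W"},  # vertebrate mitochondrial
    4: {"TGA": "W"},                                      # mold/protozoan mitochondrial
    5: {"AGA": "S", "AGG": "S", "ATA": "M", "TGA": "W"},  # invertebrate mitochondrial
}


def codon_meaning(codon: str, table: int) -> str:
    """Amino acid (or '*') for one of the variable codons under a given table."""
    return {**STANDARD, **OVERRIDES[table]}[codon]


def differing_codons(table_a: int, table_b: int) -> list[str]:
    """Variable codons that the two tables translate differently."""
    return sorted(
        c for c in STANDARD
        if codon_meaning(c, table_a) != codon_meaning(c, table_b)
    )


for a, b in [(2, 5), (2, 4), (4, 5)]:
    print(a, b, differing_codons(a, b))
```

Under these assumptions, at most three of the 64 codons differ between any two of the codes, which is consistent with the differences being small, though a codon read as a stop under the wrong table could still disrupt a translation.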
I understand your concern. The issue did not come up for us in the two datasets we tested since FinPROTAX is an arthropod dataset and, to my knowledge, the 7.8M sequences extracted from BOLD were mostly or all invertebrates. For working with environmental/metagenomic samples with a wider mix of genetic codes I would ask @psomervuo to comment.
Hi,
This looks super-cool, but do you plan to publish any instructions on how to use it also? :)
Another small question. You state in your paper for PROTAX-GPU that...
What's the idea with this step? How does that work with non-protein-coding genes, like rRNA genes, and how does it account for different genetic translation tables?
Cheers