Usage instructions? - GitHub Issues

hjarnek commented 1 month ago

Hi,

This looks super-cool, but do you plan to publish any instructions on how to use it also? :)

Another small question. You state in your paper for PROTAX-GPU that...

[...] before computing sequence distances, following [13], the query and reference sequences are first translated into amino acid sequences, then aligned using a hidden Markov model (HMM), and finally back-translated into nucleotide sequences.

What's the idea with this step? How does that work with non-protein-coding genes, like rRNA genes, and how does it account for different genetic translation tables?

Cheers

gwtaylor commented 1 month ago

Hi @hjarnek, thank you for your interest in PROTAX-GPU and for reaching out!

Documentation Updates

We acknowledge that our current documentation is lacking, and we're actively working on improving it. The lead developer is in the process of transitioning dependencies from the Chex library to Flax to better integrate with the JAX ecosystem. This change aims to streamline future development and enhance performance. We delayed updates to the documentation during this transition but expect to release comprehensive instructions soon, irrespective of our progress on Flax.

Translation and Alignment Steps

Regarding your question about the translation and alignment steps mentioned in our paper:

"[...] before computing sequence distances, following [13], the query and reference sequences are first translated into amino acid sequences, then aligned using a hidden Markov model (HMM), and finally back-translated into nucleotide sequences."

Purpose of Translation and Back-Translation

This approach is commonly used for protein-coding genes like the cytochrome c oxidase I (COI) gene. Here's why:

Conservation of Amino Acid Sequences: Amino acid sequences are more conserved than nucleotide sequences because the genetic code is redundant: many nucleotide substitutions are synonymous and leave the amino acid unchanged. This means that aligning amino acid sequences can be more reliable.
Improved Alignment Accuracy: Translating nucleotide sequences into amino acids allows for better alignment using HMMs tailored for protein sequences.
Back-Translation: After alignment, the sequences are back-translated to nucleotides to retain the original genetic information for downstream analyses, like PROTAX-GPU.

You can find more details on this method on pages 42 and 44 of this guide.
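To make the back-translation step concrete, here is a minimal sketch in Python. It is not the code used in the paper or in [13]; the function name `back_translate` is hypothetical, and it assumes the gapped protein row has already been produced by an external HMM aligner. The idea is simply to thread the original codons onto the protein alignment, so each amino acid gap becomes a three-nucleotide gap.

```python
def back_translate(aligned_protein: str, nucleotides: str, gap: str = "-") -> str:
    """Thread the original codons onto one gapped row of a protein alignment.

    aligned_protein: protein sequence with gap characters, as produced by an
                     external aligner (assumed, not part of PROTAX-GPU).
    nucleotides:     the ungapped, in-frame nucleotide sequence it was
                     translated from.
    """
    residues = sum(1 for aa in aligned_protein if aa != gap)
    if len(nucleotides) != 3 * residues:
        raise ValueError("nucleotide length must be 3x the number of residues")

    out = []
    pos = 0
    for aa in aligned_protein:
        if aa == gap:
            # One amino acid gap corresponds to three nucleotide gaps.
            out.append(gap * 3)
        else:
            # Copy the original codon, so synonymous differences are preserved.
            out.append(nucleotides[pos:pos + 3])
            pos += 3
    return "".join(out)


print(back_translate("M-KV", "ATGAAAGTT"))  # ATG---AAAGTT
```

Because the original codons are copied rather than re-derived from the amino acids, the nucleotide-level variation that the distance computation relies on is kept intact.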

Applicability to Non-Protein-Coding Genes

For non-protein-coding genes like rRNA genes or the Internal Transcribed Spacer (ITS) regions:

No Translation Step: Since these genes do not code for proteins, the concept of translating to amino acids doesn't apply.
Direct Nucleotide Alignment: These sequences are typically aligned directly at the nucleotide level using tools designed for non-coding DNA.
HMM Alignment: HMMs can still be used for alignment, but in this context the model is built for nucleotide sequences rather than protein sequences.

Genetic Translation Tables

Regarding different genetic translation tables:

Customization Needed: The translation step would indeed need to be adapted if you're working with organisms that use alternative genetic codes.
Software Configuration: Alignment tools often allow you to specify the genetic code, ensuring accurate translation for various organisms.
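As an illustration of why the choice of table matters, the sketch below translates the same nucleotide string under NCBI translation tables 1 (standard), 2 (vertebrate mitochondrial), 4 (mold/protozoan mitochondrial) and 5 (invertebrate mitochondrial). It is a self-contained approximation written for this thread, not part of PROTAX-GPU: the helper names are hypothetical, only sense-codon differences are modelled (alternative start codons are ignored), and only these four tables are included.

```python
from itertools import product

BASES = "TCAG"
# Standard code (NCBI table 1), codons ordered TTT, TTC, TTA, TTG, TCT, ...
STANDARD_AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), STANDARD_AAS)}

# Differences from the standard code for a few mitochondrial tables.
OVERRIDES = {
    1: {},
    2: {"AGA": "*", "AGG": "*", "ATA": "M", "TGA": "W"},  # vertebrate mito
    4: {"TGA": "W"},                                      # mold/protozoan mito
    5: {"AGA": "S", "AGG": "S", "ATA": "M", "TGA": "W"},  # invertebrate mito
}


def codon_table(code: int) -> dict:
    """Return the codon -> amino acid mapping for one of the tables above."""
    return {**STANDARD, **OVERRIDES[code]}


def translate(seq: str, code: int) -> str:
    """Translate an in-frame nucleotide sequence; unknown codons become 'X'."""
    table = codon_table(code)
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    return "".join(table.get(seq[i:i + 3], "X") for i in range(0, usable, 3))


def differing_codons(code_a: int, code_b: int) -> list:
    """List the codons that the two tables translate differently."""
    a, b = codon_table(code_a), codon_table(code_b)
    return sorted(codon for codon in a if a[codon] != b[codon])


seq = "ATGATATGAAGA"
for code in (1, 2, 4, 5):
    print(code, translate(seq, code))
# 1 MI*R
# 2 MMW*
# 4 MIWR
# 5 MMWS

print(differing_codons(2, 5))  # ['AGA', 'AGG']
```

In this sketch, tables 2 and 5 disagree on only two codons, but one of those disagreements is a stop codon versus serine, which is the kind of difference that can truncate a translation if the wrong table is applied.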

Alignment and PROTAX-GPU

It's important to note:

Alignment Outside the Codebase: The alignment code we talked about above is not in our codebase. PROTAX-GPU assumes that your sequences are already aligned. The software focuses on taxonomic assignment rather than sequence alignment.
Preprocessing Required: You'll need to perform alignment using appropriate tools (e.g., MAFFT, MUSCLE) before inputting your data into PROTAX-GPU.
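Since the alignment happens outside PROTAX-GPU, a quick sanity check before handing data over can catch unaligned input early. The sketch below is an assumption-laden example rather than anything from our codebase: the function names are hypothetical, it assumes FASTA input, and it only verifies that every row has the same length. The exact input format PROTAX-GPU expects is not described in this thread.

```python
def read_fasta(text: str) -> dict:
    """Parse FASTA text into {name: sequence}; the name is the first header word."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""
            records[name] = ""
        elif name is not None:
            # Sequence lines may be wrapped, so concatenate them.
            records[name] += line
    return records


def check_aligned(records: dict) -> int:
    """Return the alignment width, or raise if rows differ in length."""
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) != 1:
        raise ValueError(f"sequences are not aligned; row lengths found: {sorted(lengths)}")
    return lengths.pop()


aligned = ">query\nATG---AAA\n>ref1\nATGCCCAAA\n"
print(check_aligned(read_fasta(aligned)))  # 9
```

Equal row lengths are a necessary condition for an alignment, not proof of a good one, so this check complements rather than replaces inspecting the output of MAFFT, MUSCLE, or an HMM-based aligner.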

gwtaylor commented 1 month ago

Closing as duplicate of #13

hjarnek commented 1 month ago

Ok, thanks for the quick answer @gwtaylor. Looking forward to being able to test this program out.

About the genetic codes: When working with environmental/metagenomic samples, which is what PROTAX is targeted at, and with mitochondrial protein-coding marker genes like COI, you almost always have a wide mix of different genetic codes present in the sample, the most common being vertebrates (code 2), invertebrates (code 5), and protozoans (code 4). Not accounting for that seems a bit rough, but I'm not sure how it could be done either. The translational differences between these codes are small, so it might not make much of a difference for the alignment, and the protein translation-backtranslation might still help with the nucleotide alignment. But that's not a given, and if left uncorrected it introduces some bias, which makes me uncomfortable.

gwtaylor commented 1 month ago

I understand your concern. The issue did not come up for us in the two datasets we tested since FinPROTAX is an arthropod dataset and, to my knowledge, the 7.8M sequences extracted from BOLD were mostly or all invertebrates. For working with environmental/metagenomic samples with a wider mix of genetic codes I would ask @psomervuo to comment.

uoguelph-mlrg / PROTAX-GPU

Usage instructions? #24