uoguelph-mlrg / PROTAX-GPU

GPU-accelerated DNA barcode classification

Usage instructions? #24

Closed hjarnek closed 1 month ago

hjarnek commented 1 month ago

Hi,

This looks super cool, but do you also plan to publish instructions on how to use it? :)

Another small question. You state in your paper for PROTAX-GPU that...

[...] before computing sequence distances, following [13], the query and reference sequences are first translated into amino acid sequences, then aligned using a hidden Markov model (HMM), and finally back-translated into nucleotide sequences.

What's the idea with this step? How does that work with non-protein-coding genes, like rRNA genes, and how does it account for different genetic translation tables?

Cheers

gwtaylor commented 1 month ago

Hi @hjarnek, thank you for your interest in PROTAX-GPU and for reaching out!

Documentation Updates

We acknowledge that our current documentation is lacking, and we're actively working on improving it. The lead developer is in the process of transitioning dependencies from the Chex library to Flax to better integrate with the JAX ecosystem. This change aims to streamline future development and enhance performance. We delayed updates to the documentation during this transition but expect to release comprehensive instructions soon, irrespective of our progress on Flax.

Translation and Alignment Steps

Regarding your question about the translation and alignment steps mentioned in our paper:

"[...] before computing sequence distances, following [13], the query and reference sequences are first translated into amino acid sequences, then aligned using a hidden Markov model (HMM), and finally back-translated into nucleotide sequences."

Purpose of Translation and Back-Translation

This approach is commonly used for protein-coding genes like the cytochrome c oxidase I (COI) gene. Here's why:

You can find more details on this method on pages 42 and 44 of this guide.
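To make the back-translation step in the quoted pipeline concrete, here is a minimal sketch of how a gapped amino-acid alignment row can be mapped back onto the original nucleotides, codon by codon. It is an illustration only, not the PROTAX-GPU implementation: the function name `back_translate` and the example sequences are hypothetical, the HMM alignment itself is assumed to have been done already, and the input is assumed to be in frame with no frameshifts.

```python
def back_translate(aligned_protein: str, nucleotides: str, gap: str = "-") -> str:
    """Thread the codons of `nucleotides` onto a gapped protein alignment row.

    Hypothetical helper for illustration; assumes the nucleotide sequence is
    in frame and encodes exactly the residues in `aligned_protein`.
    """
    if len(nucleotides) % 3 != 0:
        raise ValueError("nucleotide length must be a multiple of 3")
    codons = [nucleotides[i:i + 3] for i in range(0, len(nucleotides), 3)]
    residues = sum(ch != gap for ch in aligned_protein)
    if residues != len(codons):
        raise ValueError("alignment row and nucleotide sequence disagree in length")
    codon_iter = iter(codons)
    # Each residue consumes one codon; each protein gap becomes three nucleotide gaps.
    return "".join(gap * 3 if ch == gap else next(codon_iter) for ch in aligned_protein)


# Made-up example: a protein alignment row with one gap column.
query_row = "MK-L"
query_nt = "ATGAAACTT"
print(back_translate(query_row, query_nt))  # ATGAAA---CTT
```

Because gaps are only ever inserted in multiples of three, the resulting nucleotide alignment keeps codons intact, which is the usual motivation for aligning protein-coding markers at the amino-acid level.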

Applicability to Non-Protein-Coding Genes

For non-protein-coding genes like rRNA genes or the Internal Transcribed Spacer (ITS) regions:

Genetic Translation Tables

Regarding different genetic translation tables:

Alignment and PROTAX-GPU

It's important to note:

gwtaylor commented 1 month ago

Closing as duplicate of #13

hjarnek commented 1 month ago

Ok, thanks for the quick answer @gwtaylor. Looking forward to being able to test this program out.

About the genetic codes: When working with environmental/metagenomic samples, which is what PROTAX is targeted at, and mitochondrial protein-coding marker genes like COI, you almost always have a wide mix of genetic codes present in the sample, the most common being the vertebrate (code 2), invertebrate (code 5), and protozoan (code 4) mitochondrial codes. Not accounting for that seems a bit rough, but I'm not sure how it could be done either. The translational differences between these codes are small, so it might not make much of a difference for the alignment, and the protein translation and back-translation might still help with the nucleotide alignment. But that's not a given, and if left uncorrected it introduces some bias, which makes me uncomfortable.
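To show how small the differences between these codes are, here is a short sketch that builds NCBI translation tables 2, 4, and 5 as overrides on the standard code and lists the codons on which they disagree. The helper names (`codon_table`, `translate`, `differing_codons`) are made up for this illustration, and the example sequence is not taken from any real dataset.

```python
from itertools import product

# NCBI standard code (table 1), codons ordered with bases T, C, A, G.
BASES = "TCAG"
STANDARD_AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = ["".join(c) for c in product(BASES, repeat=3)]
STANDARD = dict(zip(CODONS, STANDARD_AA))

# Differences from the standard code for the three tables discussed above.
OVERRIDES = {
    2: {"AGA": "*", "AGG": "*", "ATA": "M", "TGA": "W"},  # vertebrate mitochondrial
    4: {"TGA": "W"},                                      # mold/protozoan mitochondrial
    5: {"AGA": "S", "AGG": "S", "ATA": "M", "TGA": "W"},  # invertebrate mitochondrial
}


def codon_table(code: int) -> dict:
    """Return the codon-to-amino-acid map for translation table `code`."""
    return {**STANDARD, **OVERRIDES[code]}


def translate(seq: str, code: int) -> str:
    """Translate an in-frame DNA string; unknown codons become 'X'."""
    table = codon_table(code)
    usable = len(seq) - len(seq) % 3
    return "".join(table.get(seq[i:i + 3], "X") for i in range(0, usable, 3))


def differing_codons(codes=(2, 4, 5)) -> dict:
    """Map each codon that is not translated identically under all `codes`."""
    tables = {c: codon_table(c) for c in codes}
    return {
        codon: tuple(tables[c][codon] for c in codes)
        for codon in CODONS
        if len({tables[c][codon] for c in codes}) > 1
    }


print(differing_codons())  # {'ATA': ('M', 'I', 'M'), 'AGA': ('*', 'R', 'S'), 'AGG': ('*', 'R', 'S')}
for code in (2, 4, 5):
    print(code, translate("ATAAGATGA", code))  # M*W, IRW, MSW
```

Only three of the 64 codons (ATA, AGA, AGG) differ among these tables, but AGA and AGG are stop codons under code 2, so translating invertebrate or protozoan sequences with the vertebrate table can introduce spurious stops before alignment.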

gwtaylor commented 1 month ago

I understand your concern. The issue did not come up for us in the two datasets we tested, since FinPROTAX is an arthropod dataset and, to my knowledge, the 7.8M sequences extracted from BOLD were mostly or all invertebrates. For working with environmental/metagenomic samples with a wider mix of genetic codes, I would ask @psomervuo to comment.