treangenlab / emu

MIT License

Sequences of tax-id #9

Open paddyhooper opened 1 month ago

paddyhooper commented 1 month ago

Hi there, I am transferring over issue #34 from GitLab: "Sequences of tax-id". Has there been any progress on obtaining nucleotide sequence data for the tax_ids that Emu identifies in a sample? This data would be really useful for phylogenetic analysis.

Thanks!

_Copy of GitLab Issue #34:

Issue: Hello, I would like to get the unique/representative sequences generated for taxonomic assignment so that I can do functional analysis with PICRUSt. Is there any solution? Thank you

Response: I see a couple of ways of doing this. The first would be to extract the reference sequences from the database. Each database has a species_taxid.fasta file, and each sequence id in that FASTA is in the format <tax_id>:<database name>:<counter>. Let's say you need all the sequences with tax_id 100. You could grab every sequence whose id carries that tax_id and generate a consensus sequence using other software.

However, there is one caveat here. Let's say there are 3 reference sequences for tax_id 100; call them strains a, b, and c. It is possible that all the reads in your sample align best to strain a. In this case, it would likely be better to use only the strain a sequence rather than a consensus of all 3. Currently, Emu is set up such that you cannot tell which reference sequence for a given tax_id gave the best alignment. If you need this distinction for your analysis, we will have to come up with a hack. Alternatively, you could create a database that has only one reference sequence for each tax_id.

Let me know if the approach described above does not meet your needs and we can think about other options._
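For anyone who wants to try the first approach from the quoted response, here is a minimal sketch of pulling every reference sequence for one tax_id out of a FASTA file. It relies only on the id format given above (`<tax_id>:<database name>:<counter>`); the function names, the `demo_db` database name, and the example records are made up for illustration and are not part of Emu.

```python
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (sequence_id, sequence) pairs from FASTA-formatted lines."""
    seq_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            # Keep only the first whitespace-delimited token of the header.
            header = line[1:].split()
            seq_id, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def sequences_for_taxid(lines: Iterable[str], tax_id) -> Dict[str, str]:
    """Return {sequence_id: sequence} for ids that start with '<tax_id>:'."""
    wanted = str(tax_id)
    return {
        sid: seq
        for sid, seq in read_fasta(lines)
        # Compare the whole first field so that 100 does not match 1000.
        if sid.split(":", 1)[0] == wanted
    }


# Hypothetical records in the id format described above.
example = """>100:demo_db:1
ACGT
ACGA
>100:demo_db:2
ACGTACGT
>1000:demo_db:3
TTTT
""".splitlines()

hits = sequences_for_taxid(example, 100)
for sid, seq in hits.items():
    print(sid, seq)
```

With a real database you would pass an open file handle for its species_taxid.fasta instead of `example`. The matching sequences would still need to be aligned and collapsed into a consensus with separate software, and the strain caveat from the response applies.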

rjain1990 commented 3 weeks ago

I am also following this. How can we retrieve the sequences of the assigned taxa for further analysis, such as PICRUSt or phylogeny? @paddyhooper did you find a solution or a method to do so? It would be really helpful to know.

paddyhooper commented 3 weeks ago

Hi @rjain1990, I haven't found any solutions yet. Any suggestions from the Emu team would be greatly appreciated!

kdc10 commented 2 days ago

An algorithm for this is actively being developed and tested.

In the meantime, here is a potential algorithm:

  1. Use the keep-read-assignments flag. Emu does not assign each read to a single species. Instead, it gives a probability distribution over species for each read. This flag creates an additional file with those probability distributions.
  2. Gather all reads for each species tax id. Some thought will need to go into deciding which reads go into which species bucket, e.g. does a read need at least 50% likelihood, or only the highest likelihood among the candidate species, to be classified as a given species?
  3. Create a consensus sequence from all reads in each species bucket.
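
As a rough illustration of steps 2 and 3, here is a sketch under stated assumptions, not the algorithm being developed. It assumes the read-assignment probabilities have already been loaded into a dict of the form `{read_id: {tax_id: probability}}` (the layout of Emu's output file is not specified here, so parsing is left out), and that the reads in a bucket have already been aligned to equal length by an external tool before the consensus is taken. The function names, read ids, and numbers are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def bucket_reads(
    read_probs: Dict[str, Dict[str, float]], min_prob: float = 0.5
) -> Dict[str, List[str]]:
    """Put each read in the bucket of its most likely tax_id.

    A read is kept only if that top likelihood is at least min_prob.
    Use min_prob=0.0 to keep every read under its most likely tax_id.
    """
    buckets = defaultdict(list)
    for read_id, probs in read_probs.items():
        if not probs:
            continue
        best_taxid = max(probs, key=probs.get)
        if probs[best_taxid] >= min_prob:
            buckets[best_taxid].append(read_id)
    return dict(buckets)


def majority_consensus(aligned: List[str]) -> str:
    """Column-wise majority vote over pre-aligned, equal-length sequences."""
    if not aligned:
        raise ValueError("need at least one sequence")
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be aligned to the same length")
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*aligned))


# Hypothetical per-read probability distributions.
read_probs = {
    "read1": {"100": 0.90, "200": 0.10},
    "read2": {"100": 0.45, "200": 0.55},
    "read3": {"100": 0.40, "200": 0.35, "300": 0.25},
}

strict = bucket_reads(read_probs, min_prob=0.5)   # read3 is dropped
lenient = bucket_reads(read_probs, min_prob=0.0)  # read3 goes to tax_id 100

# Hypothetical reads that are already aligned to one another.
consensus = majority_consensus(["ACGT", "ACGA", "ACTT"])
print(strict, lenient, consensus)
```

The `min_prob` threshold is where the decision from step 2 lives: 0.5 demands a true majority, while 0.0 accepts whichever species is most likely for the read.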

This will take some coding effort and some decision making. Let me know if this is along the lines of what you are looking for or if you have any questions.