smirarab / sepp-refs

Silva 138? #3

Open jwdebelius opened 4 years ago

jwdebelius commented 4 years ago

Is it possible to either get scripts to do the alignment or get the new Silva 138 release? Silva updates about annually, and it would be really nice to be able to update things that rely on sepp and a consistent database alongside that.

ahalhed commented 3 years ago

Yes please! I am hoping to run a fragment insertion analysis with SILVA but my OTUs were picked using a SILVA 138 reference database prepared for QIIME2.

ericsson-lab commented 3 years ago

Yes please! This would be amazing!

valentynbez commented 3 years ago

Hello, I've been trying to recreate a SeppReferenceTree artefact pipeline for Silva 138.1 from the repo.

  1. I downloaded Exports/SILVA_138.1_SSURef_NR99_tax_silva_trunc.fasta and Exports/taxonomy/tax_slv_ssu_138.1.tre.gz from the latest SILVA release.
  2. Then I ran nw_topology -bI to prepare the tree. I am interested in creating a SeppReferenceDatabase for the V4 region specifically.

For the moment I am particularly puzzled with the masking step from here.

Any help would be much appreciated! Thank you.

smirarab commented 3 years ago

It seems @diego92sigma6 has had some luck with this issue: https://github.com/smirarab/pasta/issues/61. Perhaps he can chime in.

For the moment I am particularly puzzled with the masking step from here.

* should I first filter reference sequences from SILVA to V4 region or do this masking step?

* how should I choose the masking length properly?

That masking step is meant to remove super-gappy sites from the alignment (not just retaining V4).

Is it possible to either get scripts to do the alignment or get the new Silva 138 release? Silva updates about annually and it would be really nice to be able to update things that rely on sepp and a consistent database alongside that

I'd be happy to share the scripts. In fact, I thought everything necessary is here already: https://github.com/smirarab/sepp-refs/tree/master/silva and https://github.com/smirarab/sepp/tree/master/sepp-package/buildref

When I last tried to use SILVA 138, I ran into the issue of non-monophyly of archaea. I didn't have time to follow up further on that.

diegomarquezp commented 3 years ago

Hi @crusher083, I hope you are well. I'm also trying to compile 138, so maybe we can share a couple of things. I manually performed an alignment using PASTA over the non-truncated dataset, which was successful, but now we are having trouble with SEPP because too many sequences are being used (the whole database, more than 2,000,000 sequences). @smirarab advised removing gappy sequences.

Since my dataset is too big to run on a desktop computer (12GB fasta), I had to create a small C++ program for gappy sequence filtering that uses streams to optimize resources. Filtering this 12GB alignment with 2 million sequences takes 120 seconds. I would be really happy to help with this program if you are in a similar situation with resource availability.

My current situation is that only 3 out of the 2 million sequences were 97+% gaps, so I'm following a second piece of advice from @smirarab to filter similar sequences which I will write here for convenience.

If the running time is still high, we can think about removing sequences that are too similar to each other. For doing that, I would suggest 99% similarity or something like that. You can also use our tool TreeCluster (https://github.com/niemasd/TreeCluster) to find the optimal subset given the tree you already have.

On my side, filtering to V4 only may be the step I was missing to reduce the dataset. I was wondering if you know of any tool to perform this task.

I'm happy to help with anything you need.

smirarab commented 3 years ago

Diego, you may have misunderstood what I asked for filtering. I was advising removing sites (so columns) not species (rows) that have more than 99.5% gaps. Did you try simply removing gappy sites?
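
For anyone following along, the site masking described here (dropping alignment columns, not sequences, once their gap fraction passes a threshold) can be sketched in a few lines of Python. This is only an illustration under assumptions: the name `mask_gappy_sites`, the in-memory dict input, and treating both `-` and `.` as gap characters are choices made for this example, and it is not the script used in sepp-refs.

```python
def mask_gappy_sites(alignment, max_gap_fraction=0.995, gap_chars="-."):
    """Drop columns whose gap fraction exceeds max_gap_fraction.

    alignment: dict mapping sequence name -> aligned sequence (equal lengths).
    Returns a new dict with the same names and only the retained columns.
    """
    names = list(alignment)
    if not names:
        return {}
    length = len(alignment[names[0]])
    if any(len(alignment[n]) != length for n in names):
        raise ValueError("all aligned sequences must have the same length")

    # Count gaps per column in a single pass over the rows.
    gap_counts = [0] * length
    for seq in alignment.values():
        for i, ch in enumerate(seq):
            if ch in gap_chars:
                gap_counts[i] += 1

    # Keep a column only if its gap fraction is at or below the threshold.
    n_rows = len(names)
    keep = [i for i, g in enumerate(gap_counts) if g / n_rows <= max_gap_fraction]
    return {n: "".join(alignment[n][i] for i in keep) for n in names}


# Toy alignment (hypothetical data); a low threshold is used so the effect is visible.
toy = {"a": "AC-G-", "b": "A--G-", "c": "AT-G-", "d": "A--GT"}
masked = mask_gappy_sites(toy, max_gap_fraction=0.7)
print(masked)
```

Every row survives and only columns are dropped, which is the difference from filtering gappy sequences. An alignment of the size discussed in this thread would not fit in a dict like this; the same logic would need two streaming passes over the file (one to count gaps per column, one to write the kept columns).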

diegomarquezp commented 3 years ago

Oh, that's my bad! It makes sense now. I will remove the gappy sites and see how it goes.

diegomarquezp commented 3 years ago

Hi Siavash, I'm getting very close to having the reference. I was wondering if you could please point me to a resource for understanding the step of rooting on the lowest common ancestor of archaea. I'm honestly a bit lost here. Does this mean associating the RAxML output tree with another preexisting one? And which tools would you use to perform this? Thank you!

smirarab commented 3 years ago

Do you have a file that tells you the taxonomy for all of the species in your sample? If so, we need to root at the LCA of archaea. I'd be happy to help with this step if you have that mapping file.

One issue that I faced was that archaea was actually not monophyletic in the previous version, but there are ways around that.
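
To make the two ideas in this comment concrete (the LCA of archaea, and whether archaea is monophyletic), here is a small self-contained Python sketch. It assumes a toy tree written as nested tuples with leaf names as strings; the helper names `leaves`, `lca`, and `is_monophyletic` and the toy leaf labels are invented for illustration. A real reference tree would be handled with a phylogenetics library or a tree-rerooting tool, and the check depends on how the tree is currently rooted.

```python
def leaves(node):
    """Return the set of leaf names under a node (a leaf is a plain string)."""
    if isinstance(node, str):
        return {node}
    found = set()
    for child in node:
        found |= leaves(child)
    return found


def lca(tree, targets):
    """Return the smallest subtree of `tree` that contains every target leaf."""
    targets = set(targets)
    if not targets or not targets <= leaves(tree):
        raise ValueError("targets must be a non-empty subset of the tree's leaves")
    current = tree
    while not isinstance(current, str):
        for child in current:
            if targets <= leaves(child):
                current = child  # all targets sit below this child, so descend
                break
        else:
            return current  # targets are split across children: this is the LCA
    return current


def is_monophyletic(tree, targets):
    """True if the LCA of the targets has no leaves outside the target set."""
    return leaves(lca(tree, targets)) == set(targets)


# Toy trees with hypothetical leaf names.
clean = ((("arc1", "arc2"), "arc3"), (("bac1", "bac2"), "bac3"))
mixed = ((("arc1", "bac1"), "arc2"), ("bac2", "bac3"))

print(lca(clean, {"arc1", "arc2", "arc3"}))
print(is_monophyletic(clean, {"arc1", "arc2", "arc3"}))
print(is_monophyletic(mixed, {"arc1", "arc2"}))
```

In the second toy tree a bacterial leaf sits inside the smallest clade holding both archaeal leaves, which is the kind of non-monophyly mentioned above: rooting at that LCA would pull the bacterial sequence in with the archaea.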

diegomarquezp commented 3 years ago

Hi Siavash, I think we can use this taxonomy file (https://www.arb-silva.de/fileadmin/silva_databases/release_138_1/Exports/taxonomy/taxmap_slv_ssu_ref_nr_138.1.txt.gz), which contains the accession and semicolon-separated taxonomy path for each entry. My input for the RAxML steps was the full aligned sequences from SILVA (I ended up using these) with the 99.99%-gap sites removed, together with a tree generated by FastTree. I decided to build the tree manually instead of using this one (https://www.arb-silva.de/fileadmin/silva_databases/release_138_1/Exports/taxonomy/tax_slv_ssu_138.1.tre.gz) because some accessions were associated with the same taxa, producing undesired results in the RAxML steps. The produced tree has accessions as nodes. I think this is correct because the sepp-ref for 128 is also based on an accession tree. I will expose a public folder with the results so far once the branch length step is done. I will let you know. Thank you!
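
As a rough illustration of how such a mapping file could be used to collect the archaeal accessions needed for rooting, here is a short Python sketch. It assumes the file is tab-separated with a header row containing the columns `primaryAccession`, `start`, `stop`, and `path`, and that tree leaves are named `accession.start.stop`; both points should be checked against the actual download. The function name `archaea_ids` and the toy rows are made up for the example.

```python
import csv
import io


def archaea_ids(handle, domain="Archaea"):
    """Collect 'accession.start.stop' IDs whose taxonomy path starts with `domain`.

    handle: a text file object over a tab-separated table with a header row
    (assumed columns: primaryAccession, start, stop, path).
    """
    reader = csv.DictReader(handle, delimiter="\t")
    ids = []
    for row in reader:
        # The taxonomy path is semicolon-separated; the first field is the domain.
        if row["path"].split(";")[0] == domain:
            ids.append(f'{row["primaryAccession"]}.{row["start"]}.{row["stop"]}')
    return ids


# Toy table with invented accessions, mimicking the assumed layout.
toy_table = (
    "primaryAccession\tstart\tstop\tpath\torganism_name\ttaxid\n"
    "AB000001\t1\t1450\tArchaea;Crenarchaeota;\tuncultured archaeon\t100\n"
    "AB000002\t5\t1500\tBacteria;Firmicutes;\tBacillus sp.\t200\n"
    "AB000003\t10\t1400\tArchaea;Euryarchaeota;\tuncultured archaeon\t300\n"
)
toy_ids = archaea_ids(io.StringIO(toy_table))
print(toy_ids)
```

For the real gzipped file, the same function could be given `gzip.open(path, "rt", newline="")` as the handle. The resulting list is the set of leaves whose LCA would be used for rooting.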

smirarab commented 2 years ago

Hi Diego,

Sorry for my long silence on this. Was this successful? Any help needed from my side? Could I point people to the output of your work?

Thanks Siavash

diegomarquezp commented 2 years ago

Hi Siavash. I had to stop this project for a while. What I have so far is the tree with branch lengths from the full aligned sequences of SILVA, but with many repeated sequences. The only issue left to solve, besides rooting on archaea, is to redo it with the reduced sequences. About 40k sequences are repeated according to RAxML (even in the original, non-trimmed dataset). I think my server won't be busy for a while, so I can start redoing the tree over the non-repeated dataset (masked-dna-sequences-accession.fasta.reduced).
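
For context, the kind of reduction meant here (keeping one representative per set of exactly identical aligned sequences) can be sketched as below. This is a minimal illustration, not RAxML's implementation of its `.reduced` output; the function name `drop_duplicate_sequences`, the case-insensitive comparison, and the toy records are assumptions made for the example.

```python
def drop_duplicate_sequences(records):
    """Keep the first record of each distinct sequence.

    records: iterable of (name, aligned_sequence) pairs.
    Returns (kept, duplicates), where `kept` is a list of (name, sequence)
    pairs and `duplicates` maps each kept name to the names dropped as
    exact copies of it.
    """
    first_seen = {}   # normalized sequence -> name of the record that was kept
    kept = []
    duplicates = {}
    for name, seq in records:
        key = seq.upper()  # assumption: compare case-insensitively
        if key in first_seen:
            duplicates.setdefault(first_seen[key], []).append(name)
        else:
            first_seen[key] = name
            kept.append((name, seq))
    return kept, duplicates


# Toy records with invented names.
toy_records = [
    ("s1", "AC-GT"),
    ("s2", "AC-GT"),
    ("s3", "AC-GA"),
    ("s4", "ac-gt"),
]
kept, duplicates = drop_duplicate_sequences(toy_records)
print(kept)
print(duplicates)
```

Keeping the `duplicates` map makes it possible to place the dropped accessions back next to their representative later if they are needed in the final tree. With millions of sequences, storing a hash of each sequence instead of the sequence itself would reduce memory use.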

Here is a public gdrive folder with my work. Hope this helps.