Open jwdebelius opened 4 years ago
Yes please! I am hoping to run a fragment insertion analysis with SILVA but my OTUs were picked using a SILVA 138 reference database prepared for QIIME2.
Yes please! This would be amazing!
Hello, I've been trying to recreate a SeppReferenceTree artefact pipeline for Silva 138.1 from the repo.
For the moment I am particularly puzzled with the masking step from here.
The help would be much appreciated! Thank you.
It seems @diego92sigma6 has had some luck with this issue: https://github.com/smirarab/pasta/issues/61. Perhaps he can chime in.
For the moment I am particularly puzzled with the masking step from here.
* Should I first filter reference sequences from SILVA to the V4 region, or do this masking step?
* How should I choose the masking length properly?
That masking step is meant to remove super-gappy sites from the alignment (not just retaining V4).
Is it possible to either get scripts to do the alignment or get the new Silva 138 release? Silva updates about annually and it would be really nice to be able to update things that rely on sepp and a consistent database alongside that
I'd be happy to share the scripts. In fact, I thought everything necessary is here already: https://github.com/smirarab/sepp-refs/tree/master/silva and https://github.com/smirarab/sepp/tree/master/sepp-package/buildref
When I last tried to use SILVA 138, I ran into the issue of non-monophyly of archaea. I didn't have time to follow up further on that.
Hi @crusher083, I hope you are well. I'm also trying to build the reference for SILVA 138, so maybe we can share a couple of things. I manually ran a PASTA alignment over the non-truncated dataset, which was successful, but now we are having trouble with SEPP because too many sequences are being used (the whole database, more than 2,000,000 sequences). @smirarab advised removing gappy sequences.
Since my dataset is too big to process on a desktop computer (a 12 GB FASTA file), I wrote a small C++ program for gappy-sequence filtering that uses streams to keep resource usage low. Filtering this 12 GB alignment with 2 million sequences takes 120 seconds. I would be really happy to help with this program if you are in a similar situation with limited resources.
My current situation is that only 3 out of the 2 million sequences were 97+% gaps, so I'm following a second piece of advice from @smirarab to filter similar sequences which I will write here for convenience.
If the running time is still high, we can think about removing sequences that are too similar to each other. For doing that, I would suggest 99% similarity or something like that. You can also use our tool TreeCluster (https://github.com/niemasd/TreeCluster) to find the optimal subset given the tree you already have.
On my side, filtering to V4 only may be the step I was missing to reduce the dataset. Do you know of any tool to perform this task?
I'm happy to help with anything you need.
Diego, you may have misunderstood what I asked for filtering. I was advising removing sites (so columns) not species (rows) that have more than 99.5% gaps. Did you try simply removing gappy sites?
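The column-versus-row distinction above can be sketched as a two-pass filter over an aligned FASTA file. This is a minimal illustration, not the script used for the existing references: the function names, the choice to count both `-` and `.` as gaps, and the toy alignment are assumptions made for the example.

```python
import io

GAP_CHARS = "-."  # assumption: both characters mark gaps in the alignment


def read_fasta(handle):
    """Yield (header, sequence) pairs from an aligned FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def gappy_columns(records, max_gap_fraction=0.995):
    """First pass: return the indices of sites (columns) whose gap fraction
    is above max_gap_fraction."""
    gap_counts, n_rows = None, 0
    for _, seq in records:
        if gap_counts is None:
            gap_counts = [0] * len(seq)
        elif len(seq) != len(gap_counts):
            raise ValueError("sequences are not aligned to the same length")
        for i, ch in enumerate(seq):
            if ch in GAP_CHARS:
                gap_counts[i] += 1
        n_rows += 1
    return {i for i, g in enumerate(gap_counts or []) if g / n_rows > max_gap_fraction}


def mask_columns(records, drop):
    """Second pass: yield every record with the dropped columns removed.
    No sequence (row) is discarded."""
    for header, seq in records:
        yield header, "".join(ch for i, ch in enumerate(seq) if i not in drop)


# Toy alignment; the threshold is lowered to 0.5 so the effect is visible.
toy = ">a\nAC-G\n>b\nA--G\n>c\nAT-G\n"
drop = gappy_columns(read_fasta(io.StringIO(toy)), max_gap_fraction=0.5)
masked = list(mask_columns(read_fasta(io.StringIO(toy)), drop))
```

Both passes stream the file, so memory use is bounded by one sequence plus one counter per column. For a multi-gigabyte alignment, pure Python will be slow, and a vectorized or compiled implementation would be the practical choice.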
Oh, that's my bad! It makes sense now. I will remove the gappy sites and see how it goes.
Hi Siavash, I'm getting very close to having the reference. Could you please point me to a resource for understanding the step of rooting on the lowest common ancestor of archaea? I'm honestly a bit lost here. Does this mean associating the RAxML output tree with another preexisting one? And which tools would you use to perform this? Thank you!
Do you have a file that tells you the taxonomy for all of the species in your sample? If so, we need to root at the LCA of archaea. I'd be happy to help with this step if you have that mapping file.
One issue that I faced was that archaea was actually not monophyletic in the previous version, but there are ways around that.
-- Siavash Mirarab
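Before rooting at the LCA of archaea, it helps to check whether a single branch separates archaea from everything else, since that is exactly what fails when archaea is not monophyletic. The sketch below is only an illustration under simplifying assumptions: the tree is a toy written as nested tuples rather than a parsed Newick file, the leaf names and the `domain` mapping are made up, and `find_split` is a hypothetical helper. A real run would use a tree library to load the RAxML tree and perform the actual rerooting.

```python
def leaves(node):
    """Return the set of leaf labels under a node (a leaf is a plain string)."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaves(child) for child in node))


def find_split(tree, ingroup):
    """Return the subtree whose leaves are exactly `ingroup`, or exactly the
    remaining leaves, or None when no single branch separates the two groups.

    Accepting the complement makes the check independent of where the tree
    happens to be rooted at the moment.
    """
    ingroup = frozenset(ingroup)
    outgroup = leaves(tree) - ingroup
    stack = [tree]
    while stack:
        node = stack.pop()
        tips = leaves(node)
        if tips == ingroup or tips == outgroup:
            return node
        if not isinstance(node, str):
            stack.extend(node)
    return None


# Made-up leaf names and domains, standing in for the taxonomy mapping file.
domain = {"A1": "Archaea", "A2": "Archaea", "B1": "Bacteria",
          "B2": "Bacteria", "E1": "Eukaryota"}
archaea = {name for name, d in domain.items() if d == "Archaea"}

monophyletic_tree = ((("A1", "A2"), "B1"), ("B2", "E1"))
broken_tree = (("A1", "B1"), ("A2", "B2", "E1"))

good = find_split(monophyletic_tree, archaea)
bad = find_split(broken_tree, archaea)
```

When `find_split` returns a subtree, the branch above it is the place to root. When it returns `None`, archaea is split across the tree and one of the workarounds mentioned above is needed first. The repeated `leaves` calls make this quadratic, which is fine for a toy but not for a tree with hundreds of thousands of tips.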
Hi Siavash, I think we can use this taxonomy file (https://www.arb-silva.de/fileadmin/silva_databases/release_138_1/Exports/taxonomy/taxmap_slv_ssu_ref_nr_138.1.txt.gz), which contains the accession and semicolon-separated taxonomy path for each entry. My input for the RAxML steps was the full aligned sequences from SILVA (I ended up using these), with sites that are 99.99% gaps removed, and a tree generated with FastTree. I decided to build the tree manually instead of using this one (https://www.arb-silva.de/fileadmin/silva_databases/release_138_1/Exports/taxonomy/tax_slv_ssu_138.1.tre.gz) because some accessions were associated with the same taxa, producing undesired results in the RAxML steps. The resulting tree has accessions as nodes. I think this is correct because the sepp-ref for SILVA 128 is also based on an accession tree. I will share a public folder with the results so far once the branch length step is done. I will let you know. Thank you!
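A mapping from tree labels to domains can be derived from a taxonomy file of this kind. The sketch below is an assumption-laden illustration: it assumes a tab-separated file with a header row naming the columns `primaryAccession`, `start`, `stop`, and `path`, and it assumes that tree labels have the form `accession.start.stop`. Both should be checked against the real file and tree. The sample rows are invented.

```python
import csv
import io


def domain_by_label(handle):
    """Map each 'accession.start.stop' label to the first rank of its
    semicolon-separated taxonomy path (for example 'Archaea')."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    domains = {}
    for row in reader:
        label = f"{row['primaryAccession']}.{row['start']}.{row['stop']}"
        domains[label] = row["path"].split(";")[0]
    return domains


# Invented rows in the assumed layout; the accessions are not real entries.
sample = (
    "primaryAccession\tstart\tstop\tpath\torganism_name\ttaxid\n"
    "XX000001\t1\t1450\tArchaea;Crenarchaeota;Thermoprotei;\texample one\t1\n"
    "XX000002\t10\t1510\tBacteria;Proteobacteria;\texample two\t2\n"
)
domains = domain_by_label(io.StringIO(sample))
archaea_labels = sorted(k for k, v in domains.items() if v == "Archaea")
```

The real file is gzip-compressed, so it would be opened in text mode with `gzip.open(path, "rt")` before being passed to the function.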
Hi Diego,
Sorry for my long silence on this. Was this successful? Any help needed from my side? Could I point people to the output of your work?
Thanks Siavash
-- Siavash Mirarab
Hi Siavash.
I had to stop this project for a while.
What I have so far is the tree with branch lengths from the full aligned sequences of SILVA, but with many repeated sequences. The only issue left to solve, besides rooting on archaea, is to redo it with the reduced set of sequences. There are about 40k sequences that are repeated according to RAxML (even with the original, non-trimmed dataset).
I think my server won't be busy for a while, so I can start redoing the tree over the non-repeated dataset (masked-dna-sequences-accession.fasta.reduced).
Here is a public gdrive folder with my work. Hope this helps
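For readers who want to see what a reduced dataset amounts to, here is a minimal sketch that keeps the first occurrence of every distinct aligned sequence. It is not RAxML's own duplicate detection and may not select exactly the same records: the case-insensitive comparison and the function name are assumptions made for the example.

```python
import hashlib


def drop_duplicate_sequences(records):
    """Yield (header, sequence) pairs, skipping any sequence already seen.

    Only a fixed-size digest of each sequence is stored, so memory use
    grows with the number of distinct sequences, not with their length.
    """
    seen = set()
    for header, seq in records:
        digest = hashlib.sha1(seq.upper().encode("ascii")).digest()
        if digest in seen:
            continue
        seen.add(digest)
        yield header, seq


# Toy records: "b" repeats "a" apart from letter case.
toy_records = [("a", "ACGT"), ("b", "acgt"), ("c", "AC-T")]
unique = list(drop_duplicate_sequences(toy_records))
```

Duplicates should be removed after masking, because two sequences that differ only at removed sites become identical once those columns are gone.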