qiime2 / q2-fragment-insertion

BSD 3-Clause "New" or "Revised" License

SILVA reference #21

Open sjanssen2 opened 6 years ago

sjanssen2 commented 6 years ago

Improvement Description

It should be possible to download the QIIME-compatible version of SILVA and construct a reference phylogeny and alignment for SEPP to enable 18S analyses.

Questions

  1. @josenavas @wasade do you know if release 128 is the latest?

  2. How and where would we host SEPP-compatible references? Within this plugin (which is already 130 MB), or in the GitHub repo?

josenavas commented 6 years ago

We could store the references in the FTP server.

rachaellappan commented 5 years ago

Hi @sjanssen2,

Would it be possible in the near future to also create and make available in QIIME2 a pre-compiled SILVA v132 database? I note your comment here that making the database ready for use in q2-fragment-insertion takes around 2 weeks, which is my main reason for not attempting the steps outlined here by @smirarab.

It's great that a pre-compiled SILVA v128 database comes packaged with this plugin in QIIME! I've simply already done some analysis with SILVA v132 and am on a tight schedule, so don't have the time to re-analyse with 128 - at the moment this unfortunately prevents me from using the fragment insertion method to build trees.

Cheers, Rachael

thermokarst commented 5 years ago

Hey there @rachaellappan --- we would love to get some help with this task - are you interested? If you don't have the bandwidth, maybe you could cross-post this request to the QIIME 2 Forum, that way more eyes see this? Thanks!

antgonza commented 5 years ago

Just adding to the discussion. For the GG release we did a lot of benchmarks, and basically this is what was used in the fragment insertion paper. However, AFAIK, such benchmarks have not been done for SILVA, so it would be great if someone actually ran them, in case @rachaellappan is interested.

sjanssen2 commented 5 years ago

Regarding benchmarks: there is already a lot of infrastructure in place, for example the wonderful repo https://github.com/caporaso-lab/tax-credit-data/ (which I used a couple of months ago to add SEPP as another tool to assign taxonomy) and, of course, all the notebooks I used for our paper https://msystems.asm.org/content/3/3/e00021-18

I think we should first provide the necessary changes for SEPP to deal with different references before we think too hard about benchmark results.

antgonza commented 5 years ago

I'll argue that having them at the same time would be great; as you can imagine, once it's out there, it's out there and in the case there is a bug or something wrong that wasn't caught cause there were no benchmarks, it can get ugly ... my 2 pesos!

rachaellappan commented 5 years ago

Hi @thermokarst, I will post to the QIIME2 forum. I would like to help out but I'm not very familiar with what is being done here and whether these steps are all that's required.

If I understand correctly, I agree that benchmarking SILVA (to demonstrate/confirm the improvement that fragment insertion offers over de novo trees in the case of SILVA?) would be ideal to do around the same time as providing v132 for SEPP. The SILVA aligned rep set doesn't specify whether it's 16S or 18S - does it contain both? - so the results may be different to GG.

I'm probably not the person to do this - no experience with benchmarking =)

smirarab commented 5 years ago

The file used for SILVA package is described here: https://github.com/smirarab/sepp-refs/blob/master/silva/README.md

It was called SILVA_128_QIIME_release/rep_set_aligned/99/99_otus_aligned.fasta.gz

Does anyone know if that file did or did not include 18S?

adityabandla commented 5 years ago

Replying to @thermokarst's call for help above: in case this hasn't been done yet, I would be glad to pitch in. However, I would need the scripts required to process the QIIME-formatted SILVA file (SILVA_132_QIIME_release/rep_set_aligned/99/99_alignment.fna).

adityabandla commented 5 years ago

Can anyone confirm if these modified steps would be right (taken from https://github.com/smirarab/sepp-refs/tree/master/silva)?

99_alignment.fna has 425098 sequences.

    run_seqtools.py -masksites 2125 -infile 99_alignment.fna -outfile 99_alignment_masked.fna
    nw_topology -bI 99_otus.tre > 99_otus_nice.tree
    raxmlHPC-PTHREADS -s 99_alignment_masked.fna -m GTRCAT -n scoreF-99_alignment_masked.fna -g 99_otus_nice.tree -F -T 24 -p 8956
    raxmlHPC-PTHREADS -s 99_alignment_masked.fna -m GTRCAT -n score-bl-99_alignment_masked.fna -F -f e -t RAxML_result.scoreF-99_alignment_masked.fna -T 24 -p 10625

adityabandla commented 5 years ago

Is this issue still alive?

sjanssen2 commented 5 years ago

Hi Aditya, yes, it is still current, though not very active at the moment. I am busy meeting important deadlines until mid-March. After that, this is on my to-do list, and help is extremely welcome, since I think this issue is a showstopper for many application scenarios.

adityabandla commented 5 years ago

Hi Stefan

Sure. I was wondering whether I can get started on this at my end, since it's a heavy compute job. All I would need is for someone to confirm the steps that need to be run.

Of course, I will share the files for review once done, and perhaps that would be mid-March already.

sjanssen2 commented 5 years ago

All I know about SILVA is what Siavash did to convert and prepare the data for SILVA 128: https://github.com/smirarab/sepp-refs/tree/master/silva Maybe you can deduce from that whether you are dealing with the correct files?

adityabandla commented 5 years ago

Yes, Stefan, I went through what Siavash had done and am sure I have the correct files with me. I wasn't entirely clear, though, on how the masksites parameter was chosen for the first step. That's where I need some advice, as the total number of sequences is different for v132.

Perhaps @smirarab can pitch in?
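
One way to reason about a masking threshold is to count, for every alignment column, how many sequences carry a non-gap character. The sketch below is not taken from SEPP. It assumes, without confirmation in this thread, that the -masksites value acts as a minimum number of non-gap characters per column, and it treats both '-' and '.' as gaps. All function names are hypothetical.

```python
# Assumption: both '-' and '.' count as gaps for the purpose of this sketch.
GAP_CHARS = frozenset("-.")


def read_fasta(lines):
    """Yield (identifier, sequence) pairs from an iterable of FASTA lines."""
    name = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            fields = line[1:].split()
            name = fields[0] if fields else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def column_occupancy(records):
    """Return, per alignment column, the number of non-gap characters."""
    counts = None
    for name, seq in records:
        if counts is None:
            counts = [0] * len(seq)
        elif len(seq) != len(counts):
            raise ValueError(f"{name}: length {len(seq)} differs from {len(counts)}")
        for i, ch in enumerate(seq):
            if ch not in GAP_CHARS:
                counts[i] += 1
    return counts or []


def columns_kept(counts, min_nongap):
    """Number of columns that would survive a given (hypothetical) threshold."""
    return sum(1 for c in counts if c >= min_nongap)


# Toy alignment: three sequences, five columns.
toy = [">a", "AC-G.", ">b", "A--GU", ">c", "ACC-."]
occupancy = column_occupancy(read_fasta(toy))
print(occupancy)                   # [3, 2, 1, 2, 1]
print(columns_kept(occupancy, 2))  # 3
```

For the real alignment, a file object opened on 99_alignment.fna can be passed to read_fasta in place of the toy list. Pure Python will be slow on 425,098 sequences, so this is only meant to show how candidate thresholds could be compared across releases.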

sjanssen2 commented 5 years ago

Oops, now I see that you already pointed to this link. Sorry for not paying enough attention :-/

adityabandla commented 5 years ago

Any updates on this? We are well past mid-March.

sjanssen2 commented 5 years ago

Hi Aditya,

Fair point, and sorry for the delay. I started working on SEPP itself to add the ability to change the reference in a convenient way for QIIME 2 users. This work includes a) adding SEPP to a CI system (Travis), b) updating the code style, c) adding the ability to pass info files to the SEPP binaries, and d) packaging SEPP as a bioconda recipe. I would be happy to receive code reviews at https://github.com/smirarab/sepp/pull/41 to increase visibility and quality.

I just downloaded the 3 GB QIIME-compatible version of SILVA 132 (https://www.arb-silva.de/fileadmin/silva_databases/qiime/Silva_132_release.zip). I am pretty confident that the alignment file is SILVA_132_QIIME_release/rep_set_aligned/99/99_alignment.fna.zip and that the matching phylogeny is SILVA_132_QIIME_release/trees/99/99_otus.tre. Both hold the very same 425,098 identifiers.

I figure you already know the right computational steps to perform, but I am not totally sure whether the numeric parameters will also work for the slightly larger 132 release. Guess we will learn that the hard way :-/
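
The statement that the alignment and the phylogeny hold the same identifiers can be checked with a few lines of code. The following is a minimal sketch, not the procedure used in this thread: it collects FASTA identifiers and Newick tip labels and reports what is missing on either side. The regular expression assumes unquoted tip labels without spaces or parentheses, and all function names are hypothetical.

```python
import re

# Assumption: tip labels are unquoted and directly follow '(' or ','.
# Internal node labels follow ')' and are therefore not matched.
TIP_PATTERN = re.compile(r"[(,]\s*([^(),:;\s]+)")


def fasta_ids(lines):
    """Collect the first whitespace-delimited token of every FASTA header."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids


def newick_tip_labels(newick):
    """Collect tip labels from a Newick string (simple unquoted labels only)."""
    return set(TIP_PATTERN.findall(newick))


def compare_ids(alignment_ids, tree_ids):
    """Return (only in alignment, only in tree) as sorted lists."""
    return sorted(alignment_ids - tree_ids), sorted(tree_ids - alignment_ids)


# Toy data: s3 is missing from the tree, s4 is missing from the alignment.
toy_fasta = [">s1 some description", "ACGU", ">s2", "AC-U", ">s3", "A--U"]
toy_tree = "((s1:0.1,s2:0.2)0.95:0.3,s4:0.4);"

only_in_alignment, only_in_tree = compare_ids(
    fasta_ids(toy_fasta), newick_tip_labels(toy_tree)
)
print(only_in_alignment)  # ['s3']
print(only_in_tree)       # ['s4']
```

If both lists come back empty for the real files, the two inputs describe the same set of sequences. A dedicated tree library would be the safer choice if the tree contains quoted labels.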

smirarab commented 5 years ago

Aditya,

Sorry for the long silence on this.

The steps you mentioned are mostly correct. However, in the end, you need to root the tree at the LCA of Archaea.

Hope this helps.

Regards Siavash
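
To illustrate what "the LCA of Archaea" means, here is a small sketch that finds the lowest common ancestor of a set of tips in a tree stored as a child-to-parent map. It is an illustration under stated assumptions, not the script used for the SEPP references: the toy tree, the node names and the function names are all hypothetical, and the re-rooting itself is not shown.

```python
def path_to_root(node, parent):
    """Return the list of nodes from `node` up to the root, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def lowest_common_ancestor(tips, parent):
    """Return the deepest node that is an ancestor of every tip in `tips`."""
    tips = list(tips)
    if not tips:
        raise ValueError("at least one tip is required")
    # Start with the ancestors of the first tip, ordered from tip to root,
    # and keep only those shared with every other tip.
    candidates = path_to_root(tips[0], parent)
    for tip in tips[1:]:
        ancestors = set(path_to_root(tip, parent))
        candidates = [n for n in candidates if n in ancestors]
    if not candidates:
        raise ValueError("tips do not share an ancestor")
    return candidates[0]


# Toy tree (child -> parent). arch3 sits inside the bacterial clade.
parent = {
    "n1": "root", "n2": "root",
    "arch1": "n1", "arch2": "n1",
    "bact1": "n2", "n3": "n2",
    "bact2": "n3", "arch3": "n3",
}

print(lowest_common_ancestor(["arch1", "arch2"], parent))           # n1
print(lowest_common_ancestor(["arch1", "arch2", "arch3"], parent))  # root
```

The second call shows why the choice of tips matters: a single sequence labelled as Archaea that sits elsewhere in the tree moves the LCA all the way up to the root.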

sjanssen2 commented 5 years ago

I am trying to create a bioconda recipe for Siavash's SEPP program (without the large reference files) to support, in the long run, different references such as SILVA. Currently, linting of the recipe fails, since I don't know how to properly deal with the fact that Python is in principle platform-independent, while SEPP ships pre-compiled, platform-dependent binaries. Can someone please help, maybe @thermokarst or @ebolyen?

adityabandla commented 5 years ago

Is this still being considered?

sjanssen2 commented 5 years ago

The bioconda package has been created: https://anaconda.org/bioconda/sepp (without reference files), but is not yet integrated into Qiime2.

adityabandla commented 5 years ago

Stefan, that's great to hear. Are the updated reference files for SILVA available as well?

sjanssen2 commented 5 years ago

Hi @adityabandla,

Files for SILVA 128 (phylogeny, alignment and info) are shipped with the default QIIME 2 install and should be located in $CONDA_PREFIX/share/fragment-insertion/ref (activate your conda environment first so that CONDA_PREFIX points to the right directory).
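
As a quick check of that location, the sketch below builds the path from CONDA_PREFIX and lists whatever files are present. The directory layout is the one stated above; the function names are hypothetical, and nothing is assumed about the file names inside the directory.

```python
import os
from pathlib import Path


def reference_dir(env=None):
    """Build the reference directory path from CONDA_PREFIX."""
    env = os.environ if env is None else env
    prefix = env.get("CONDA_PREFIX")
    if not prefix:
        raise RuntimeError(
            "CONDA_PREFIX is not set; activate the conda environment first"
        )
    return Path(prefix) / "share" / "fragment-insertion" / "ref"


def list_reference_files(ref_dir):
    """Return the sorted file names in ref_dir, or [] if it does not exist."""
    if not ref_dir.is_dir():
        return []
    return sorted(p.name for p in ref_dir.iterdir() if p.is_file())


if __name__ == "__main__":
    try:
        ref = reference_dir()
    except RuntimeError as err:
        print(err)
    else:
        print(ref)
        for name in list_reference_files(ref):
            print("  " + name)
```

An empty listing would suggest that the wrong environment is active or that the plugin is not installed in it.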

Did you succeed in creating a reference for SILVA 132? If so, would you be willing to share those files with me and the QIIME community?

My PR #32 contains necessary updates for the qiime2 wrapper to cope with the new parameter for the info file, but it is still not merged into master. Thus, to use other references than Greengenes 13.8 you either have to overwrite the info file each time or use the run-sepp.sh script directly.

Best, Stefan

adityabandla commented 5 years ago

Hi Stefan

Sorry, I never managed to get to it. I just started, and I ran into this error at the very first step:

Traceback (most recent call last):
  File "run_seqtools.py", line 7, in <module>
    exec(compile(f.read(), __file__, 'exec'))
  File "run_seqtools.py", line 36, in <module>
    alg.read_file_object(args.infile,args.informat)
  File "alignment.py", line 1335, in read_file_object
    for name, seq in read_func(file_obj):
  File "alignment.py", line 75, in read_fasta
    raise Exception("Error: illegal characeters in sequence at line %d" % line_number)
Exception: Error: illegal characeters in sequence at line 1

sjanssen2 commented 5 years ago

Hi @adityabandla I would need much more information about what you are trying to execute to be able to help debugging.

adityabandla commented 5 years ago

I get that error when I run the following command:

    run_seqtools.py -masksites 2125 -infile 99_alignment.fna -outfile 99_alignment_masked.fna

Please let me know if you need additional details
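
When a parser complains about illegal characters, a quick first diagnostic is to tally which characters actually occur in the sequence lines. The sketch below does only that. The EXPECTED set is an assumption made for illustration and is not the set accepted by SEPP's alignment.py, which this thread does not state. Function names are hypothetical.

```python
from collections import Counter

# Assumption for illustration only: nucleotides (DNA or RNA) plus '-' as gap.
EXPECTED = frozenset("ACGTUacgtu-")


def sequence_characters(lines):
    """Count every character that occurs on non-header FASTA lines."""
    counts = Counter()
    for raw in lines:
        line = raw.strip()
        if line and not line.startswith(">"):
            counts.update(line)
    return counts


def unexpected_characters(counts, expected=EXPECTED):
    """Return the characters (with counts) that are not in `expected`."""
    return {ch: n for ch, n in counts.items() if ch not in expected}


# Toy input with leading/trailing dots and ambiguity codes.
toy = [">x", "..AC-GU..", ">y", "NNAC-GU.."]
counts = sequence_characters(toy)
print(unexpected_characters(counts))  # {'.': 6, 'N': 2}
```

Running this on a file object opened on 99_alignment.fna would show at a glance whether dots, ambiguity codes or other symbols are present before any masking step is attempted.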

smirarab commented 5 years ago

Aditya, is there a place where I can access the 99_alignment.fna file? I can try to have a look.

adityabandla commented 5 years ago

@smirarab Siavash, it's the file I downloaded from the SILVA website (https://www.arb-silva.de/fileadmin/silva_databases/qiime/Silva_132_release.zip); the particular file is SILVA_132_QIIME_release/rep_set_aligned/99/99_alignment.fna.zip

Rkubinski commented 5 years ago

@adityabandla @smirarab is there any progress on using silva 132 ?

smirarab commented 4 years ago

I am starting to work on this. Does anyone know if unaligned sites (alignment sites with a dot) should be removed?

smirarab commented 4 years ago

I have been working on this and now have the trees. I am having trouble with rooting the tree. There are several problematic taxa, mentioned below.

Is anyone more familiar with SILVA able to advise on what's best to do here? Should we just remove these? Are they simply misclassified? Or perhaps I am using the wrong taxonomy file (SILVA_132_QIIME_release/taxonomy/taxonomy_all/99/raw_taxonomy.txt)?

zhanxw commented 4 years ago

@smirarab Your question is also related to mine: https://github.com/smirarab/sepp-refs/issues/2. In SILVA 128, the FASTA file has dots too. Do you know how to get run_seqtools.py working?

smirarab commented 4 years ago

I answered your question there. The issue here has to do with the tree topology.

ETaSky commented 4 years ago

Any updates on this issue? Thanks!

smirarab commented 4 years ago

I have the trees needed, but I have issues with rooting it, as mentioned above. I remain hopeful that someone with more familiarity with SILVA can tell me how the rooting issue should be dealt with.

jgerken commented 3 years ago

@smirarab The first sequence looks anomalous at first glance, so it might be good to exclude it. For the other sequences, I checked some of the accession numbers, and they come from genome or WGS sequence set entries. Such entries sometimes contain contamination from other domains, and I am pretty sure that is the case here. I think we should discuss how the sequences included in the tree are selected, and whether that selection can be optimised to leave these problematic sequences out. By the way, the current SILVA release is 138.1.

I am not familiar with QIIME, the fragment insertion plugin or SEPP. I think the easiest approach would be for you to send an email to our support address (contact(at)arb-silva.de) with a short summary of what data is needed, how it is compiled and which issues you have (maybe there are more than just the rooting of the trees?). With that information we will try to help you solve the issues you are facing. We would also like to host the reference files on the SILVA website and, if possible, find a way to generate them automatically with new SILVA releases.

All the best Jan from the SILVA team

smirarab commented 3 years ago

Hi Jan,

I will initiate an email.

Thanks Siavash

lisa55asil commented 2 years ago

Any update on a SILVA reference database formatted for SEPP through QIIME 2?

sjanssen2 commented 2 years ago

not that I am aware of, unfortunately