jasonsahl / LS-BSR

Large scale Blast Score Ratio (BSR) analysis

GNU General Public License v3.0

vsearch vs usearch #8

Closed brigidar closed 8 years ago

brigidar commented 8 years ago

Hi,

I tried to run a comparison of two PHAST prediction regions with prodigal and vsearch and there is nothing in the consensus file, but if I use usearch there is. vsearch is in the path and runs, but the output is empty. Are the settings different for vsearch vs usearch for the cutoff?

/home/brigida.rusconi/vsearch/bin/vsearch
LOG: 2016/01/04 17:57:37 - clustering with VSEARCH at an ID of 0.9, using 2 processors
LOG: 2016/01/04 17:57:37 - VSEARCH clustering finished

Best, Brigida

jasonsahl commented 8 years ago

Brigida,

Could you send me the command that you are running? Are you giving LS-BSR a file of genes with “-g”, but then also using a clustering method?

thanks, Jason

brigidar commented 8 years ago

Hi Jason, I am running this job script on the server cluster. I made the ls_bsr.py executable and added it to the path.

#$ -N lbsr_b26
#$ -o lsbsr-$JOB_ID.log
#$ -j y
#$ -cwd

jasonsahl commented 8 years ago

Thanks,

Are you using GenBank files or FASTA files as input?

Could you do a:

ls -la ~/PHAST/PROKKA/B26_12292015/genomes/

thanks, Jason

brigidar commented 8 years ago

I am using two FASTA files, each containing multiple regions predicted by PHAST. I ran the usearch job directly on the command line yesterday, just to check, rather than in a job script. Might that be an issue? We are running the server cluster on SGE. Here is the output:

total 1.2M
drwxr-xr-x 2 brigida.rusconi 4 Jan 4 18:05 ./
drwxr-xr-x 3 brigida.rusconi 31 Jan 4 18:05 ../
-rw-r--r-- 1 brigida.rusconi 563K Jan 4 11:34 concatb26-1.fasta
-rw-r--r-- 1 brigida.rusconi 509K Jan 4 11:34 concatb26-2.fasta

Brigida Rusconi, PhD | Postdoctoral Fellow | Department of Biology | South Texas Center for Emerging Infectious Diseases | University of Texas at San Antonio | One UTSA Circle | TX 78249 | 210-458-7846 | BSE 3.404 | brigida.rusconi@utsa.edu

jasonsahl commented 8 years ago

Brigida,

So what you are telling LS-BSR to do is to predict coding regions in each FASTA file, cluster them, then align the predicted regions back against each FASTA file in your “genomes” directory to determine the BSR. If you already have predicted regions and want to determine their distribution across a set of genomes, you could do something like “-g concatb26-1.fasta -d genome_directory”. I currently don’t have a way to cluster a set of genes provided with the “-g” flag, but it’s on my list. Please let me know if I can clarify anything else about how the method works.

regards, Jason
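For readers unfamiliar with the metric, the final step described above can be sketched as follows. This is a minimal illustration and not LS-BSR's actual code: it assumes the conventional definition of the BLAST Score Ratio (a gene's bit score against a genome divided by that gene's bit score against itself), and the function names and input dictionaries are hypothetical.

```python
def blast_score_ratio(query_bitscore, self_bitscore):
    """Return the bit score against a genome divided by the self-hit bit score."""
    if self_bitscore <= 0:
        raise ValueError("self bit score must be positive")
    return float(query_bitscore) / float(self_bitscore)


def bsr_matrix(self_scores, genome_scores):
    """Build {genome: {gene: BSR}} from raw bit scores.

    self_scores:   {gene: bit score of the gene aligned against itself}
    genome_scores: {genome: {gene: best bit score of the gene in that genome}}
    A gene with no hit in a genome is given a BSR of 0.0 (an assumption
    made for this sketch).
    """
    matrix = {}
    for genome, hits in genome_scores.items():
        matrix[genome] = {
            gene: blast_score_ratio(hits.get(gene, 0.0), reference)
            for gene, reference in self_scores.items()
        }
    return matrix
```

Under this definition, a value near 1 means the gene is present and highly conserved in that genome, while a value near 0 means it is absent or highly divergent.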

brigidar commented 8 years ago

Hi Jason, The predicted regions span multiple genes and actually cover whole phages (30-60 kb), so I thought I could treat them like contigs. I thought LS-BSR predicts the genes for each file and then clusters all of them together. Does it cluster by genome or across all predicted proteins? I want to figure out how much the phage-related mobilome differs between outbreak strains or other related infections. Since some phage proteins are very similar, I thought it would make more sense to do the de novo prediction and then cluster the genes, so that I don’t end up with a lot of identical genes that give me little information. I can also simply run it with the genes that Prokka predicted for all of the regions and then extract the variome. I was just curious to understand why it clusters with usearch but not vsearch. Brigida

jasonsahl commented 8 years ago

Brigida,

I’m sure that the two methods work slightly differently. You could also try running cd-hit to see if that works. Let me know if I can help any further.

Jason

brigidar commented 8 years ago

Hi Jason, I looked into the script and I don’t see the all_sorted.txt file being created in the vsearch method (line 181). You only create all_sorted.txt in the usearch method, but run_vsearch then reads all_sorted.txt. If I read it correctly, the input file is missing. Brigida
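The failure mode described here, a clustering step that reads an input file the calling code path never wrote and then reports that it finished, can be caught early with an explicit check. The following is a minimal sketch and not the fix that went into LS-BSR: `require_nonempty` is a hypothetical helper, and `all_sorted.txt` is simply the file name mentioned in the comment.

```python
import os


def require_nonempty(path):
    """Return path if it names an existing, non-empty file; raise otherwise."""
    if not os.path.isfile(path):
        raise FileNotFoundError("expected input file is missing: %s" % path)
    if os.path.getsize(path) == 0:
        raise ValueError("input file is empty: %s" % path)
    return path
```

Calling something like `require_nonempty("all_sorted.txt")` immediately before the clustering program is launched would turn a silently empty consensus file into an error that names the missing input.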

jasonsahl commented 8 years ago

You’re right, thanks for finding that. I’m testing the changes now and will push them to GitHub as soon as everything is working correctly. Thanks!

Jason

brigidar commented 8 years ago

Ah, good. So why do you need to split the files before clustering? Is that required only for usearch, or for any clustering method? If you split by line, might that not split up predicted genes? Best, Brigida

jasonsahl commented 8 years ago

It’s to get around the memory limitations in the free version of USEARCH. Definitely a hack, but I didn’t know how else to do it. But you’re right: I used to have a function that removed the line wraps and guaranteed that a gene would never be interrupted, but I took that function out, and now the split could cause problems. Thanks for that as well; I will look into a new workaround.

Jason
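One way to avoid interrupting a gene when splitting is to split on FASTA record boundaries rather than on raw lines. The sketch below illustrates that idea under stated assumptions; it is not the workaround that was later adopted in LS-BSR, and `read_fasta` and `split_fasta` are hypothetical names. The reader also removes line wraps, so each sequence ends up on a single line.

```python
def read_fasta(handle):
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header, parts = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(parts)
            header, parts = line, []
        elif header is not None:
            parts.append(line)
    if header is not None:
        yield header, "".join(parts)


def split_fasta(records, records_per_chunk):
    """Group records into lists of at most records_per_chunk whole records."""
    if records_per_chunk < 1:
        raise ValueError("records_per_chunk must be at least 1")
    chunk = []
    for record in records:
        chunk.append(record)
        if len(chunk) == records_per_chunk:
            yield chunk
            chunk = []
    if chunk:
        yield chunk
```

Because every chunk is built from complete (header, sequence) pairs, a chunk can be written to its own file and clustered separately without any gene being cut in half.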

On Jan 5, 2016, at 9:41 AM, brigidar notifications@github.com wrote:

ah good. So why do you need to split the files before the clustering? Is that only required for usearch or for any clustering method? If you split by the line might that not split up predicted genes? Best, Brigida

Brigida Rusconi, PhD | Postdoctoral Fellow | Department of Biology | South Texas Center for Emerging Infectious Diseases | University of Texas at San Antonio | One UTSA Circle | TX 78249 | 210-458-7846 | BSE 3.404 | brigida.rusconi@utsa.edumailto:brigida.rusconi@utsa.edu

On Jan 5, 2016, at 10:38 AM, Jason Sahl notifications@github.com<mailto:notifications@github.com> wrote:

You’re right, thanks for finding that. I’m testing the changes now and will push them up to github as soon as everything is working correctly. Thanks!

Jason

On Jan 5, 2016, at 9:28 AM, brigidar notifications@github.com<mailto:notifications@github.com> wrote:

Hi Jason, I looked into the script and I don’t see the all_sorted.txt file created in the vsearch method (line 181). You only make the all_sorted.txt in the usearch but then in the run_vsearch you call the all_sorted.txt. I think if I read it correctly the input file is missing. Brigida

Brigida Rusconi, PhD | Postdoctoral Fellow | Department of Biology | South Texas Center for Emerging Infectious Diseases | University of Texas at San Antonio | One UTSA Circle | TX 78249 | 210-458-7846 | BSE 3.404 | brigida.rusconi@utsa.edumailto:brigida.rusconi@utsa.edumailto:brigida.rusconi@utsa.edu

On Jan 5, 2016, at 9:57 AM, Jason Sahl notifications@github.com<mailto:notifications@github.commailto:notifications@github.com> wrote:

Brigida,

I’m sure that the two methods work slightly different. You could also try to run cd-hit to see if that works. Let me know if I can help any further.

Jason

On Jan 5, 2016, at 8:53 AM, brigidar notifications@github.com<mailto:notifications@github.commailto:notifications@github.com> wrote:

Hi Jason, The regions predicted span multiple genes and actually predict whole phages (30-60kb). I thought I could consider them like contigs. I thought it predicts the genes for each file and then cluster all of them together. Does it cluster by genome or all predicted proteins? I want to figure out how much the phage related mobilome differs between outbreak strains or other related infections. Since some of the proteins in phages are very similar I thought it would make more sense to do the de novo predicition and then cluster them so that I don’t have a lot of genes that are identical, but do not give me much information. I can also simply run it with the predicted genes that I got from prokka for all of the regions and then extract the variome. Was just curious to understand why it clusters with usearch, but not vsearch. Brigida

On Jan 5, 2016, at 9:45 AM, Jason Sahl notifications@github.com wrote:

Brigida,

So what you are telling LS-BSR to do is to predict coding regions in each FASTA file, cluster them, then align the predicted regions back against each FASTA file in your “genomes” directory to determine the BSR. If you have predicted regions and want to determine their distribution across a set of genomes, you could do something like “-g concatb26-1.fasta -d genome_directory”. I currently don’t have a way to cluster a set of genes provided with the “-g” flag, but it’s on my list. Please let me know if I can clarify anything else about how the method works.

regards, Jason
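For readers unfamiliar with the final step described above: a Blast Score Ratio is the bit score of a gene aligned against a genome, divided by the bit score of that gene aligned against itself, so values fall between 0 (absent) and 1 (identical). The sketch below only illustrates that arithmetic; it is not taken from LS-BSR. The function names, the dictionary layout, and the scores are hypothetical toy values.

```python
def blast_score_ratio(query_bits, reference_bits):
    """Query bit score divided by the gene's self-alignment bit score."""
    if reference_bits <= 0:
        raise ValueError("reference bit score must be positive")
    return query_bits / reference_bits


def bsr_matrix(self_scores, genome_scores):
    """Build {gene: {genome: BSR}} from bit scores.

    self_scores:   {gene: bit score of the gene aligned against itself}
    genome_scores: {genome: {gene: best bit score in that genome}}
    A gene with no hit in a genome is scored 0.0.
    """
    matrix = {}
    for gene, reference_bits in self_scores.items():
        matrix[gene] = {
            genome: round(blast_score_ratio(hits.get(gene, 0.0), reference_bits), 3)
            for genome, hits in genome_scores.items()
        }
    return matrix


# Toy values for illustration only.
self_scores = {"gene1": 200.0}
genome_scores = {
    "genomeA": {"gene1": 200.0},  # identical copy
    "genomeB": {"gene1": 100.0},  # divergent copy
    "genomeC": {},                # no hit
}
print(bsr_matrix(self_scores, genome_scores))
# {'gene1': {'genomeA': 1.0, 'genomeB': 0.5, 'genomeC': 0.0}}
```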

On Jan 5, 2016, at 8:39 AM, brigidar notifications@github.com wrote:

I am using two FASTA files that each contain multiple regions predicted by PHAST. The one I ran with usearch I ran directly on the command line yesterday, just to check, and not in a job script. Might that be an issue? We are running the server cluster on SGE. Here is the output:

total 1.2M
drwxr-xr-x 2 brigida.rusconi 4 Jan 4 18:05 ./
drwxr-xr-x 3 brigida.rusconi 31 Jan 4 18:05 ../
-rw-r--r-- 1 brigida.rusconi 563K Jan 4 11:34 concatb26-1.fasta
-rw-r--r-- 1 brigida.rusconi 509K Jan 4 11:34 concatb26-2.fasta

On Jan 5, 2016, at 9:36 AM, Jason Sahl notifications@github.com wrote:

Thanks,

Are you using genbank files as input or FASTA?

Could you do a:

ls -la ~/PHAST/PROKKA/B26_12292015/genomes/

thanks, Jason

On Jan 5, 2016, at 8:34 AM, brigidar notifications@github.com wrote:

~/PHAST/PROKKA/B26_12292015/genomes/

jasonsahl commented 8 years ago

These problems should now be fixed. Please let me know if you see anything else that doesn't look correct.