gjospin / PhyloSift

Phylogenetic and taxonomic analysis for genomes and metagenomes

error with phylosift search #501

Open ucassee opened 5 years ago

ucassee commented 5 years ago

When I run phylosift search -help, it fails with the following error:

NCBI taxonomy data not found and unable to connect to update server, please check your phylosift configuration and internet connection! at /home/zhouyl/software/phylosift_v1.0.1/bin/../lib/Phylosift/Phylosift.pm line 154.

I downloaded markers.tgz and ncbi.tgz. Where should I put them?

Thanks in advance

gjospin commented 5 years ago

You can specify a custom directory in the phylosiftrc file. Make sure you remove the # at the beginning of each line that you use.

By default, PS (PhyloSift) looks for its data in: <$HOME>/share/phylosift

ucassee commented 5 years ago

@gjospin Thanks for your reply. PhyloSift is installed at /home/zhouyl/software/phylosift_v1.0.1/bin/. I moved ncbi.tgz to /home/zhouyl/software/phylosift_v1.0.1/, but it doesn't work. It still fails with this error:

NCBI taxonomy data not found and unable to connect to update server, please check your phylosift configuration and internet connection!

gjospin commented 5 years ago

You need your paths to look like:

/home/zhouyl/share/phylosift/ncbi
/home/zhouyl/share/phylosift/markers
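
One way to get to that layout from the downloaded archives is sketched below. It assumes, without confirmation from this thread, that markers.tgz and ncbi.tgz each unpack into a top-level directory named markers and ncbi; unpack_phylosift_data is a hypothetical helper name, not something PhyloSift ships.

```shell
# Unpack the PhyloSift data archives into a data directory.
# Assumption (not confirmed in this thread): markers.tgz and ncbi.tgz each
# unpack into a top-level directory named "markers" and "ncbi".
# unpack_phylosift_data is a hypothetical helper, not part of PhyloSift.
unpack_phylosift_data() {
    src_dir=$1                                # where the .tgz files were downloaded
    data_dir=${2:-$HOME/share/phylosift}      # default location mentioned in this thread
    mkdir -p "$data_dir" || return 1
    for name in markers ncbi; do
        tar -xzf "$src_dir/$name.tgz" -C "$data_dir" || return 1
        # Fail loudly if the archive did not produce the expected directory.
        if [ ! -d "$data_dir/$name" ]; then
            echo "expected $data_dir/$name after unpacking $name.tgz" >&2
            return 1
        fi
    done
}

# Example call (the download directory is a placeholder):
#   unpack_phylosift_data "$HOME/downloads"
```

If the archives turn out to unpack under different directory names, the check inside the loop reports it instead of leaving a silently wrong layout.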

ucassee commented 5 years ago

@gjospin Thanks for your patience~ If I want to change the location of these two databases, can I modify the code in /home/zhouyl/software/phylosift_v1.0.1/bin/../lib/Phylosift/Phylosift.pm? At line 154 I only see Phylosift::Utilities::data_checks( self => $self ).

gjospin commented 5 years ago

Edit /home/zhouyl/software/phylosift_v1.0.1/phylosiftrc

Find the line

$marker_dir="";

and change it to

$marker_dir="/home/zhouyl/software/phylosift_v1.0.1/markers";

Then find the line that sets

$ncbi_dir

and change it to

$ncbi_dir="/home/zhouyl/software/phylosift_v1.0.1/ncbi";
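
To double-check the edit, a small sketch like the following can read the two settings back and confirm the directories exist. It assumes both settings are written as uncommented lines of the form $name="path"; as shown in this comment. rc_dir_setting and check_phylosiftrc are hypothetical helper names, not PhyloSift commands.

```shell
# Print the directory assigned to a setting in a phylosiftrc-style file.
# Only uncommented lines of the form  $name="path";  are matched, so a line
# that still starts with "#" yields nothing.
# rc_dir_setting is a hypothetical helper, not part of PhyloSift.
rc_dir_setting() {
    rc_file=$1
    name=$2        # e.g. marker_dir or ncbi_dir
    sed -n 's/^[[:space:]]*\$'"$name"'[[:space:]]*=[[:space:]]*"\([^"]*\)".*/\1/p' "$rc_file" | tail -n 1
}

# Report whether each configured directory is set and actually exists.
check_phylosiftrc() {
    rc_file=$1
    status=0
    for setting in marker_dir ncbi_dir; do
        dir=$(rc_dir_setting "$rc_file" "$setting")
        if [ -n "$dir" ] && [ -d "$dir" ]; then
            echo "$setting ok: $dir"
        else
            echo "$setting missing or unset: '$dir'" >&2
            status=1
        fi
    done
    return $status
}

# Example call:
#   check_phylosiftrc /home/zhouyl/software/phylosift_v1.0.1/phylosiftrc
```

A setting that is still commented out or still empty shows up as "missing or unset", which matches the symptom of PhyloSift falling back to its default location.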

ucassee commented 5 years ago

When I use phylosift to search for conserved proteins, I find too many copies of the same protein in one of my genomes.

-rw-r--r-- 1 zhouyl microbial 1515 Jul 12 12:31 DNGNGWU00013.lastal.candidate.aa.1.10
-rw-r--r-- 1 zhouyl microbial 1440 Jul 12 12:31 DNGNGWU00013.lastal.candidate.aa.1.11
-rw-r--r-- 1 zhouyl microbial 1151 Jul 12 12:31 DNGNGWU00013.lastal.candidate.aa.1.14
-rw-r--r-- 1 zhouyl microbial 377 Jul 12 12:31 DNGNGWU00013.lastal.candidate.aa.1.15
-rw-r--r-- 1 zhouyl microbial 761 Jul 12 12:31 DNGNGWU00013.lastal.candidate.aa.1.16
-rw-r--r-- 1 zhouyl microbial 750 Jul 12 12:31 DNGNGWU00013.lastal.candidate.aa.1.19
-rw-r--r-- 1 zhouyl microbial 771 Jul 12 12:31 DNGNGWU00013.lastal.candidate.aa.1.2
-rw-r--r-- 1 zhouyl microbial 389 Jul 12 12:31 DNGNGWU00013.lastal.candidate.aa.1.20
-rw-r--r-- 1 zhouyl microbial 379 Jul 12 12:31 DNGNGWU00013.lastal.candidate.aa.1.3
-rw-r--r-- 1 zhouyl microbial 1138 Jul 12 12:31 DNGNGWU00013.lastal.candidate.aa.1.4
-rw-r--r-- 1 zhouyl microbial 1112 Jul 12 12:31 DNGNGWU00013.lastal.candidate.aa.1.5
-rw-r--r-- 1 zhouyl microbial 373 Jul 12 12:31 DNGNGWU00013.lastal.candidate.aa.1.6
-rw-r--r-- 1 zhouyl microbial 773 Jul 12 12:31 DNGNGWU00013.lastal.candidate.aa.1.7
-rw-r--r-- 1 zhouyl microbial 413 Jul 12 12:31 DNGNGWU00013.lastal.candidate.aa.1.8
-rw-r--r-- 1 zhouyl microbial 1159 Jul 12 12:31 DNGNGWU00013.lastal.candidate.aa.1.9

But when I BLAST the protein sequences in these files against NCBI, no significant similarity is found. I am confused. Could you help me?

ucassee commented 5 years ago

This is what I see in marker_summary.txt:

DNGNGWU00010 1
DNGNGWU00011 1
DNGNGWU00012 1
DNGNGWU00013 35
DNGNGWU00014 1
DNGNGWU00015 1
DNGNGWU00016 1

I think it is unlikely that the genome has 35 DNGNGWU00013 proteins. Combined with the lack of hits at NCBI, I wonder whether something went wrong in the search process.
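
For spotting such outliers across many genomes, a rough sketch along these lines may help. It assumes each record in marker_summary.txt is a marker name followed by an integer count, one record per line; any lines that do not fit that shape are skipped. multi_hit_markers is a hypothetical helper name.

```shell
# List markers whose hit count in a marker_summary.txt exceeds a limit.
# Assumption: each record is "<marker name> <integer count>", separated by
# whitespace, one record per line. Lines of any other shape are ignored.
# multi_hit_markers is a hypothetical helper, not part of PhyloSift.
multi_hit_markers() {
    summary_file=$1
    max_copies=${2:-1}     # counts above this value are reported
    awk -v max="$max_copies" '
        NF >= 2 && $2 ~ /^[0-9]+$/ && $2 + 0 > max + 0 { print $1, $2 }
    ' "$summary_file"
}

# Example call:
#   multi_hit_markers marker_summary.txt
```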

ucassee commented 5 years ago

I have another question. I BLASTed the sequences in the *candidate.ffn* file against the original genome sequences. Compared with the original genome sequences, the sequences in the *candidate.ffn* file contain a few gaps. If the ffn was extracted from the original genome, why do these gaps exist?

gjospin commented 5 years ago

The search step is very permissive in what it matches. Its purpose is to reduce the complexity of the alignment step so that it doesn't take too long. I would perform the align step and see what sticks after that. Also keep in mind that PS was developed with short reads in mind. You may want to adjust thresholds in the phylosiftrc file in the same way you modified the paths yesterday. The thresholds might need to be adjusted depending on the length of the marker you are looking at.

How big are the gaps? Could a frameshift be happening? The hits are made in protein space.

I hope this helps.

ucassee commented 5 years ago

The gaps are not too big, just 3, 6, or 9 bp in my sequences. Most of the predicted DNGNGWU proteins find homologous proteins when BLASTed against NCBI, but DNGNGWU00013, which I mentioned before, does not. So are you suggesting that I will get more positive results if I raise the threshold? I can't find the threshold setting in Phylosift.pm. Could you help me?

gjospin commented 4 years ago

No, I was suggesting that you increase the threshold stringency to remove the incorrect matches, which should score lower. So if a default score is 150 in the phylosiftrc file, you would want to increase that to filter out false positives. You could also enforce a minimum number of aligned bases.

We have seen clades that lack certain markers. It's possible marker 13 doesn't work well for your bug of interest. If the matches aren't good enough, then the marker 13 region of the concatenated alignment will be all gaps. I would ignore it if you aren't happy with what comes out of it.

There is a way to give PS a list of markers (--custom flag), one per line, so that it only looks at matches for the markers in the list. You could give it the list of the 37 markers minus DNGNGWU00013.
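
A sketch of building such a list, under the assumption that the marker directory holds one subdirectory per marker named after the marker ID (DNGNGWU00001 and so on). write_marker_list is a hypothetical helper name, and the exact way the file is passed to --custom should be checked against PhyloSift's own help output.

```shell
# Write a marker list, one name per line, leaving out one unwanted marker.
# Assumption: the marker directory holds one subdirectory per marker, named
# after the marker ID. write_marker_list is a hypothetical helper, not part
# of PhyloSift.
write_marker_list() {
    marker_dir=$1
    exclude=$2             # marker to drop, e.g. DNGNGWU00013
    out_file=$3
    for path in "$marker_dir"/DNGNGWU*; do
        # Skip plain files and the unexpanded pattern when nothing matches.
        if [ ! -d "$path" ]; then
            continue
        fi
        name=${path##*/}   # strip the leading directory part
        if [ "$name" != "$exclude" ]; then
            echo "$name"
        fi
    done > "$out_file"
}

# Example call (paths are placeholders):
#   write_marker_list "$HOME/share/phylosift/markers" DNGNGWU00013 custom_markers.txt
```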

ucassee commented 4 years ago

Hi @gjospin, thanks for your help. The phylosiftrc file has a lot of parameters, and I am not sure which one I should modify to increase the threshold. Is it one of the following?

MarkerAlign default parameters

$min_aligned_residues=50;

$rna_split_size = 500; #sequences longer than this value will undergo the long sequence pipeline

$gap_character = "-";

gjospin commented 4 years ago

I would target $min_aligned_residues and raise it closer to the length of the gene(s) you are interested in. You can find the marker lengths in the database's HMM files (grep 'LEN' DNGNG*/*.hmm, for example; our system is down right now, so I can't check the exact syntax). Also keep in mind this is in AA space, so 50 represents 150 nucleotides.
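
As a sketch of that lookup: HMMER-format profile files carry a LENG line in their header giving the model length in residues, so the lengths and their nucleotide equivalents (3 nucleotides per amino acid) can be listed as below. The per-marker DNGNG* subdirectory layout is an assumption taken from the grep pattern above, and marker_lengths is a hypothetical helper name.

```shell
# Print each marker HMM's model length in residues and the equivalent
# nucleotide span (3 nucleotides per amino acid).
# Assumptions: the HMMs are HMMER-format text files whose header contains a
# "LENG <n>" line, and they live in per-marker DNGNG* subdirectories.
# marker_lengths is a hypothetical helper, not part of PhyloSift.
marker_lengths() {
    marker_dir=$1
    for hmm in "$marker_dir"/DNGNG*/*.hmm; do
        # Skip the unexpanded pattern when nothing matches.
        if [ ! -f "$hmm" ]; then
            continue
        fi
        # Columns: file, length in residues, length in nucleotides.
        awk '$1 == "LENG" { print FILENAME, $2, $2 * 3; exit }' "$hmm"
    done
}

# Example call (path is a placeholder):
#   marker_lengths "$HOME/share/phylosift/markers"
```

Comparing the residue column against $min_aligned_residues shows how far the default of 50 is from the full length of each marker.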
