qiyunlab / HGTector

HGTector2: Genome-wide prediction of horizontal gene transfer based on distribution of sequence homology patterns.
BSD 3-Clause "New" or "Revised" License
119 stars 34 forks source link

Summarize HGT Events by gene orthology spawns runaway perl processes #8

Closed gpharhay closed 3 years ago

gpharhay commented 8 years ago

Hello: Let me first say that I am impressed with the HGTector pipeline. I fired most of it up without any problems -- BLAST works well as does the integration with R and perl . The problem that I am running into is when I choose to summarize HGT events by orthology:

Summarize HGT events by gene orthology

byOrthology=1

All pairRules that I tried failed in the same manner, namely the pipeline fires of multiple perl jobs (probably in the 10's of thousands, very likely much more), consumes all available RAM (in my case 128 GB) and then expands into swap space. The jobs never finish. When I control-C out, there is voluminous bash output ending with

Error: Identification of orthology failed. Error: Identification of orthology failed. Error: Identification of orthology failed. Error: Identification of orthology failed. Execution of analyzer.pl failed.

I tried to run with different tree builders (fasttree,mafft)

I'm attaching a config file that I've run on input files that I've generated in previous steps (BLAST, Taxonomy, and etc already done). Maybe I'm setting things up wrong. Do you have any suggestions? Thanks, Greg

config.txt

qiyunzhu commented 8 years ago

Hello Greg,

Thanks for your interest in our program! I think your problem is not due to byOrthology=1, because that step is executed in reporter.pl, instead of analyzer.pl. I looked at the config.txt you provided. I guess that the problem may have something to do with httpBlast=1, which, instructs the program to send BLAST requests directly to the NCBI server and retrieve the results. Since NCBI is updating itself all the time, there may be some new conditions that I didn't consider when designing the program, resulting in errors. Therefore, I would suggest that you run a local BLAST (or more preferrably, Diamond). Plus, you didn't specify the selfTax parameter, which may cause some issues.

At this moment, let's assume that the BLAST results are fine. Can you try do turn of byOrthology and see if you can get reasonable results? If not, probably we need to debug the BLAST step.

Best, Qiyun

On Sat, Sep 3, 2016 at 11:52 AM, gpharhay notifications@github.com wrote:

Hello: Let me first say that I am impressed with the HGTector pipeline. I fired most of it up without any problems -- BLAST works well as does the integration with R and perl . The problem that I am running into is when I choose to summarize HGT events by orthology:

Summarize HGT events by gene orthology

byOrthology=1

All pairRules that I tried failed in the same manner, namely the pipeline fires of multiple perl jobs (probably in the 10's of thousands, very likely much more), consumes all available RAM (in my case 128 GB) and then expands into swap space. The jobs never finish. When I control-C out, there is voluminous bash output ending with

Error: Identification of orthology failed. Error: Identification of orthology failed. Error: Identification of orthology failed. Error: Identification of orthology failed. Execution of analyzer.pl failed.

I tried to run with different tree builders (fasttree,mafft)

I'm attaching a config file that I've run on input files that I've generated in previous steps (BLAST, Taxonomy, and etc already done). Maybe I'm setting things up wrong. Do you have any suggestions? Thanks, Greg

config.txt https://github.com/DittmarLab/HGTector/files/453540/config.txt

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/DittmarLab/HGTector/issues/8, or mute the thread https://github.com/notifications/unsubscribe-auth/AEMVN5sBKTYRPRNe7_7u3cJVNfByTLcFks5qmcH6gaJpZM4J0XqL .

gpharhay commented 8 years ago

Hi Qiyun: If I turn byOrthology off, the pipeline works fine. I'm working on a data set where the BLAST was precomputed using the the attached config.txt. Sorry about using two different config.txt files. The pipline completed without errors, it is just when I set byOrthology=1 is when the run away perl jobs fire off. config.txt

qiyunzhu commented 8 years ago

Hello Greg, Here are a few things you can check: 1) How many input genomes did you feed the program? Orthology detection only works when there are multiple input genomes (by the definition of "orthology"...)

2) The program finishes without error does not necessarily assure that there wasn't problems. Can you please take a look at the prediction results to see if they make sense. For example, if the program told you that gene X, Y, and Z are horizontally acquired. If you eyeball these genes (anyhow), do they look like HGT-derived?

3) If the two steps above didn't sort out any problems, well, let's assume that the Perl script orthologer.pl has bugs... You can try to use an existing ortholog scheme generated by an external program (such as OrthoMCL) to feed HGTector (byOrthology=1 and smOrthology=/path/to/orthomcl.output).

Best, Qiyun

gpharhay commented 8 years ago

Hi Qiyun:

Thanks for the suggestions. Here are the answers to your questions.

1) I originally started with 16 genomes, and narrowed that down to 3 genomes for troubleshooting purposes. I am currently testing on 3 genomes. Is that enough?

2) I did a very cursory check and the HGT genes looked reasonable. I’ll do careful check tomorrow and get back to you.

3) Hum, I’ll check into OrthoMCL tomorrow. Will I need to re-BLAST everything for OrthoMCL.

I have a putative list of orthologs (reciprocal best blast hit pairs). Could you point me to a link or provide instructions how to format my pre-existing list of orthologs?

Best, Greg

From: Qiyun Zhu notifications@github.com Reply-To: DittmarLab/HGTector reply@reply.github.com Date: Tuesday, September 6, 2016 at 3:56 PM To: DittmarLab/HGTector HGTector@noreply.github.com Cc: Gregory Harhay Gregory.Harhay@ARS.USDA.GOV, Author author@noreply.github.com Subject: Re: [DittmarLab/HGTector] Summarize HGT Events by gene orthology spawns runaway perl processes (#8)

Hello Greg, Here are a few things you can check: 1) How many input genomes did you feed the program? Orthology detection only works when there are multiple input genomes (by the definition of "orthology"...)

2) The program finishes without error does not necessarily assure that there wasn't problems. Can you please take a look at the prediction results to see if they make sense. For example, if the program told you that gene X, Y, and Z are horizontally acquired. If you eyeball these genes (anyhow), do they look like HGT-derived?

3) If the two steps above didn't sort out any problems, well, let's assume that the Perl script orthologer.pl has bugs... You can try to use an existing ortholog scheme generated by an external program (such as OrthoMCL) to feed HGTector (byOrthology=1 and smOrthology=/path/to/orthomcl.output).

Best, Qiyun

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://github.com/DittmarLab/HGTector/issues/8#issuecomment-245088817, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AFIB3TPaPBArL_GJ9Y4sstG9ZIxY713Dks5qndObgaJpZM4J0XqL.

This electronic message contains information generated by the USDA solely for the intended recipients. Any unauthorized interception of this message or the use or disclosure of the information it contains may violate the law and subject the violator to civil or criminal penalties. If you believe you have received this message in error, please notify the sender and delete the email immediately.

qiyunzhu commented 8 years ago

Hello Greg,

Glad to hear that the prediction results sounded reasonable at your first glance. Three genome is sufficient (just meets minimal requirement) for orthology identification. OrthoMCL will perform BLAST automatically. It won't take much time, as the BLAST will be merely against each other (instead of against the entire database). If you already have a list of orthologs, you can format it like this:

ID1ProductProtein1Protein2Protein3...

For example:

1|hypothetical protein Rsl_1 YP_001492846/YP_001494105/NP_359638/AEV91698/YP_246017/YP_002844732/YP_001498897 2|thioredoxin YP_001492847/YP_246018/NP_359639/YP_002844733/YP_001494106/YP_001498898/AEV91699 3|O-antigen export system ATP-binding protein RfbE YP_001492848/YP_246019/YP_001498899/YP_001494107/YP_002844734/NP_359640/AEV91700 4|O-antigen export system permease RfbA YP_001492849/YP_246020/YP_002844735/AEV91701/NP_359641/YP_001494108/YP_001498900 5|NADPH-dependent glutamate synthase subunit beta-like oxidoreductase YP_001492850/YP_246021/YP_002844736/YP_001498901/YP_001494109/AEV91702/NP_359642 6|UDP-N-acetylglucosamine acyltransferase YP_001492851/YP_246022/YP_002844739/AEV91704/YP_001494111/YP_001498902/NP_359645 7|(3R)-hydroxymyristoyl-ACP dehydratase YP_001492852 8|hypothetical protein A1C_00040 YP_001492853 9|(3R)-hydroxymyristoyl-ACP dehydratase YP_001492854/YP_246023/YP_001498903/YP_001494112/AEV91705/NP_359646/YP_002844740 10|UDP-3-O-[3-hydroxymyristoyl] glucosamine N-acyltransferase YP_001492855/YP_246024/AEV91706/YP_002844741/NP_359647/YP_001494113/YP_001498904

Then save it as $wkDir/taxonomy/orthology.db. Then just turn on the byOrthology parameter and go.

Best, Qiyun

On Tue, Sep 6, 2016 at 2:16 PM, gpharhay notifications@github.com wrote:

Hi Qiyun:

Thanks for the suggestions. Here are the answers to your questions.

1) I originally started with 16 genomes, and narrowed that down to 3 genomes for troubleshooting purposes. I am currently testing on 3 genomes. Is that enough?

2) I did a very cursory check and the HGT genes looked reasonable. I’ll do careful check tomorrow and get back to you.

3) Hum, I’ll check into OrthoMCL tomorrow. Will I need to re-BLAST everything for OrthoMCL.

I have a putative list of orthologs (reciprocal best blast hit pairs). Could you point me to a link or provide instructions how to format my pre-existing list of orthologs?

Best, Greg

From: Qiyun Zhu notifications@github.com Reply-To: DittmarLab/HGTector reply@reply.github.com Date: Tuesday, September 6, 2016 at 3:56 PM To: DittmarLab/HGTector HGTector@noreply.github.com Cc: Gregory Harhay Gregory.Harhay@ARS.USDA.GOV, Author < author@noreply.github.com> Subject: Re: [DittmarLab/HGTector] Summarize HGT Events by gene orthology spawns runaway perl processes (#8)

Hello Greg, Here are a few things you can check: 1) How many input genomes did you feed the program? Orthology detection only works when there are multiple input genomes (by the definition of "orthology"...)

2) The program finishes without error does not necessarily assure that there wasn't problems. Can you please take a look at the prediction results to see if they make sense. For example, if the program told you that gene X, Y, and Z are horizontally acquired. If you eyeball these genes (anyhow), do they look like HGT-derived?

3) If the two steps above didn't sort out any problems, well, let's assume that the Perl script orthologer.pl has bugs... You can try to use an existing ortholog scheme generated by an external program (such as OrthoMCL) to feed HGTector (byOrthology=1 and smOrthology=/path/to/orthomcl.output).

Best, Qiyun

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://github.com/ DittmarLab/HGTector/issues/8#issuecomment-245088817, or mute the thread< https://github.com/notifications/unsubscribe-auth/AFIB3TPaPBArL_ GJ9Y4sstG9ZIxY713Dks5qndObgaJpZM4J0XqL>.

This electronic message contains information generated by the USDA solely for the intended recipients. Any unauthorized interception of this message or the use or disclosure of the information it contains may violate the law and subject the violator to civil or criminal penalties. If you believe you have received this message in error, please notify the sender and delete the email immediately.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/DittmarLab/HGTector/issues/8#issuecomment-245094340, or mute the thread https://github.com/notifications/unsubscribe-auth/AEMVNyawJniywL-jrOop4UgkFDOpwu6cks5qndgcgaJpZM4J0XqL .

gpharhay commented 8 years ago

Hi Qiyun:

Thanks for the ortholog formatting info. What about the Blast2GO output? What BLAST2GO output will your pipeline accept at input? Thanks.

Best, Greg

From: Qiyun Zhu notifications@github.com Reply-To: DittmarLab/HGTector reply@reply.github.com Date: Wednesday, September 7, 2016 at 1:54 AM To: DittmarLab/HGTector HGTector@noreply.github.com Cc: Gregory Harhay Gregory.Harhay@ARS.USDA.GOV, Author author@noreply.github.com Subject: Re: [DittmarLab/HGTector] Summarize HGT Events by gene orthology spawns runaway perl processes (#8)

Hello Greg,

Glad to hear that the prediction results sounded reasonable at your first glance. Three genome is sufficient (just meets minimal requirement) for orthology identification. OrthoMCL will perform BLAST automatically. It won't take much time, as the BLAST will be merely against each other (instead of against the entire database). If you already have a list of orthologs, you can format it like this:

ID1ProductProtein1Protein2Protein3...

For example:

1|hypothetical protein Rsl_1 YP_001492846/YP_001494105/NP_359638/AEV91698/YP_246017/YP_002844732/YP_001498897 2|thioredoxin YP_001492847/YP_246018/NP_359639/YP_002844733/YP_001494106/YP_001498898/AEV91699 3|O-antigen export system ATP-binding protein RfbE YP_001492848/YP_246019/YP_001498899/YP_001494107/YP_002844734/NP_359640/AEV91700 4|O-antigen export system permease RfbA YP_001492849/YP_246020/YP_002844735/AEV91701/NP_359641/YP_001494108/YP_001498900 5|NADPH-dependent glutamate synthase subunit beta-like oxidoreductase YP_001492850/YP_246021/YP_002844736/YP_001498901/YP_001494109/AEV91702/NP_359642 6|UDP-N-acetylglucosamine acyltransferase YP_001492851/YP_246022/YP_002844739/AEV91704/YP_001494111/YP_001498902/NP_359645 7|(3R)-hydroxymyristoyl-ACP dehydratase YP_001492852 8|hypothetical protein A1C_00040 YP_001492853 9|(3R)-hydroxymyristoyl-ACP dehydratase YP_001492854/YP_246023/YP_001498903/YP_001494112/AEV91705/NP_359646/YP_002844740 10|UDP-3-O-[3-hydroxymyristoyl] glucosamine N-acyltransferase YP_001492855/YP_246024/AEV91706/YP_002844741/NP_359647/YP_001494113/YP_001498904

Then save it as $wkDir/taxonomy/orthology.db. Then just turn on the byOrthology parameter and go.

Best, Qiyun

On Tue, Sep 6, 2016 at 2:16 PM, gpharhay notifications@github.com wrote:

Hi Qiyun:

Thanks for the suggestions. Here are the answers to your questions.

1) I originally started with 16 genomes, and narrowed that down to 3 genomes for troubleshooting purposes. I am currently testing on 3 genomes. Is that enough?

2) I did a very cursory check and the HGT genes looked reasonable. I’ll do careful check tomorrow and get back to you.

3) Hum, I’ll check into OrthoMCL tomorrow. Will I need to re-BLAST everything for OrthoMCL.

I have a putative list of orthologs (reciprocal best blast hit pairs). Could you point me to a link or provide instructions how to format my pre-existing list of orthologs?

Best, Greg

From: Qiyun Zhu notifications@github.com Reply-To: DittmarLab/HGTector reply@reply.github.com Date: Tuesday, September 6, 2016 at 3:56 PM To: DittmarLab/HGTector HGTector@noreply.github.com Cc: Gregory Harhay Gregory.Harhay@ARS.USDA.GOV, Author < author@noreply.github.com> Subject: Re: [DittmarLab/HGTector] Summarize HGT Events by gene orthology spawns runaway perl processes (#8)

Hello Greg, Here are a few things you can check: 1) How many input genomes did you feed the program? Orthology detection only works when there are multiple input genomes (by the definition of "orthology"...)

2) The program finishes without error does not necessarily assure that there wasn't problems. Can you please take a look at the prediction results to see if they make sense. For example, if the program told you that gene X, Y, and Z are horizontally acquired. If you eyeball these genes (anyhow), do they look like HGT-derived?

3) If the two steps above didn't sort out any problems, well, let's assume that the Perl script orthologer.pl has bugs... You can try to use an existing ortholog scheme generated by an external program (such as OrthoMCL) to feed HGTector (byOrthology=1 and smOrthology=/path/to/orthomcl.output).

Best, Qiyun

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://github.com/ DittmarLab/HGTector/issues/8#issuecomment-245088817, or mute the thread< https://github.com/notifications/unsubscribe-auth/AFIB3TPaPBArL_ GJ9Y4sstG9ZIxY713Dks5qndObgaJpZM4J0XqL>.

This electronic message contains information generated by the USDA solely for the intended recipients. Any unauthorized interception of this message or the use or disclosure of the information it contains may violate the law and subject the violator to civil or criminal penalties. If you believe you have received this message in error, please notify the sender and delete the email immediately.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/DittmarLab/HGTector/issues/8#issuecomment-245094340, or mute the thread https://github.com/notifications/unsubscribe-auth/AEMVNyawJniywL-jrOop4UgkFDOpwu6cks5qndgcgaJpZM4J0XqL .

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://github.com/DittmarLab/HGTector/issues/8#issuecomment-245191758, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AFIB3RL81U3pPjtghy47VvzYT9IpMJrdks5qnl-ugaJpZM4J0XqL.

gpharhay commented 8 years ago

Hi Qiyun: I am happy to report that I was able to reformat my list of RBBH (reciprocal best BLAST hits) as you specified to create an ortholog.db file and fire off the pipeline without error. This list was comprised of 16 genomes. I looked at some of the orthologs that were called HGT, and some appeared reasonable, others I will have to think about.

I tried to activate treer.pl for phylogenetic analyses by activating useDistance

useDistance=1

but the run failed with error messages such as

Use of uninitialized value within @a in subtraction (-) at /home/greg/apps/HGTector-master/scripts/analyzer.pl line 533, line 11.

I'm attaching my config.txt file that works with the offending useDistance=1 commented out. When uncommented, the run fails

It seems that the pipeline is trying to use phylogenetic data before the alignments and trees are computed. Is there a workaround to compute the necessary phylogenetic data to successfully run the pipeline? Thanks.

Best, Greg config.txt

qiyunzhu commented 8 years ago

Hello Greg,

Glad to hear that the orthology function works. For Blast2GO, the file format is like:

SeqName Hit-Desc GO-Group GO-ID Term YP_989541.1 hsp33-like chaperonin P GO:0006457 protein folding YP_989541.1 hsp33-like chaperonin F GO:0051082 unfolded protein binding YP_989541.1 hsp33-like chaperonin C GO:0005737 cytoplasm YP_989541.1 hsp33-like chaperonin P GO:0006950 response to stress YP_989140.1 sensor histidine kinase F GO:0000156 two-component response regulator activity YP_989140.1 sensor histidine kinase P GO:0000160 two-component signal transduction system (phosphorelay) YP_989140.1 sensor histidine kinase P GO:0035556 intracellular signal transduction ...

One file per input genome. There is instruction in the HGTector GUI.

For the phylogenetic analysis, originally, my design was to have searcher.pl to retrieve the aligned part of the hit sequences, and to append them to each BLAST report file (Nexus format, under wkdir/search/genome/). If you grab one of them and take a look, are the sequences there? The phylogenetic pipeline will work on those sequences, and eventually append a tree as well as the phylogenetic distances to the same files.

Best, Qiyun

On Wed, Sep 7, 2016 at 2:23 PM, gpharhay notifications@github.com wrote:

Hi Qiyun: I am happy to report that I was able to reformat my list of RBBH (reciprocal best BLAST hits) as you specified to create an ortholog.db file and fire off the pipeline without error. This list was comprised of 16 genomes. I looked at some of the orthologs that were called HGT, and some appeared reasonable, others I will have to think about.

I tried to activate treer.pl for phylogenetic analyses by activating useDistance

useDistance=1

but the run failed with error messages such as

Use of uninitialized value within @a https://github.com/a in subtraction (-) at /home/greg/apps/HGTector-master/scripts/analyzer.pl line 533, line 11.

I'm attaching my config.txt file that works with the offending useDistance=1 commented out. When uncommented, the run fails

It seems that the pipeline is trying to use phylogenetic data before the alignments and trees are computed. Is there a workaround to compute the necessary phylogenetic data to successfully run the pipeline? Thanks.

Best, Greg config.txt https://github.com/DittmarLab/HGTector/files/460353/config.txt

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/DittmarLab/HGTector/issues/8#issuecomment-245423783, or mute the thread https://github.com/notifications/unsubscribe-auth/AEMVN_627n_k3pXWD6Ux_wE7aKX6qjThks5qnys_gaJpZM4J0XqL .

gpharhay commented 8 years ago

Hi Qiyun:

Thank you for the Blast2GO file format spec. I poked around in B2G v 3.3 and found the Export –> Export Annotation -> Export Annotation (.annot,Custom,etc) menu drop down sequence will get you to a custom annotation screen that will generate a file as you specified. I haven’t generated this annotation file for the 16 genomes yet, but I plan to in the next couple of days. I’ll get back to you about this.

By the way, I didn’t find the GO annotation file specification in the HGTector GUI ( near the item related to functional annotation), so thanks again for providing it to me in the email.

Regarding the phylogenetic analyses, I did find the Nexus formatted alignment data at the end of the BLAST report for each query protein sequence. Since these are avaialbe, something else must be wrong. I looked at the place in analyzer.pl (line 533) where the pipeline died and changed

                                                            $hit{'coverage'} = $a[6] if $hasCoverage;

                                                            if ($useDistance){
                                                                            if ($#a >= 6){ $hit{'score'} = 1 - $a[6+$hasCoverage]; }
                                                                            elsif (!@hits){ $hit{'score'} = 0; }
                                                                            else { $hit{'score'} = $hits[$#hits]{'score'}; }
                                                            }

to $hit{'coverage'} = $a[6] if $hasCoverage;

                                                            if ($useDistance){
                                                                            if ($#a >= 6){ $hit{'score'} = 1 - $a[6]; }
                                                                            elsif (!@hits){ $hit{'score'} = 0; }
                                                                            else { $hit{'score'} = $hits[$#hits]{'score'}; }
                                                            }

The important change has to do with changing $a[6+$hasCoverage] to $a[6]. Once I did that, the pipeline did not throw an error at this line, but a new error flared up further down in the analyzer.pl at line 723 in the middle of the graphing fingerprints subroutine

Raw data are saved in result/statistics/rawdata.txt. You may conduct further analyses on these data. Press Enter to continue, or Ctrl+C to exit:

Graphing fingerprints with R...Use of uninitialized value in concatenation (.) or string at /home/greg/apps/HGTector-master/scripts/analyzer.pl line 723, line 3. Problem while running this R command: lim0<-

Error: unexpected ';' in "lim0<-;" Execution halted Execution of analyzer.pl failed.

So I turned off graphFP and re-ran which generated new errors

Use of uninitialized value in concatenation (.) or string at /home/greg/apps/HGTector-master/scripts/analyzer.pl line 1258, line 3. Use of uninitialized value $h{"mean"} in sprintf at /home/greg/apps/HGTector-master/scripts/analyzer.pl line 1258, line 3. Use of uninitialized value $h{"mean"} in sprintf at /home/greg/apps/HGTector-master/scripts/analyzer.pl line 1258, line 3. Use of uninitialized value $h{"stdev"} in sprintf at /home/greg/apps/HGTector-master/scripts/analyzer.pl line 1258, line 3. Use of uninitialized value $h{"stdev"} in sprintf at /home/greg/apps/HGTector-master/scripts/analyzer.pl line 1258, line 3. Use of uninitialized value $h{"min"} in sprintf at /home/greg/apps/HGTector-master/scripts/analyzer.pl line 1258, line 3. Use of uninitialized value $h{"min"} in sprintf at /home/greg/apps/HGTector-master/scripts/analyzer.pl line 1258, line 3. Use of uninitialized value $h{"max"} in sprintf at /home/greg/apps/HGTector-master/scripts/analyzer.pl line 1258, line 3. Use of uninitialized value $h{"max"} in sprintf at /home/greg/apps/HGTector-master/scripts/analyzer.pl line 1258, line 3. Use of uninitialized value $h{"median"} in sprintf at /home/greg/apps/HGTector-master/scripts/analyzer.pl line 1258, line 3. Use of uninitialized value $h{"median"} in sprintf at /home/greg/apps/HGTector-master/scripts/analyzer.pl line 1258, line 3. Use of uninitialized value $h{"mad"} in sprintf at /home/greg/apps/HGTector-master/scripts/analyzer.pl line 1258, line 3. Use of uninitialized value $h{"mad"} in sprintf at /home/greg/apps/HGTector-master/scripts/analyzer.pl line 1258, line 3. Use of uninitialized value $h{"q1"} in sprintf at /home/greg/apps/HGTector-master/scripts/analyzer.pl line 1258, line 3. Use of uninitialized value $h{"q1"} in sprintf at /home/greg/apps/HGTector-master/scripts/analyzer.pl line 1258, line 3. Use of uninitialized value $h{"q3"} in sprintf at /home/greg/apps/HGTector-master/scripts/analyzer.pl line 1258, line 3. Use of uninitialized value $h{"q3"} in sprintf at /home/greg/apps/HGTector-master/scripts/analyzer.pl line 1258, line 3. Use of uninitialized value $h{"cutoff"} in sprintf at /home/greg/apps/HGTector-master/scripts/analyzer.pl line 1258, line 3. Use of uninitialized value $h{"cutoff"} in sprintf at /home/greg/apps/HGTector-master/scripts/analyzer.pl line 1258, line 3. Use of uninitialized value in concatenation (.) or string at /home/greg/apps/HGTector-master/scripts/analyzer.pl line 1258, line 3. Use of uninitialized value $h{"mean"} in sprintf at /home/greg/apps/HGTector-master/scripts/analyzer.pl line 1258, line 3. Use of uninitialized value $h{"mean"} in sprintf at /home/greg/apps/HGTector-master/scripts/analyzer.pl line 1258, line 3. Use of uninitialized value $h{"stdev"} in sprintf at /home/greg/apps/HGTector-master/scripts/analyzer.pl line 1258, line 3. Use of uninitialized value $h{"stdev"} in sprintf at /home/greg/apps/HGTector-master/scripts/analyzer.pl line 1258, line 3. Use of uninitialized value $h{"min"} in sprintf at /home/greg/apps/HGTector-master/scripts/analyzer.pl line 1258, line 3. Use of uninitialized value $h{"min"} in sprintf at /home/greg/apps/HGTector-master/scripts/analyzer.pl line 1258, line 3. Use of uninitialized value $h{"max"} in sprintf at /home/greg/apps/HGTector-master/scripts/analyzer.pl line 1258, line 3. Use of uninitialized value $h{"max"} in sprintf at /home/greg/apps/HGTector-master/scripts/analyzer.pl line 1258, line 3. Use of uninitialized value $h{"median"} in sprintf at /home/greg/apps/HGTector-master/scripts/analyzer.pl line 1258, line 3. Use of uninitialized value $h{"median"} in sprintf at /home/greg/apps/HGTector-master/scripts/analyzer.pl line 1258, line 3. Use of uninitialized value $h{"mad"} in sprintf at /home/greg/apps/HGTector-master/scripts/analyzer.pl line 1258, line 3. Use of uninitialized value $h{"mad"} in sprintf at /home/greg/apps/HGTector-master/scripts/analyzer.pl line 1258, line 3. Use of uninitialized value $h{"q1"} in sprintf at /home/greg/apps/HGTector-master/scripts/analyzer.pl line 1258, line 3. Use of uninitialized value $h{"q1"} in sprintf at /home/greg/apps/HGTector-master/scripts/analyzer.pl line 1258, line 3. Use of uninitialized value $h{"q3"} in sprintf at /home/greg/apps/HGTector-master/scripts/analyzer.pl line 1258, line 3. Use of uninitialized value $h{"q3"} in sprintf at /home/greg/apps/HGTector-master/scripts/analyzer.pl line 1258, line 3. Use of uninitialized value $h{"cutoff"} in sprintf at /home/greg/apps/HGTector-master/scripts/analyzer.pl line 1258, line 3. Use of uninitialized value $h{"cutoff"} in sprintf at /home/greg/apps/HGTector-master/scripts/analyzer.pl line 1258, line 3. Use of uninitialized value in concatenation (.) or string at /home/greg/apps/HGTector-master/scripts/analyzer.pl line 1258, line 3. Use of uninitialized value $h{"mean"} in sprintf at /home/greg/apps/HGTector-master/scripts/analyzer.pl line 1258, line 3. Use of uninitialized value $h{"mean"} in sprintf at /home/greg/apps/HGTector-master/scripts/analyzer.pl line 1258, line 3. Use of uninitialized value $h{"stdev"} in sprintf at /home/greg/apps/HGTector-master/scripts/analyzer.pl line 1258, line 3. Use of uninitialized value $h{"stdev"} in sprintf at /home/greg/apps/HGTector-master/scripts/analyzer.pl line 1258, line 3. Use of uninitialized value $h{"min"} in sprintf at /home/greg/apps/HGTector-master/scripts/analyzer.pl line 1258, line 3. Use of uninitialized value $h{"min"} in sprintf at /home/greg/apps/HGTector-master/scripts/analyzer.pl line 1258, line 3. Use of uninitialized value $h{"max"} in sprintf at /home/greg/apps/HGTector-master/scripts/analyzer.pl line 1258, line 3. Use of uninitialized value $h{"max"} in sprintf at /home/greg/apps/HGTector-master/scripts/analyzer.pl line 1258, line 3. Use of uninitialized value $h{"median"} in sprintf at /home/greg/apps/HGTector-master/scripts/analyzer.pl line 1258, line 3. Use of uninitialized value $h{"median"} in sprintf at /home/greg/apps/HGTector-master/scripts/analyzer.pl line 1258, line 3. Use of uninitialized value $h{"mad"} in sprintf at /home/greg/apps/HGTector-master/scripts/analyzer.pl line 1258, line 3. Use of uninitialized value $h{"mad"} in sprintf at /home/greg/apps/HGTector-master/scripts/analyzer.pl line 1258, line 3. Use of uninitialized value $h{"q1"} in sprintf at /home/greg/apps/HGTector-master/scripts/analyzer.pl line 1258, line 3. Use of uninitialized value $h{"q1"} in sprintf at /home/greg/apps/HGTector-master/scripts/analyzer.pl line 1258, line 3. Use of uninitialized value $h{"q3"} in sprintf at /home/greg/apps/HGTector-master/scripts/analyzer.pl line 1258, line 3. Use of uninitialized value $h{"q3"} in sprintf at /home/greg/apps/HGTector-master/scripts/analyzer.pl line 1258, line 3. Use of uninitialized value $h{"cutoff"} in sprintf at /home/greg/apps/HGTector-master/scripts/analyzer.pl line 1258, line 3. Use of uninitialized value $h{"cutoff"} in sprintf at /home/greg/apps/HGTector-master/scripts/analyzer.pl line 1258, line 3. Result is saved in result/statistics/fingerprint.txt.

It appears that fpN (fingerprint) variables aren’t be parsed into the phylogenetic arm of the script. Suggestions?

Greg

From: Qiyun Zhu notifications@github.com Reply-To: DittmarLab/HGTector reply@reply.github.com Date: Wednesday, September 7, 2016 at 5:14 PM To: DittmarLab/HGTector HGTector@noreply.github.com Cc: Gregory Harhay Gregory.Harhay@ARS.USDA.GOV, Author author@noreply.github.com Subject: Re: [DittmarLab/HGTector] Summarize HGT Events by gene orthology spawns runaway perl processes (#8)

Hello Greg,

Glad to hear that the orthology function works. For Blast2GO, the file format is like:

SeqName Hit-Desc GO-Group GO-ID Term YP_989541.1 hsp33-like chaperonin P GO:0006457 protein folding YP_989541.1 hsp33-like chaperonin F GO:0051082 unfolded protein binding YP_989541.1 hsp33-like chaperonin C GO:0005737 cytoplasm YP_989541.1 hsp33-like chaperonin P GO:0006950 response to stress YP_989140.1 sensor histidine kinase F GO:0000156 two-component response regulator activity YP_989140.1 sensor histidine kinase P GO:0000160 two-component signal transduction system (phosphorelay) YP_989140.1 sensor histidine kinase P GO:0035556 intracellular signal transduction ...

One file per input genome. There is instruction in the HGTector GUI.

For the phylogenetic analysis, originally, my design was to have searcher.pl to retrieve the aligned part of the hit sequences, and to append them to each BLAST report file (Nexus format, under wkdir/search/genome/). If you grab one of them and take a look, are the sequences there? The phylogenetic pipeline will work on those sequences, and eventually append a tree as well as the phylogenetic distances to the same files.

Best, Qiyun

On Wed, Sep 7, 2016 at 2:23 PM, gpharhay notifications@github.com wrote:

Hi Qiyun: I am happy to report that I was able to reformat my list of RBBH (reciprocal best BLAST hits) as you specified to create an ortholog.db file and fire off the pipeline without error. This list was comprised of 16 genomes. I looked at some of the orthologs that were called HGT, and some appeared reasonable, others I will have to think about.

I tried to activate treer.pl for phylogenetic analyses by activating useDistance

useDistance=1

but the run failed with error messages such as

Use of uninitialized value within @a https://github.com/a in subtraction (-) at /home/greg/apps/HGTector-master/scripts/analyzer.pl line 533, line 11.

I'm attaching my config.txt file that works with the offending useDistance=1 commented out. When uncommented, the run fails

It seems that the pipeline is trying to use phylogenetic data before the alignments and trees are computed. Is there a workaround to compute the necessary phylogenetic data to successfully run the pipeline? Thanks.

Best, Greg config.txt https://github.com/DittmarLab/HGTector/files/460353/config.txt

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/DittmarLab/HGTector/issues/8#issuecomment-245423783, or mute the thread https://github.com/notifications/unsubscribe-auth/AEMVN_627n_k3pXWD6Ux_wE7aKX6qjThks5qnys_gaJpZM4J0XqL .

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://github.com/DittmarLab/HGTector/issues/8#issuecomment-245436316, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AFIB3c_JNpijHn4AvzqF63htgURU9uSXks5qnzcugaJpZM4J0XqL.

This electronic message contains information generated by the USDA solely for the intended recipients. Any unauthorized interception of this message or the use or disclosure of the information it contains may violate the law and subject the violator to civil or criminal penalties. If you believe you have received this message in error, please notify the sender and delete the email immediately.

qiyunzhu commented 8 years ago

Hello Greg,

Sorry for not replying for such a long time! I am been busy with other matters. I read your latest post, and thought that the program might have picked something other than phylogenetic distance for computing fingerprints, or, phylogenetic distance is simply not there. Therefore $a[6] directs the program to a wrong column.

If you grab a random Nexus file, do you see the phylogenetic tree in the end, and an additional column in the table?

For Blast2GO output parsing, thanks for the catch!

Best, Qiyun

gpharhay commented 8 years ago

Hi Qiyun:

Thanks for getting back to me. It was just a guess that $a[6] was the problem. I'm attaching a real nexus file generated by the pipeline. It is typical of the files generated, just smaller than most. I thought it would be easier if you looked at the file directly. I still haven't processed the B2G output for all of the genomes to run through your pipeline. I'll start that today. Best, Greg A4211_00030.txt

qiyunzhu commented 8 years ago

Hello Greg,

Thanks for your reply! I looked at the Nexus file you sent to me. The tree and the phylogenetic distance are not there. Therefore $a[6] is actually coverage.

If this does not work out eventually, maybe put aside the idea of using phylogenetic pipeline for the moment, and test with the default way (Blast) instead?

Best, Qiyun

gpharhay commented 8 years ago

Hi Qiyun:

Ok, i can put aside the idea of using the phylogenetic pipeline for the moment. I'm trying to generate enough evidence to show that the analyses of the BLAST results distributions and cutoffs are reasonable. I'll read the HGTector paper more closely. Do you have advice, not in the paper, that you could share about selecting and testing cutoffs?

When you get around to upgrading the phylogenetic pipeline, I'd be happy to test. Thanks for your help.

Greg

qiyunzhu commented 8 years ago

Hello Greg,

Thanks for your valuable feedback! I recommend that you see the plots then choose cutoffs. Ideally there should be one atypical peak and one typical peak. In reality this is not always the case. By fine-tuning the program parameters one may get more distinguishable peaks. Some tricks can be used to change the shape of the curve: 1) make BLAST parameters more stringent/relaxed (don't have to re-run BLAST, these parameters are for hit filtering) 2) change the bandwidth factor of kernel density estimation (bwF), 3) cut of outliers. These tricks are part of the HGTector pipeline. However, if you take rawdata.txt and do some of your own statistics using external programs (e.g., R), you may gain more insights. For example, you may already have a list of genes of interest. In R you can generate the same plots as HGTector does, then highlight these genes to see where they are located.

Best, Qiyun

gpharhay commented 8 years ago

Hi Qiyun:

Thanks very much for the advice. It makes a lot of sense and you saved me a lot of time not having to work out these tricks myself. I’ve been pulled away on end-of-fiscal year business, but I’ll soon get back to playing around with the parameters as you suggest.

Greg

From: Qiyun Zhu notifications@github.com Reply-To: DittmarLab/HGTector reply@reply.github.com Date: Sunday, September 25, 2016 at 3:19 AM To: DittmarLab/HGTector HGTector@noreply.github.com Cc: Gregory Harhay Gregory.Harhay@ARS.USDA.GOV, Author author@noreply.github.com Subject: Re: [DittmarLab/HGTector] Summarize HGT Events by gene orthology spawns runaway perl processes (#8)

Hello Greg,

Thanks for your valuable feedback! I recommend that you see the plots then choose cutoffs. Ideally there should be one atypical peak and one typical peak. In reality this is not always the case. By fine-tuning the program parameters one may get more distinguishable peaks. Some tricks can be used to change the shape of the curve: 1) make BLAST parameters more stringent/relaxed (don't have to re-run BLAST, these parameters are for hit filtering) 2) change the bandwidth factor of kernel density estimation (bwF), 3) cut of outliers. These tricks are part of the HGTector pipeline. However, if you take rawdata.txt and do some of your own statistics using external programs (e.g., R), you may gain more insights. For example, you may already have a list of genes of interest. In R you can generate the same plots as HGTector does, then highlight these genes to see where they are located.

Best, Qiyun

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://github.com/DittmarLab/HGTector/issues/8#issuecomment-249409290, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AFIB3TqNaUsgUs-e8B5_y-Opl0EJw9caks5qti6agaJpZM4J0XqL.

This electronic message contains information generated by the USDA solely for the intended recipients. Any unauthorized interception of this message or the use or disclosure of the information it contains may violate the law and subject the violator to civil or criminal penalties. If you believe you have received this message in error, please notify the sender and delete the email immediately.