homeveg / nuctools

software for analysis of chromatin feature occupancy profiles from high-throughput sequencing data
GNU General Public License v3.0
15 stars 3 forks source link

Need help for the program aggregate_profile.pl #10

Closed amrita1983 closed 6 years ago

amrita1983 commented 6 years ago

Hi, Can you please just tell me what do you mean by -reg genome_annotation.tab option in that program, what format and what file is needed i am not clear, can I use a GFF fiel for my species for that. Please let me know.

homeveg commented 6 years ago

Hi, sorry for a long delay with the answer. -reg stays for regions. As an input one should provide a tab-delimited text file with the description of all regions. Basically, the genome_annotation.tab file is a BED file, with only following columns required: region ID, region start, region stop, chromosome, DNA strand. You can simply convert a GFF to BED file or, if you want, you can use a script LoadAnnotation.BioMart.R to download complete annotation for genome of interest (there are 60 genomes currently available in BioMART).

amrita1983 commented 6 years ago

Hi,

thank you for your reply. I tried using a bed file using the following command perl -w /media/ngs/New\ Volume/Nuctools_data/nuctools-master/aggregate_profile.pl -in OCC_KK1_KK2/ChrA_C_glabrata_CBS138.KK1_KK2.w100.occ.gz -reg C_glabrata_final_anno.tab -al ChrA_C_glabrata_CBS138.name_template.aligned.txt.gz -av chr1.name_template.aggregate.txt --LibsizeNorm --PerBaseNorm --Cut_tail -lS 20000000 -chr ChrA_C_glabrata_CBS138

but these are the errors coming

unidentified strand for 22600 ( mito_C_glabrata_CBS138 18281 18367 gene + CaglfMt37 ): Use of uninitialized value within @strands in string eq at /media/ngs/New Volume/Nuctools_data/nuctools-master/aggregate_profile.pl line 469. Use of uninitialized value within @strands in string eq at /media/ngs/New Volume/Nuctools_data/nuctools-master/aggregate_profile.pl line 469. Use of uninitialized value within @strands in string eq at /media/ngs/New Volume/Nuctools_data/nuctools-master/aggregate_profile.pl line 469. Use of uninitialized value within @strands in string eq at /media/ngs/New Volume/Nuctools_data/nuctools-master/aggregate_profile.pl line 470. Use of uninitialized value within @strands in string eq at /media/ngs/New Volume/Nuctools_data/nuctools-master/aggregate_profile.pl line 470. Use of uninitialized value within @strands in string eq at /media/ngs/New Volume/Nuctools_data/nuctools-master/aggregate_profile.pl line 470. Use of uninitialized value in concatenation (.) or string at /media/ngs/New Volume/Nuctools_data/nuctools-master/aggregate_profile.pl line 471. unidentified strand for 22601 ( mito_C_glabrata_CBS138 18281 18367 tRNA + CaglfMt37-T ): loading reads for the coordinates range [0..3001] from OCC_KK1_KK2/ChrA_C_glabrata_CBS138.KK1_KK2.w100.occ.gz file of 0 MBs. Please wait...Use of uninitialized value $last_line in pattern match (m//) at /media/ngs/New Volume/Nuctools_data/nuctools-master/aggregate_profile.pl line 553. done in 0 seconds.0 seconds. 0.00 Mbs processed

0 strings from 4914 failed to load. 0 strings recovered 0 regions with annotation loaded 4884 coordinates out of range Calculate occupancy processing all chromosomes sorting occupancy for chromosome ChrA_C_glabrata_CBS138 by coordinate... chromosome ChrA_C_glabrata_CBS138 of 30 bases sorted Start processing... saving aggregate profile matrix to chr1.name_template.aggregate.txt.delta_100_1500.txt Use of uninitialized value in print at /media/ngs/New Volume/Nuctools_data/nuctools-master/aggregate_profile.pl line 1037. Use of uninitialized value in print at /media/ngs/New Volume/Nuctools_data/nuctools-master/aggregate_profile.pl line 1037. Use of uninitialized value in print at /media/ngs/New Volume/Nuctools_data/nuctools-master/aggregate_profile.pl line 1037.

homeveg commented 6 years ago

Hi! I see some potential problems:

amrita1983 commented 6 years ago

Hi,

I am running this command for after correction:

perl -w nuctools-master/aggregate_profile.pl -reg C_glabrata_annotation.tab -idC 0 -chrC 3 -strC 4 -sC 1 -eC 2 -pbN -lsN -lS 75000000 -chr ChrB -al ChrB_C_glabrata_CBS138.KK1_KK2.occ_matrix -av ChrB_C_glabrata_CBS138.KK1_KK2.aggregate -in OCC_KK1_KK2/ChrB_C_glabrata_CBS138.KK1_KK2.w100.occ -upD 1000 -downD 1000 -z**

but still it is giving following error Argument "B" isn't numeric in numeric eq (==) at nuctools-master/aggregate_profile.pl line 421, line 1. Argument "A" isn't numeric in numeric eq (==) at nuctools-master/aggregate_profile.pl line 421, line 1. Argument "B" isn't numeric in numeric eq (==) at nuctools-master/aggregate_profile.pl line 421, line 2.

loading reads for the coordinates range [0..1457688] from OCC_KK1_KK2/ChrB_C_glabrata_CBS138.KK1_KK2.w100.occ file of 0 MBs. Please wait...Use of uninitialized value $last_line in pattern match (m//) at nuctools-master/aggregate_profile.pl line 553. 0 Mbs processed in 0 seconds. ^M done in 0 seconds. 0.00 Mbs processed

0 strings from 5021 failed to load. 0 strings recovered 235 regions with annotation loaded 0 coordinates out of range Calculate occupancy processing all chromosomes sorting occupancy for chromosome ChrB by coordinate... chromosome ChrB of 5021 bases sorted Start processing...Use of uninitialized value $apply_DivSum_normalization in string eq at nuctools-master/aggregate_profile.pl line 907. Use of uninitialized value $apply_DivSum_normalization in string eq at nuctools-master/aggregate_profile.pl line 907. Use of uninitialized value $apply_DivSum_normalization in string eq at nuctools-master/aggregate_profile.pl line 907.

please note that my species is a fungus Candida glabrata, whose chromosomes are ChrA, ChrB like that. And the output file I am getting ChrB_C_glabrata_CBS138.KK1_KK2.aggregate.delta_1000_1000.txt is having no information other than -1000 -900 -800 -700 -600 -500 ........

please help

homeveg commented 6 years ago

Hi,

current version of a script does not support letters as a chromosome IDs. The easiest way will be to replace the ChrA with chr1, ChrB with chr2 and so on in your genome annotation file. Sorry for inconvenience, but at the moment this is the only solution I can offer you. I will put your request to the list of thing to do, but at the moment I don't have time for that.

Best

amrita1983 commented 6 years ago

Hi,

I have replaced the chromosomes names with chr1 and so on but still facing some error can you please help me to find the solution.

my command is : perl -w nuctools-master/aggregate_profile.pl -reg C_glabrata_annotation.tab -idC 0 -chrC 3 -strC 4 -sC 1 -eC 2 -pbN -lsN -lS 75000000 -chr chr2 -al chr2.KK1_KK2.occ_matrix -av chr2.KK1_KK2.aggregate -in OCC_KK1_KK2/chr2.KK1_KK2.w100.occ -upD 1000 -downD 1000

Error: loading reads for the coordinates range [0..503365] from OCC_KK1_KK2/chr2.KK1_KK2.w100.occ file of 0 MBs. Please wait...Use of uninitialized value $last_line in pattern match (m//) at nuctools-master/aggregate_profile.pl line 553. 0 Mbs processed in 0 seconds. ^M done in 0 seconds. 0.00 Mbs processed

0 strings from 5021 failed to load. 0 strings recovered 235 regions with annotation loaded 0 coordinates out of range Calculate occupancy processing all chromosomes sorting occupancy for chromosome chr2 by coordinate... chromosome chr2 of 5021 bases sorted Start processing...Use of uninitialized value $apply_DivSum_normalization in string eq at nuctools-master/aggregate_profile.pl line 907. Use of uninitialized value $apply_DivSum_normalization in string eq at nuctools-master/aggregate_profile.pl line 907.

homeveg commented 6 years ago

Hi, thanks for report. I've pushed a new version of the script aggregate_profile.pl to Git. The warning message should disappear. Nevertheless, the warning message not explaining why the run fails.

In order to help you, please provide here complete run log, including all info at the beginning of the script output. You can attach it to your message with drug&drop as a file. In addition, it would be nice to see first couple of lines of your genome annotation table.

What I can see from the name of the input file (chr2.KK1_KK2.w100.occ) you've calculated occupancy using default parameter for window size (100). To get better occupancy profile use -w 1 option with bed2occupancy_average.pl script.

amrita1983 commented 6 years ago

Thank you for the help. I really need to complete the work.

Here I am pasting all the details

Here is the command i have used for the aggregate profile program perl -w /home/ngs/Downloads/NucTools/aggregate_profile.pl --regions=C_glabrata_annotation.tab --average_aligned=chr1_AT_stable-nucs.txt --chromosome_col=3 --region_start_column=1 --region_end_column=2 --strand_column=4 --GeneId_column=0 --ignore_strand --input=OCC_KK1_KK2/chr1.KK1_KK2.w1.occ --chromosome=chr1 --verbose --aligned=output.aligned.tab.gz --useCenter --upstream_delta=2000 --downstream_delta=2000

here is some lines of annotation file CAGL0A00105g 1608 2636 chr1 - CAGL0A00110g 1608 4809 chr1 - CAGL0A00116g 2671 4809 chr1 - CAGL0A00132g 11697 13042 chr1 + CAGL0A00154g 14977 15886 chr1 + CAGL0A00165g 17913 19017 chr1 - CAGL0A00187g 19120 21857 chr1 - CAGL0A00209g 22057 24493 chr1 + CAGL0A00231g 24957 27649 chr1 + CAGL0A00253g 27686 28779 chr1 - CAGL0A00275g 28538 30395 chr1 + CAGL0A00297g 29618 31199 chr1 - CAGL0A00319g 30897 32583 chr1 + CAGL0A00341g 32727 33858 chr1 + CAGL0A00363g 33938 36387 chr1 - CAGL0A00385g 36449 40397 chr1 -

AggregateProfile.log chr1_AT_stable-nucs.txt.delta_2000_2000.txt nohup_log.txt

homeveg commented 6 years ago

1) Update aggregate_profile.pl script to the latest version. I fixed there the issue with following warning message: Argument "mito" isn't numeric in numeric eq (==) at /home/ngs/Downloads/NucTools/aggregate_profile.pl line 421 2) you have to use -w 1 flag with aggregate_profile.pl as well, because default is 100. 3) Again, update aggregate_profile.pl script! Following warning message should not appear at all since it's referring to non-existing normalization method: Use of uninitialized value $apply_DivSum_normalization in string eq at /home/ngs/Downloads/NucTools/aggregate_profile.pl line 907 4) Choose proper normalization method (just a suggestion). Either use --AgregateProfile or --PerBaseNorm normalization options 5) You can start your script without -w option, to suppress most of the non-relevant warnings: perl aggregate_profile.pl --regions=C_glabrata_annotation.tab ...

amrita1983 commented 6 years ago

Thank you for the help, I will try with all of your suggestions.

Is that Annotation file format is correct and according to the command?

homeveg commented 6 years ago

Yes, it's fine, I've corrected the script to accept as well ChrTXT as chromosome ID so it should not produce errors. The rest to me looks fine. By some reason you were using very outdated version what causes the problem. Updating should help you.

amrita1983 commented 6 years ago

I have updated the code as its now working find, really thank you so much for the help. I have one more doubt, can you please just give me any idea about that three different command type you have in your bash file?

bash_aggregate_profile_GSM2183911_AT_LOCs.sh
bash_aggregate_profile_GSM2183911_AT_fuzzy-nucs.sh bash_aggregate_profile_GSM2183911_AT_stable-nucs.sh

what is the meaning or difference between these three? LOCs/fuzzy_nucs/stable-nucs you can also share some article if you have anything on this.

homeveg commented 6 years ago

Have a look in the paper: https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-017-3580-2

This command line shell-scrips was used to illustrate the paper.

amrita1983 commented 6 years ago

Hello, Sir,

Sorry to bother you once more, the program average_replicates.pl is not giving any output in the file, is that also a chr numeric dependent program, although the error is not coming like that, but out pur file has nothing except headers.

command: perl -w nuctools-master/average_replicates.pl --dir=/media/ngs/New\ Volume/Nuctools_data/OCC_KK1_KK2 --output=/media/ngs/New\ Volume/Nuctools_data/OCC_KK1_KK2/KK1_KK2_occup_tab.txt --pattern=".KK1_KK2.w1.occ.gz" --coordsCol=0 --occupCol=1 --printData --sum

Output log: Duplicate specification "list|l=s" for option "list" Duplicate specification "list|l=s" for option "l" ZGIP support disabled

13-6-2018 11:39:31

calculating StDev, Variance, Sum and average. Results will be saved to /media/ngs/New Volume/Nuctools_data/OCC_KK1_KK2/KK1_KK2_occup_tab.txt processing 0 entries... done!

13-6-2018 11:39:31

Output file: position Mean Sum stdev Rel.Error

only these headers

Can I have your mail id sir if it is ok with you so.

homeveg commented 6 years ago

Hi, 1) download updated version of a script average_replicates.pl. I've just updated it to add more file reading status information and extra check for a --gzip flag 2) according to the value of parameter --pattern from your command line, occupancy data is compressed. So, use --gzip flag to enable it. This should solve the issue... 3) check the file pattern, just in case.

amrita1983 commented 6 years ago

Sir,

Thanks for the help for all the previous steps, it worked fine and the paper also helped me a lot thank you once again. I am using the stable_nucs_replicates.pl the one you have updated 6 days ago but not getting any output, is that have the same Chr nomenclature issue, if so can you please help me to update the same.

Command: perl -w /home/ngs/Downloads/NucTools/stable_nucs_replicates.pl --input=../OCC_KK1_KK2 --output=out.bed.gz --chromosome=ChrA --printData --StableThreshold --fuzzy --fileExtention .occ.gz --gzip

I am using the directory for .occ.gz files as input.

Log file: ZGIP support enabled

19-6-2018 13:51:39

process 0 files. Please wait... calcualting StDev, Variance and average. Results will be saved to out.bed.gz processing 0 entries... done!

19-6-2018 13:51:39

My input directory has the files like this: ChrA.KK1_KK2.w1.occ.gz ChrB.KK1_KK2.w1.occ.gz ChrC.KK1_KK2.w1.occ.gz ChrD.KK1_KK2.w1.occ.gz ChrE.KK1_KK2.w1.occ.gz ChrF.KK1_KK2.w1.occ.gz ChrG.KK1_KK2.w1.occ.gz ChrH.KK1_KK2.w1.occ.gz ChrI.KK1_KK2.w1.occ.gz ChrJ.KK1_KK2.w1.occ.gz ChrK.KK1_KK2.w1.occ.gz ChrL.KK1_KK2.w1.occ.gz ChrM.KK1_KK2.w1.occ.gz

homeveg commented 6 years ago

1) missing "=" sign in parameter definition. If you are using long format of parameters (--parameter) always use = with it: --fileExtention="occ.gz" 2) in the input folder the script expecting to find multiple files from replicates belonging to only chromosome, specified in the --chromosome= parameter. Due to the (1) your command will not work, but due to the (2) it does not make sense, because you are trying to treat different chromosomes of a different length as a replicates.

amrita1983 commented 6 years ago

Thank you for the reply Sir. I have corrected the command and files but still have no output Sir.

Command: perl /home/ngs/Downloads/NucTools/stable_nucs_replicates.pl --inputDir=/media/ngs/Data/Nuctools_data/Stable_nuc_replicates/ChrA_KK1_KK2_vs_KK3_KK4 --output=out.bed.gz --chromosome=ChrA --printData --StableThreshold --fileExtention=".KK1_KK2.w1.occ.gz" --fuzzy --gzip

File structure of the folder: (KK1_KK2 vs KK3_KK4 they are replicates) ChrA.KK1_KK2.w1.occ.gz ChrA.KK3_KK4.w1.occ.gz

log file is showing the same problem and no output is coming Sir. ZGIP support enabled

20-6-2018 10:54:21

process 0 files. Please wait... calcualting StDev, Variance and average. Results will be saved to out.bed.gz processing 0 entries... done!

20-6-2018 10:54:21

homeveg commented 6 years ago

You previous command --fileExtention=".KK1_KK2.w1.occ.gz" did not work, because of trailing "." sign in the beginning of the pattern. As well, pattern should be common for all files in the folder. You have KK1_KK2 and KK3_KK4 files but in pattern specified only KK1_KK2:

--fileExtention="w1.occ.gz" or --fileExtention="occ.gz" should work for you.

amrita1983 commented 6 years ago

Sir,

Even though I am trying with the same --fileExtention="w1.occ.gz" or --fileExtention="occ.gz" result is still not coming.

homeveg commented 6 years ago

"process 0 files. Please wait..." message means program can't find any files in the directory with the file name template defined by --fileExtention parameter. 1) Try to specify full path to the folder with your occupancy files. 2) Check directory/files permission

amrita1983 commented 6 years ago

Sir, I have tried with all the options that you have mentioned but unfortunately it is not working at all

Command: perl /home/ngs/Downloads/NucTools/stable_nucs_replicates.pl --inputDir=/media/ngs/Data/Nuctools_data/Stable_nuc_replicates/ChrB_KK1_KK2_vs_KK3_KK4 --output=out.bed.gz --coordsCol=0 --occupCol=1 --chromosome=ChrB --printData --StableThreshold --fileExtention="occ.gz" --fuzzy --gzip

Output log file: ZGIP support enabled

22-6-2018 10:39:16

process 0 files. Please wait... calcualting StDev, Variance and average. Results will be saved to out.bed.gz processing 0 entries... done!

22-6-2018 10:39:16

Sir, I am sending two occ file so that you can please try it in your system once. It would be a great help Sir, its is a very small file.

ChrB.KK1_KK2.occ.gz ChrB.KK3_KK4.occ.gz

homeveg commented 6 years ago

Hello, Thanks for the example. I've used your data and actually found a bug in the "average_replicate.pl" script, which I've tested by a mistake. Script was working correctly only if you start it from the folder with your data and specify full path to the output. I've pushed a new version to a Git. Please update.

Regarding stable_nucs_replicates.pl script - in my hands it's working. I've updated it as well removing some very minor bugs. So, please as well update it. I was testing it either using full or short path to folder with the data:

perl /home/ngs/Downloads/NucTools/stable_nucs_replicates.pl -in ChrB_KK1_KK2_vs_KK3_KK4 -out out.bed.gz -cC 0 -oC 1 -chr ChrB -p --StableThreshold=0.5 -p "occ.gz" --fuzzy --gzip

The value for --StableThreshold=0.5 was missing.

amrita1983 commented 6 years ago

Thank you so much, stable_nucs_replicates.pl program worked for me also. Thank you for all the effort. I am using CMB for plots now, do i need to put .occ files of aggregate_profile.pl output once to generate the plot or any other. The example given in the Manual pdf does not seems to be the .occ files i have got.

homeveg commented 6 years ago

As an input for CMB you use a matrix with aligned occupancy values produced by aggregate_profile.pl plot. Normally it's a biggest file in the output. I should say, regarding CMB I can't provide you practically no support, because at the moment I don't have a Matlab license and can't continue development on the project. The CMB was developed in Matlab 2014a and compiled for Windows 7 in Matlab 2015a/b and might not be working correctly under Windows 10. Therefore, the most reliable way of using CMB is to have a valid MatLab installation and run the CMB form source. When you start a CMB, I recommend to re-activate all options manually (namely choose normalization options, activate/deactivate all options before you press the "Run analysis" button. Unfortunately there is no other alternative at the moment.

amrita1983 commented 6 years ago

Is there any program we can use to generate the matrix? or how I can get it the matrix? I have installed it in my Windows machine , but .occ files are not the one they are taking, can you please tell me how can I generate the matrix. Rest of the things i will try and run.

homeveg commented 6 years ago

aggregate_profile.pl script does generate the matrix which can be visualized as a heatmap. The name of such file you specify when running aggregate_profile.pl script in option --aligned= or -al. In last your example of using this script, you where specifying it like following: --aligned=output.aligned.tab.gz So, if you used same file name for every chromosome, most likely you overwritten it every time and now this file contains very last chromosome you've analyzed with the aggregate_profile.pl script

amrita1983 commented 6 years ago

I have regenerated those matrix files and run them in CMB. Thanks for the help. One query is there how can I see all the chromosome in a same graph (is it possible) or anything like this in the same graph.

capture

homeveg commented 6 years ago

Unfortunately CMB does not allow to combine different datasets. But second output file of aggregate_profile.pl script, is surprisingly an aggregate profile :) The 2 columns file is specified by the parameter --average_aligned=. First column is a coordinate, second - average occupancy. One can just collect this files together and plot profiles in Excel, for example

amrita1983 commented 6 years ago

Sir,

I have one query how much the nucleosome occupancy score varies? I mean what can be the max and min limit of the score? As i have to set a cutoff for that compare condition program which is default .8 and .5.

homeveg commented 6 years ago

Hello, please update the script compare_two_conditions.pl. I've just corrected there a minor bug... Occupancy should be on average equal to experimental coverage (Total Nr. of mapped reads divided by the genome length). In the reality, of cause, you have a range from 0 (no reads at all) to many thousands of reads corresponding to low complexity regions. The value of normalized difference is always spanning from 0 to +2. Therefore, setting the lower filter will define the "similarity" (smaller the number, closer the values). Upper filter helps to identify regions with higher differences.

amrita1983 commented 6 years ago

Thanks, but the code is not showing the updated date till now. Still it is showing updated 20 days ago. Actually the occupancy data of my sample in a chromosome with window size 10 is varying from 11 to up-to thousands. So i am getting confusion that how much they will get normalized in later stages. Am i doing correct or not

homeveg commented 6 years ago

no idea why it did not updated.... I did it now. script compare_two_conditions.pl does not change occupancy values. What is doing - it's calculating the difference between occupancy values of sample 1 and sample 2, and re scale the absolute value of a difference from 0 to 2.

homeveg commented 6 years ago

Sorry, I've just change the code back to previous. Everything was correct. The only problem is the default thresholds values. So, again. Upper threshold is now set to 0.99 by default and lower threshold to -0.99. If the normalized difference is above upper threshold, than the nucleosome occupancy increased in first condition and decrease in second. If the difference is below lower threshold, than the nucleosome is lost in first condition and increased in second condition correspondingly.