Closed amrita1983 closed 6 years ago
Hi, sorry for a long delay with the answer. -reg stays for regions. As an input one should provide a tab-delimited text file with the description of all regions. Basically, the genome_annotation.tab file is a BED file, with only following columns required: region ID, region start, region stop, chromosome, DNA strand. You can simply convert a GFF to BED file or, if you want, you can use a script LoadAnnotation.BioMart.R to download complete annotation for genome of interest (there are 60 genomes currently available in BioMART).
Hi,
thank you for your reply. I tried using a bed file using the following command perl -w /media/ngs/New\ Volume/Nuctools_data/nuctools-master/aggregate_profile.pl -in OCC_KK1_KK2/ChrA_C_glabrata_CBS138.KK1_KK2.w100.occ.gz -reg C_glabrata_final_anno.tab -al ChrA_C_glabrata_CBS138.name_template.aligned.txt.gz -av chr1.name_template.aggregate.txt --LibsizeNorm --PerBaseNorm --Cut_tail -lS 20000000 -chr ChrA_C_glabrata_CBS138
but these are the errors coming
unidentified strand for 22600 ( mito_C_glabrata_CBS138 18281 18367 gene + CaglfMt37 ): Use of uninitialized value within @strands in string eq at /media/ngs/New Volume/Nuctools_data/nuctools-master/aggregate_profile.pl line 469. Use of uninitialized value within @strands in string eq at /media/ngs/New Volume/Nuctools_data/nuctools-master/aggregate_profile.pl line 469. Use of uninitialized value within @strands in string eq at /media/ngs/New Volume/Nuctools_data/nuctools-master/aggregate_profile.pl line 469. Use of uninitialized value within @strands in string eq at /media/ngs/New Volume/Nuctools_data/nuctools-master/aggregate_profile.pl line 470. Use of uninitialized value within @strands in string eq at /media/ngs/New Volume/Nuctools_data/nuctools-master/aggregate_profile.pl line 470. Use of uninitialized value within @strands in string eq at /media/ngs/New Volume/Nuctools_data/nuctools-master/aggregate_profile.pl line 470. Use of uninitialized value in concatenation (.) or string at /media/ngs/New Volume/Nuctools_data/nuctools-master/aggregate_profile.pl line 471. unidentified strand for 22601 ( mito_C_glabrata_CBS138 18281 18367 tRNA + CaglfMt37-T ): loading reads for the coordinates range [0..3001] from OCC_KK1_KK2/ChrA_C_glabrata_CBS138.KK1_KK2.w100.occ.gz file of 0 MBs. Please wait...Use of uninitialized value $last_line in pattern match (m//) at /media/ngs/New Volume/Nuctools_data/nuctools-master/aggregate_profile.pl line 553. done in 0 seconds.0 seconds. 0.00 Mbs processed
0 strings from 4914 failed to load. 0 strings recovered 0 regions with annotation loaded 4884 coordinates out of range Calculate occupancy processing all chromosomes sorting occupancy for chromosome ChrA_C_glabrata_CBS138 by coordinate... chromosome ChrA_C_glabrata_CBS138 of 30 bases sorted Start processing... saving aggregate profile matrix to chr1.name_template.aggregate.txt.delta_100_1500.txt Use of uninitialized value in print at /media/ngs/New Volume/Nuctools_data/nuctools-master/aggregate_profile.pl line 1037. Use of uninitialized value in print at /media/ngs/New Volume/Nuctools_data/nuctools-master/aggregate_profile.pl line 1037. Use of uninitialized value in print at /media/ngs/New Volume/Nuctools_data/nuctools-master/aggregate_profile.pl line 1037.
Hi! I see some potential problems:
--gzip
flag at every step of a pipeline.Hi,
I am running this command for after correction:
perl -w nuctools-master/aggregate_profile.pl -reg C_glabrata_annotation.tab -idC 0 -chrC 3 -strC 4 -sC 1 -eC 2 -pbN -lsN -lS 75000000 -chr ChrB -al ChrB_C_glabrata_CBS138.KK1_KK2.occ_matrix -av ChrB_C_glabrata_CBS138.KK1_KK2.aggregate -in OCC_KK1_KK2/ChrB_C_glabrata_CBS138.KK1_KK2.w100.occ -upD 1000 -downD 1000 -z**
but still it is giving following error
Argument "B" isn't numeric in numeric eq (==) at nuctools-master/aggregate_profile.pl line 421,
loading reads for the coordinates range [0..1457688] from OCC_KK1_KK2/ChrB_C_glabrata_CBS138.KK1_KK2.w100.occ file of 0 MBs. Please wait...Use of uninitialized value $last_line in pattern match (m//) at nuctools-master/aggregate_profile.pl line 553. 0 Mbs processed in 0 seconds. ^M done in 0 seconds. 0.00 Mbs processed
0 strings from 5021 failed to load. 0 strings recovered 235 regions with annotation loaded 0 coordinates out of range Calculate occupancy processing all chromosomes sorting occupancy for chromosome ChrB by coordinate... chromosome ChrB of 5021 bases sorted Start processing...Use of uninitialized value $apply_DivSum_normalization in string eq at nuctools-master/aggregate_profile.pl line 907. Use of uninitialized value $apply_DivSum_normalization in string eq at nuctools-master/aggregate_profile.pl line 907. Use of uninitialized value $apply_DivSum_normalization in string eq at nuctools-master/aggregate_profile.pl line 907.
please note that my species is a fungus Candida glabrata, whose chromosomes are ChrA, ChrB like that. And the output file I am getting ChrB_C_glabrata_CBS138.KK1_KK2.aggregate.delta_1000_1000.txt is having no information other than -1000 -900 -800 -700 -600 -500 ........
please help
Hi,
current version of a script does not support letters as a chromosome IDs. The easiest way will be to replace the ChrA with chr1, ChrB with chr2 and so on in your genome annotation file. Sorry for inconvenience, but at the moment this is the only solution I can offer you. I will put your request to the list of thing to do, but at the moment I don't have time for that.
Best
Hi,
I have replaced the chromosomes names with chr1 and so on but still facing some error can you please help me to find the solution.
my command is : perl -w nuctools-master/aggregate_profile.pl -reg C_glabrata_annotation.tab -idC 0 -chrC 3 -strC 4 -sC 1 -eC 2 -pbN -lsN -lS 75000000 -chr chr2 -al chr2.KK1_KK2.occ_matrix -av chr2.KK1_KK2.aggregate -in OCC_KK1_KK2/chr2.KK1_KK2.w100.occ -upD 1000 -downD 1000
Error: loading reads for the coordinates range [0..503365] from OCC_KK1_KK2/chr2.KK1_KK2.w100.occ file of 0 MBs. Please wait...Use of uninitialized value $last_line in pattern match (m//) at nuctools-master/aggregate_profile.pl line 553. 0 Mbs processed in 0 seconds. ^M done in 0 seconds. 0.00 Mbs processed
0 strings from 5021 failed to load. 0 strings recovered 235 regions with annotation loaded 0 coordinates out of range Calculate occupancy processing all chromosomes sorting occupancy for chromosome chr2 by coordinate... chromosome chr2 of 5021 bases sorted Start processing...Use of uninitialized value $apply_DivSum_normalization in string eq at nuctools-master/aggregate_profile.pl line 907. Use of uninitialized value $apply_DivSum_normalization in string eq at nuctools-master/aggregate_profile.pl line 907.
Hi, thanks for report. I've pushed a new version of the script aggregate_profile.pl to Git. The warning message should disappear. Nevertheless, the warning message not explaining why the run fails.
In order to help you, please provide here complete run log, including all info at the beginning of the script output. You can attach it to your message with drug&drop as a file. In addition, it would be nice to see first couple of lines of your genome annotation table.
What I can see from the name of the input file (chr2.KK1_KK2.w100.occ) you've calculated occupancy using default parameter for window size (100). To get better occupancy profile use -w 1
option with bed2occupancy_average.pl
script.
Thank you for the help. I really need to complete the work.
Here I am pasting all the details
Here is the command i have used for the aggregate profile program perl -w /home/ngs/Downloads/NucTools/aggregate_profile.pl --regions=C_glabrata_annotation.tab --average_aligned=chr1_AT_stable-nucs.txt --chromosome_col=3 --region_start_column=1 --region_end_column=2 --strand_column=4 --GeneId_column=0 --ignore_strand --input=OCC_KK1_KK2/chr1.KK1_KK2.w1.occ --chromosome=chr1 --verbose --aligned=output.aligned.tab.gz --useCenter --upstream_delta=2000 --downstream_delta=2000
here is some lines of annotation file CAGL0A00105g 1608 2636 chr1 - CAGL0A00110g 1608 4809 chr1 - CAGL0A00116g 2671 4809 chr1 - CAGL0A00132g 11697 13042 chr1 + CAGL0A00154g 14977 15886 chr1 + CAGL0A00165g 17913 19017 chr1 - CAGL0A00187g 19120 21857 chr1 - CAGL0A00209g 22057 24493 chr1 + CAGL0A00231g 24957 27649 chr1 + CAGL0A00253g 27686 28779 chr1 - CAGL0A00275g 28538 30395 chr1 + CAGL0A00297g 29618 31199 chr1 - CAGL0A00319g 30897 32583 chr1 + CAGL0A00341g 32727 33858 chr1 + CAGL0A00363g 33938 36387 chr1 - CAGL0A00385g 36449 40397 chr1 -
AggregateProfile.log chr1_AT_stable-nucs.txt.delta_2000_2000.txt nohup_log.txt
1) Update aggregate_profile.pl
script to the latest version. I fixed there the issue with following warning message:
Argument "mito" isn't numeric in numeric eq (==) at /home/ngs/Downloads/NucTools/aggregate_profile.pl line 421
2) you have to use -w 1
flag with aggregate_profile.pl as well, because default is 100.
3) Again, update aggregate_profile.pl
script! Following warning message should not appear at all since it's referring to non-existing normalization method:
Use of uninitialized value $apply_DivSum_normalization in string eq at /home/ngs/Downloads/NucTools/aggregate_profile.pl line 907
4) Choose proper normalization method (just a suggestion). Either use --AgregateProfile
or --PerBaseNorm
normalization options
5) You can start your script without -w option, to suppress most of the non-relevant warnings:
perl aggregate_profile.pl --regions=C_glabrata_annotation.tab ...
Thank you for the help, I will try with all of your suggestions.
Is that Annotation file format is correct and according to the command?
Yes, it's fine, I've corrected the script to accept as well ChrTXT as chromosome ID so it should not produce errors. The rest to me looks fine. By some reason you were using very outdated version what causes the problem. Updating should help you.
I have updated the code as its now working find, really thank you so much for the help. I have one more doubt, can you please just give me any idea about that three different command type you have in your bash file?
bash_aggregate_profile_GSM2183911_AT_LOCs.sh
bash_aggregate_profile_GSM2183911_AT_fuzzy-nucs.sh
bash_aggregate_profile_GSM2183911_AT_stable-nucs.sh
what is the meaning or difference between these three? LOCs/fuzzy_nucs/stable-nucs you can also share some article if you have anything on this.
Have a look in the paper: https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-017-3580-2
This command line shell-scrips was used to illustrate the paper.
Hello, Sir,
Sorry to bother you once more, the program average_replicates.pl is not giving any output in the file, is that also a chr numeric dependent program, although the error is not coming like that, but out pur file has nothing except headers.
command: perl -w nuctools-master/average_replicates.pl --dir=/media/ngs/New\ Volume/Nuctools_data/OCC_KK1_KK2 --output=/media/ngs/New\ Volume/Nuctools_data/OCC_KK1_KK2/KK1_KK2_occup_tab.txt --pattern=".KK1_KK2.w1.occ.gz" --coordsCol=0 --occupCol=1 --printData --sum
13-6-2018 11:39:31
Output file: position Mean Sum stdev Rel.Error
only these headers
Can I have your mail id sir if it is ok with you so.
Hi,
1) download updated version of a script average_replicates.pl
. I've just updated it to add more file reading status information and extra check for a --gzip
flag
2) according to the value of parameter --pattern
from your command line, occupancy data is compressed. So, use --gzip
flag to enable it. This should solve the issue...
3) check the file pattern, just in case.
Sir,
Thanks for the help for all the previous steps, it worked fine and the paper also helped me a lot thank you once again. I am using the stable_nucs_replicates.pl the one you have updated 6 days ago but not getting any output, is that have the same Chr nomenclature issue, if so can you please help me to update the same.
Command: perl -w /home/ngs/Downloads/NucTools/stable_nucs_replicates.pl --input=../OCC_KK1_KK2 --output=out.bed.gz --chromosome=ChrA --printData --StableThreshold --fuzzy --fileExtention .occ.gz --gzip
I am using the directory for .occ.gz files as input.
My input directory has the files like this: ChrA.KK1_KK2.w1.occ.gz ChrB.KK1_KK2.w1.occ.gz ChrC.KK1_KK2.w1.occ.gz ChrD.KK1_KK2.w1.occ.gz ChrE.KK1_KK2.w1.occ.gz ChrF.KK1_KK2.w1.occ.gz ChrG.KK1_KK2.w1.occ.gz ChrH.KK1_KK2.w1.occ.gz ChrI.KK1_KK2.w1.occ.gz ChrJ.KK1_KK2.w1.occ.gz ChrK.KK1_KK2.w1.occ.gz ChrL.KK1_KK2.w1.occ.gz ChrM.KK1_KK2.w1.occ.gz
1) missing "=" sign in parameter definition. If you are using long format of parameters (--parameter) always use = with it:
--fileExtention="occ.gz"
2) in the input folder the script expecting to find multiple files from replicates belonging to only chromosome, specified in the --chromosome=
parameter.
Due to the (1) your command will not work, but due to the (2) it does not make sense, because you are trying to treat different chromosomes of a different length as a replicates.
Thank you for the reply Sir. I have corrected the command and files but still have no output Sir.
Command: perl /home/ngs/Downloads/NucTools/stable_nucs_replicates.pl --inputDir=/media/ngs/Data/Nuctools_data/Stable_nuc_replicates/ChrA_KK1_KK2_vs_KK3_KK4 --output=out.bed.gz --chromosome=ChrA --printData --StableThreshold --fileExtention=".KK1_KK2.w1.occ.gz" --fuzzy --gzip
File structure of the folder: (KK1_KK2 vs KK3_KK4 they are replicates) ChrA.KK1_KK2.w1.occ.gz ChrA.KK3_KK4.w1.occ.gz
You previous command --fileExtention=".KK1_KK2.w1.occ.gz"
did not work, because of trailing "." sign in the beginning of the pattern. As well, pattern should be common for all files in the folder. You have KK1_KK2 and KK3_KK4 files but in pattern specified only KK1_KK2:
--fileExtention="w1.occ.gz"
or --fileExtention="occ.gz"
should work for you.
Sir,
Even though I am trying with the same --fileExtention="w1.occ.gz" or --fileExtention="occ.gz" result is still not coming.
"process 0 files. Please wait..." message means program can't find any files in the directory with the file name template defined by --fileExtention
parameter.
1) Try to specify full path to the folder with your occupancy files.
2) Check directory/files permission
Sir, I have tried with all the options that you have mentioned but unfortunately it is not working at all
Command: perl /home/ngs/Downloads/NucTools/stable_nucs_replicates.pl --inputDir=/media/ngs/Data/Nuctools_data/Stable_nuc_replicates/ChrB_KK1_KK2_vs_KK3_KK4 --output=out.bed.gz --coordsCol=0 --occupCol=1 --chromosome=ChrB --printData --StableThreshold --fileExtention="occ.gz" --fuzzy --gzip
Sir, I am sending two occ file so that you can please try it in your system once. It would be a great help Sir, its is a very small file.
Hello, Thanks for the example. I've used your data and actually found a bug in the "average_replicate.pl" script, which I've tested by a mistake. Script was working correctly only if you start it from the folder with your data and specify full path to the output. I've pushed a new version to a Git. Please update.
Regarding stable_nucs_replicates.pl
script - in my hands it's working. I've updated it as well removing some very minor bugs. So, please as well update it.
I was testing it either using full or short path to folder with the data:
perl /home/ngs/Downloads/NucTools/stable_nucs_replicates.pl -in ChrB_KK1_KK2_vs_KK3_KK4 -out out.bed.gz -cC 0 -oC 1 -chr ChrB -p --StableThreshold=0.5 -p "occ.gz" --fuzzy --gzip
The value for --StableThreshold=0.5 was missing.
Thank you so much, stable_nucs_replicates.pl program worked for me also. Thank you for all the effort. I am using CMB for plots now, do i need to put .occ files of aggregate_profile.pl output once to generate the plot or any other. The example given in the Manual pdf does not seems to be the .occ files i have got.
As an input for CMB you use a matrix with aligned occupancy values produced by aggregate_profile.pl plot. Normally it's a biggest file in the output. I should say, regarding CMB I can't provide you practically no support, because at the moment I don't have a Matlab license and can't continue development on the project. The CMB was developed in Matlab 2014a and compiled for Windows 7 in Matlab 2015a/b and might not be working correctly under Windows 10. Therefore, the most reliable way of using CMB is to have a valid MatLab installation and run the CMB form source. When you start a CMB, I recommend to re-activate all options manually (namely choose normalization options, activate/deactivate all options before you press the "Run analysis" button. Unfortunately there is no other alternative at the moment.
Is there any program we can use to generate the matrix? or how I can get it the matrix? I have installed it in my Windows machine , but .occ files are not the one they are taking, can you please tell me how can I generate the matrix. Rest of the things i will try and run.
aggregate_profile.pl script does generate the matrix which can be visualized as a heatmap. The name of such file you specify when running aggregate_profile.pl script in option --aligned=
or -al
.
In last your example of using this script, you where specifying it like following:
--aligned=output.aligned.tab.gz
So, if you used same file name for every chromosome, most likely you overwritten it every time and now this file contains very last chromosome you've analyzed with the aggregate_profile.pl script
I have regenerated those matrix files and run them in CMB. Thanks for the help. One query is there how can I see all the chromosome in a same graph (is it possible) or anything like this in the same graph.
Unfortunately CMB does not allow to combine different datasets. But second output file of aggregate_profile.pl script, is surprisingly an aggregate profile :)
The 2 columns file is specified by the parameter --average_aligned=
. First column is a coordinate, second - average occupancy.
One can just collect this files together and plot profiles in Excel, for example
Sir,
I have one query how much the nucleosome occupancy score varies? I mean what can be the max and min limit of the score? As i have to set a cutoff for that compare condition program which is default .8 and .5.
Hello,
please update the script compare_two_conditions.pl
. I've just corrected there a minor bug...
Occupancy should be on average equal to experimental coverage (Total Nr. of mapped reads divided by the genome length). In the reality, of cause, you have a range from 0 (no reads at all) to many thousands of reads corresponding to low complexity regions.
The value of normalized difference is always spanning from 0 to +2. Therefore, setting the lower filter will define the "similarity" (smaller the number, closer the values). Upper filter helps to identify regions with higher differences.
Thanks, but the code is not showing the updated date till now. Still it is showing updated 20 days ago. Actually the occupancy data of my sample in a chromosome with window size 10 is varying from 11 to up-to thousands. So i am getting confusion that how much they will get normalized in later stages. Am i doing correct or not
no idea why it did not updated.... I did it now.
script compare_two_conditions.pl
does not change occupancy values. What is doing - it's calculating the difference between occupancy values of sample 1 and sample 2, and re scale the absolute value of a difference from 0 to 2.
Sorry, I've just change the code back to previous. Everything was correct. The only problem is the default thresholds values. So, again. Upper threshold is now set to 0.99 by default and lower threshold to -0.99. If the normalized difference is above upper threshold, than the nucleosome occupancy increased in first condition and decrease in second. If the difference is below lower threshold, than the nucleosome is lost in first condition and increased in second condition correspondingly.
Hi, Can you please just tell me what do you mean by -reg genome_annotation.tab option in that program, what format and what file is needed i am not clear, can I use a GFF fiel for my species for that. Please let me know.