Closed MManee closed 6 years ago
For newcomers to GitHub, this thread (#1) should be useful!
For the coding steps:
bedtools intersect -a reads.bed -b genes.bed > coding.bed
For the noncoding steps:
bedtools complement -i coding.bed -g genes_non.bed > non-coding.bed
After running the complement command, the following error appeared:
Error: Sorted input specified, but the file coding.bed has the following out of order record
chr4 252055 252253 CG1674-RC_exon_0_0_chr4_252056_f 0 +
To solve this problem, I sorted the records in "coding.bed" with the bedtools command:
sortBed -i coding.bed > sorted_coding.bed
The sort produced a new sorted file, "sorted_coding.bed".
I then ran the complement command with the new sorted file, but got a new error, this time for the record:
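For anyone hitting the same message, a small check like the one below can locate the offending record before rerunning bedtools. This is only a sketch: `first_out_of_order` is a hypothetical helper, not part of bedtools, and it assumes whitespace-separated BED records that should be grouped by chromosome and ascending by start within each chromosome.

```python
def first_out_of_order(bed_lines):
    """Return the first record that breaks (chromosome, start) order, or None.

    Assumption: "sorted" means each chromosome forms one contiguous block
    and starts are non-decreasing inside that block.
    """
    finished = set()              # chromosomes whose block has already ended
    prev_chrom, prev_start = None, -1
    for line in bed_lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue              # skip blank lines and header lines
        fields = line.split()
        chrom, start = fields[0], int(fields[1])
        if chrom != prev_chrom:
            if chrom in finished:
                return line.rstrip("\n")   # chromosome reappears later in the file
            if prev_chrom is not None:
                finished.add(prev_chrom)
            prev_chrom, prev_start = chrom, start
        elif start < prev_start:
            return line.rstrip("\n")       # start goes backwards within a chromosome
        else:
            prev_start = start
    return None
```

With a real file this could be called as `first_out_of_order(open("coding.bed"))`; a return value of `None` means no out-of-order record was found under the assumption above.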
chrU 18133 18346 CG40189-RA_exon_0_0_chrU_18134_f 0 +
To solve the new error, I merged the records in "sorted_coding.bed" with the bedtools command:
bedtools merge -i sorted_coding.bed > merged_sorted_coding.bed
I then ran the complement command with the new merged, sorted file "merged_sorted_coding.bed", but got the same error as before.
bedtools complement -i merged_sorted_coding.bed -g genes_non.bed > non-coding.bed
To solve this error, I traced it back to the "genes_non.bed" file. I moved the "chrU" record to the end of the file, saved it, and reran the complement command.
After rerunning the complement command, I got the same error for a new record, "chrUextra".
I repeated the previous step for this error by moving "chrUextra" to the end of the "genes_non.bed" file.
I got the same error again for the next record, so I repeated the step for the remaining records. After that, the error went away.
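The manual reordering described above could also be scripted. The sketch below rewrites the genome file so that its chromosomes follow the order in which they first appear in the BED file, with any chromosomes missing from the BED file kept at the end. It assumes the genome file has two whitespace-separated columns (chromosome name, size); `reorder_genome` is a hypothetical helper name.

```python
def reorder_genome(genome_lines, bed_lines):
    """Return genome-file lines reordered to match the BED chromosome order."""
    # Order in which chromosomes first appear in the BED file.
    bed_order = []
    seen = set()
    for line in bed_lines:
        if line.strip():
            chrom = line.split()[0]
            if chrom not in seen:
                seen.add(chrom)
                bed_order.append(chrom)

    # Chromosome sizes, in the genome file's original order.
    sizes = {}
    for line in genome_lines:
        if line.strip():
            chrom, size = line.split()[:2]
            sizes[chrom] = size

    # BED chromosomes first, then the ones the BED file never mentions.
    ordered = [c for c in bed_order if c in sizes]
    ordered += [c for c in sizes if c not in seen]
    return [f"{c}\t{sizes[c]}" for c in ordered]
```

Sorting both files with the same `sort -k 1,1` key, as done later in this thread, achieves the same agreement without a script.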
Please find these files:
To verify the previous files, I calculated the coding and noncoding lengths and compared their sum with the chromosome lengths. The totals were not equal.
To solve this problem, we need to try the following steps:
1- Find the coding and noncoding data from the original files using bedtools.
coding_subtract.txt
non_coding_subtract.txt
bedtools intersect -a reads.bed -b genes.bed > coding.bed
2- Find the noncoding data using bedtools.
bedtools complement -i coding.bed -g genes_non.bed > non-coding.bed
3- Apply the solution from the previous issues that resolves the complement error.
4- Find the coding data again from the noncoding data by running the complement command.
5- Run the subtract command on both files.
Result of the calculation:
total noncoding length: 139034833
total coding length: 29701678
total coding + noncoding: 168736511
genome length: 168736537
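The length comparison above can be reproduced with a few lines of Python. This is a sketch that assumes standard BED coordinates (0-based start, end-exclusive, so an interval's length is `end - start`) and non-overlapping intervals; overlapping records would be counted twice, which is one reason to merge first. `total_bed_length` is a hypothetical helper name, and the three totals are the ones reported in this comment.

```python
def total_bed_length(bed_lines):
    """Sum end - start over all BED records (no overlap handling)."""
    total = 0
    for line in bed_lines:
        if line.strip():
            fields = line.split()
            total += int(fields[2]) - int(fields[1])
    return total

# Totals reported above.
noncoding = 139034833
coding = 29701678
genome = 168736537

shortfall = genome - (coding + noncoding)
print(shortfall)  # 26
```

So the two files together fall 26 bp short of the genome length at this stage.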
genomeCoverageBed -i coding.bed -g allgenome
Steps to generate the coding and noncoding regions of Drosophila melanogaster:
sort -k 1,1 -k2,2n allgenomefile > sorted_genome
sort -k 1,1 -k2,2n dm3_exons.txt > sorted_dm3_exons.bed
bedtools complement -i sorted_dm3_exons.bed -g sorted_genome > noncoding.bed
category | length (bp) |
---|---|
coding regions | 29701704 |
noncoding regions | 139034833 |
total | 168736537 |
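To make clear what the complement step computes, here is a pure-Python sketch of the idea: for each chromosome, walk the intervals in start order and emit every gap between 0 and the chromosome size. It illustrates the concept only and is not how bedtools implements it; `complement` and its argument layout are assumptions made for this example.

```python
def complement(intervals, chrom_sizes):
    """Return the regions of each chromosome not covered by any interval.

    intervals:   dict mapping chromosome -> list of (start, end) tuples
    chrom_sizes: dict mapping chromosome -> chromosome length
    """
    gaps = []
    for chrom, size in chrom_sizes.items():
        cursor = 0                                  # first base not yet accounted for
        for start, end in sorted(intervals.get(chrom, [])):
            if start > cursor:
                gaps.append((chrom, cursor, start))  # uncovered stretch before this interval
            cursor = max(cursor, end)                # overlapping intervals are absorbed
        if cursor < size:
            gaps.append((chrom, cursor, size))       # tail of the chromosome
    return gaps
```

Because every base is either inside an interval or inside a gap, the covered length plus the gap length equals the chromosome length, which is the property the table above checks.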
Please find the intronic data and Perl code.
Latest update: data.tar.gz
These are the final steps and the result of the Perl code:
1- Download the chromosome (genome) file ==> sorted_genome.txt
2- Sort allgenomefile.txt into sorted_genome.txt
sort -k 1,1 -k2,2n allgenomefile.txt > sorted_genome.txt
3- Sort dm3_exons.txt into sorted_dm3_exons.bed
sort -k 1,1 -k2,2n dm3_exons.txt > sorted_dm3_exons.bed
4- Create a new genome file with valid entries using Perl
5- Create the noncoding file
bedtools complement -i sorted_dm3_exons.bed -g sorted_genome > noncoding.bed
6- Create the coding file by copying sorted_dm3_exons.bed to coding.bed
7- Add 0 as the start column of the last genome file using Perl code
8- bedtools subtract -a sorted_genome.bed -b noncoding.bed > coding.bed
9- Create the intronic file from the coding file, without the +1/-1 adjustment
10- system("bedtools subtract -a ../data/noncoding.bed -b ../data/intronic.bed > ../data/intergenic.bed");
11- Calculate the length of intronic + intergenic = noncoding length
12- Calculate the length of noncoding + coding = total length of genome
Results:
total length of coding: 29701704
total length of noncoding: 139034833
total length of noncoding and coding: 168736537
total length of genome: 168736537
total length of intronic: 109319658
total length of intergenic: 29715175
total length of intergenic and intronic: 139034833
total length of noncoding: 139034833
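The subtract step that produces the intergenic file can be pictured with the sketch below, which removes every base covered by one interval list from another on a single chromosome. It is a simplified illustration, not the bedtools implementation, and `subtract` is a hypothetical helper name.

```python
def subtract(a, b):
    """Return the parts of the intervals in a that no interval in b covers.

    Both arguments are lists of (start, end) tuples on the same chromosome.
    """
    b = sorted(b)
    result = []
    for start, end in sorted(a):
        cursor = start                       # first base of this interval still kept
        for b_start, b_end in b:
            if b_end <= cursor:
                continue                     # b interval lies entirely before the cursor
            if b_start >= end:
                break                        # b interval lies entirely after this interval
            if b_start > cursor:
                result.append((cursor, b_start))
            cursor = max(cursor, b_end)
            if cursor >= end:
                break
        if cursor < end:
            result.append((cursor, end))
    return result

# Totals reported above: intronic + intergenic should equal noncoding.
print(109319658 + 29715175 == 139034833)  # True
```

If the intronic intervals all lie inside the noncoding intervals, the removed length plus the remaining length equals the noncoding length, which is what the totals confirm.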
This is so awesome! Well done, Amer. The next step is to generate conserved noncoding elements (CNEs) and noncoding spacers of intronic and intergenic regions.
Alrighty, I just started. So far I have only edited the chromInfo file. The code can be found at https://github.com/kacst-bioinfo-lab/labwork/tree/master/code/abdulmalek/drosphila.py
To make things easier and reproducible, I left the input to the user, as seen in lines 7-8. However, I am still stuck on how to automate this with os.system at lines 30-31.
For the algorithm:
1- Add a new column with 0 for all rows in chromInfo.txt, and convert the file into BED format.
2- Sort chromInfo.bed using the system sort | uniq functions.
3- Sort dm3_exons.txt using the sort | uniq functions.
For the output files, please check:
https://github.com/kacst-bioinfo-lab/labwork/tree/master/doc/abdulmalek/chromInfo.bed
https://github.com/kacst-bioinfo-lab/labwork/tree/master/doc/abdulmalek/chromSort.bed
https://github.com/kacst-bioinfo-lab/labwork/tree/master/doc/abdulmalek/dm3_exonsSort.bed
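One possible way to approach the algorithm above, offered as a sketch rather than a description of the linked script: `chrom_info_to_bed` does steps 1 and 2 in pure Python (swapping Python's `sorted` and a `set` in for the system `sort | uniq`), and `sort_bed_file` shows how `subprocess.run` could replace `os.system` for the external sort. Both function names are hypothetical, and the code assumes chromInfo.txt has the chromosome name in column 1 and its size in column 2.

```python
import subprocess

def chrom_info_to_bed(lines):
    """Turn chromInfo rows into sorted, de-duplicated BED lines (chrom, 0, size)."""
    records = set()                          # a set drops duplicate rows
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue                         # skip blank or malformed rows
        records.add((fields[0], 0, int(fields[1])))
    # Tuples sort by chromosome name, then start, then end.
    return [f"{chrom}\t{start}\t{end}" for chrom, start, end in sorted(records)]

def sort_bed_file(in_path, out_path):
    """Run the same sort used in this thread, writing the result to out_path."""
    with open(out_path, "w") as out:
        subprocess.run(
            ["sort", "-k1,1", "-k2,2n", in_path],
            stdout=out,
            check=True,                      # raise if sort exits with an error
        )
```

Passing the command as a list avoids shell quoting problems with user-supplied file names, and `check=True` makes a failed sort stop the script instead of passing silently.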
@MManee @amer1404 @SULTAN-ALHARBI
Cheers,
Print "Hello Group! \n";
I would like to thank Dr. Manee for involving me in the group. I hope to be a positive addition to the group, and I will certainly learn a lot from you.
Hello everyone,
Please find the final result below. Tomorrow I will upload all related files.
Total length of coding: 29701704
Total length of noncoding: 139034833
Total length of noncoding and coding: 168736537
Total length of genome: 168736537
Total length of intronic: 109319658
Total length of intergenic: 29715175
Total length of intergenic and intronic: 139034833
Total length of noncoding: 139034833
*************************************************************************************
Total length of coding CEs: 18219044
Total length of coding spacer: 11482662
Total length of coding CEs and spacer: 29701706
Total length of coding: 29701704
*************************************************************************************
Total length of intergenic CNEs: 2744358
Total length of intergenic spacer: 26970817
Total length of intergenic CNEs and spacer: 29715175
Total length of intergenic: 29715175
*************************************************************************************
Total length of intronic CNEs: 31527418
Total length of intronic spacer: 77792240
Total length of intronic CNEs and spacer: 109319658
Total length of intronic: 109319658
*************************************************************************************
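The sums above can be checked mechanically. The short script below uses only the totals reported in this comment and prints, for each category, how far the conserved-element plus spacer total is from the category total; the dictionary layout is just a convenience for this sketch.

```python
# (conserved elements, spacer, category total), all taken from the results above
reported = {
    "coding": (18219044, 11482662, 29701704),
    "intergenic": (2744358, 26970817, 29715175),
    "intronic": (31527418, 77792240, 109319658),
}

differences = {}
for category, (elements, spacer, total) in reported.items():
    differences[category] = elements + spacer - total
    print(category, differences[category])
# coding 2
# intergenic 0
# intronic 0
```

The intergenic and intronic sums match exactly, while the coding elements plus spacer exceed the coding total by 2 bp.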
It looks perfect. We will need to check the datasets using a genome browser. Well done, Amer.
Hello dear all, :)
Please find the updated files in my folders (data and code).