rr1859 / R.4Cker

MIT License
16 stars 15 forks source link

reduced_genome.sh weird extension addition #56

Closed jvandinter closed 5 years ago

jvandinter commented 5 years ago

Hi, I am trying to follow the 4C pipeline with my own data. When trying to run the bash script to create a reduced genome, all output files gain a very weird character in the file extension, which disrupts the script. You can see an example below.

I downloaded the file and edited it in Notepad++, which should not give any problems.

Here is how it looks: image

Here is the edited script that I used (I replaced ${genome}, ${enzyme}, and ${fl} with the values used as shown. This way, it created the big bed graph.)

!/bin/bash

PBS -l nodes=1:ppn=1,walltime=10:00:00

PBS -N reduced_genome

PBS -l mem=20GB

module load bedtools

module load kent

cd /Reduced/

fragment length

fl = 54

primary enzyme for file name

enzyme = dpnii

genome for file name

genome = hg19

get coordinates for all RE sites in the genome

oligoMatch dpnii.fa hg19.fa hg19_dpnii_restriction_sites_oligomatch.bed

get coordinates of upstream fragments

awk -v fl=$fl '{print $1"\t"$2-fl"\t"$2}' hg19_dpnii_restriction_sites_oligomatch.bed >up.txt

get coordinates of downstream fragments

awk -v fl=$fl '{print $1"\t"$3"\t"$3+fl}' hg19_dpnii_restriction_sites_oligomatch.bed > down.txt

combine up and downstream fragments

cat up.txt down.txt > hg19_dpnii_flanking_sites_54_2.bed

remove any fragments with negative coordinates

awk '{if($2 >= 0 && $3 >=0) print $0}' hg19_dpnii_flanking_sites_54_2.bed | grep -v -E 'random|JH|GL' - | sort -k1,1 -k2,2n | uniq > hg19_dpnii_flanking_sites_54_unique_2.bed

get the sequence of unique flanking coordinates

fastaFromBed -fi hg19.fa -bed hg19_dpnii_flanking_sites_54_unique_2.bed -fo hg19_dpnii_flanking_sequences_54_unique_2.fa

get only sequences from FASTA file

grep -v '^>' hg19_dpnii_flanking_sequences_54_unique_2.fa | sort | uniq -i -u | grep -xF -f - -B 1 hg19_dpnii_flanking_sequences_54_unique_2.fa | grep -v '^--' > hg19_dpnii_flanking_sequences_54_unique.fa

remove unwanted intermediate files

rm up.txt

rm down.txt

make a BED file of unqiue sequences

grep '^>' hg19_dpnii_flanking_sequences_54_unique.fa > hg19_dpnii_flanking_sites_54_unique.bed sed -i 's/>//g' hg19_dpnii_flanking_sites_54_unique.bed sed -i 's/:|-/\t/g' hg19_dpnii_flanking_sites_54_unique.bed

exit 0;

Do you know how this is caused or how I can fix this?

jvandinter commented 5 years ago

I managed to get the script working by just manually inputting it line by line, but the second to last step gives me issues:

get only sequences from FASTA file

grep -v '^>' hg19_dpnii_flanking_sequences_54_unique_2.fa | sort | uniq -i -u | grep -xF -f hg19_dpnii_flanking_sequences_54_unique_2.fa -B 1 hg19_dpnii_flanking_sequences_54_unique_2.fa | grep -v '^--' > hg19_dpnii_flanking_sequences_54_unique.fa

When I run this, it will give a pipe error while running the uniq command.

rr1859 commented 5 years ago

Hi, I see that you have commented out the fl variable but have not changed it in the awk commands

jvandinter commented 5 years ago

Indeed, I caught it as well. Still, I get a uniq: broken pipe error at the step in my second comment. I am very much a beginner with bash, and googling around did not help either.

jvandinter commented 5 years ago

Im still stuck on the command I mentioned in a previous comment, but for a different reason:

get only sequences from FASTA file

grep -v '^>' hg19_dpnii_flanking_sequences_54_unique_2.fa | sort | uniq -i -u | grep -xF -f - -B 1 hg19_dpnii_flanking_sequences_54_unique_2.fa | grep -v '^--' > hg19_dpnii_flanking_sequences_54_unique.fa

I tried to split the pipe like this: grep -v '^>' hg19_dpnii_flanking_sequences_54_unique_2.fa > hg19_dpnii_flanking_sequences_54_unique_3.fa
sort hg19_dpnii_flanking_sequences_54_unique_3.fa > hg19_dpnii_flanking_sequences_54_unique_4.fa uniq -i -u hg19_dpnii_flanking_sequences_54_unique_4.fa > hg19_dpnii_flanking_sequences_54_unique_5.fa grep -xF -f - -B 1 hg19_dpnii_flanking_sequences_54_unique_5.fa | grep -v '^--' > hg19_dpnii_flanking_sequences_54_unique_6.fa grep -v '^--' hg19_dpnii_flanking_sequences_54_unique_6.fa > hg19_dpnii_flanking_sequences_54_unique.fa

When I run the bold command, it will run indefinetely without any progress. I think it has to do with the grep -xF -f - command, as it does not specify which file to take the line from. What file do I need to cross-reference with the fasta file?

Thanks!