jvandinter commented 5 years ago

Hi, I am trying to follow the 4C pipeline with my own data. When trying to run the bash script to create a reduced genome, all output files gain a very weird character in the file extension, which disrupts the script. You can see an example below.

I downloaded the file and edited it in Notepad++, which should not give any problems.

Here is how it looks:

Here is the edited script that I used (I replaced ${genome}, ${enzyme}, and ${fl} with the values used as shown. This way, it created the big bed graph.)

!/bin/bash

PBS -l nodes=1:ppn=1,walltime=10:00:00

PBS -N reduced_genome

PBS -l mem=20GB

module load bedtools

module load kent

cd /Reduced/

fragment length

fl = 54

primary enzyme for file name

enzyme = dpnii

genome for file name

genome = hg19

get coordinates for all RE sites in the genome

oligoMatch dpnii.fa hg19.fa hg19_dpnii_restriction_sites_oligomatch.bed

get coordinates of upstream fragments

awk -v fl=$fl '{print $1"\t"$2-fl"\t"$2}' hg19_dpnii_restriction_sites_oligomatch.bed >up.txt

get coordinates of downstream fragments

awk -v fl=$fl '{print $1"\t"$3"\t"$3+fl}' hg19_dpnii_restriction_sites_oligomatch.bed > down.txt

combine up and downstream fragments

cat up.txt down.txt > hg19_dpnii_flanking_sites_54_2.bed

remove any fragments with negative coordinates

awk '{if($2 >= 0 && $3 >=0) print $0}' hg19_dpnii_flanking_sites_54_2.bed | grep -v -E 'random|JH|GL' - | sort -k1,1 -k2,2n | uniq > hg19_dpnii_flanking_sites_54_unique_2.bed

get the sequence of unique flanking coordinates

fastaFromBed -fi hg19.fa -bed hg19_dpnii_flanking_sites_54_unique_2.bed -fo hg19_dpnii_flanking_sequences_54_unique_2.fa

get only sequences from FASTA file

grep -v '^>' hg19_dpnii_flanking_sequences_54_unique_2.fa | sort | uniq -i -u | grep -xF -f - -B 1 hg19_dpnii_flanking_sequences_54_unique_2.fa | grep -v '^--' > hg19_dpnii_flanking_sequences_54_unique.fa

remove unwanted intermediate files

rm up.txt

rm down.txt

make a BED file of unqiue sequences

grep '^>' hg19_dpnii_flanking_sequences_54_unique.fa > hg19_dpnii_flanking_sites_54_unique.bed sed -i 's/>//g' hg19_dpnii_flanking_sites_54_unique.bed sed -i 's/:|-/\t/g' hg19_dpnii_flanking_sites_54_unique.bed

exit 0;

Do you know how this is caused or how I can fix this?

jvandinter commented 5 years ago

I managed to get the script working by just manually inputting it line by line, but the second to last step gives me issues:

get only sequences from FASTA file

grep -v '^>' hg19_dpnii_flanking_sequences_54_unique_2.fa | sort | uniq -i -u | grep -xF -f hg19_dpnii_flanking_sequences_54_unique_2.fa -B 1 hg19_dpnii_flanking_sequences_54_unique_2.fa | grep -v '^--' > hg19_dpnii_flanking_sequences_54_unique.fa

When I run this, it will give a pipe error while running the uniq command.

rr1859 commented 5 years ago

Hi, I see that you have commented out the fl variable but have not changed it in the awk commands

jvandinter commented 5 years ago

Indeed, I caught it as well. Still, I get a uniq: broken pipe error at the step in my second comment. I am very much a beginner with bash, and googling around did not help either.

jvandinter commented 5 years ago

Im still stuck on the command I mentioned in a previous comment, but for a different reason:

get only sequences from FASTA file

grep -v '^>' hg19_dpnii_flanking_sequences_54_unique_2.fa | sort | uniq -i -u | grep -xF -f - -B 1 hg19_dpnii_flanking_sequences_54_unique_2.fa | grep -v '^--' > hg19_dpnii_flanking_sequences_54_unique.fa

I tried to split the pipe like this: grep -v '^>' hg19_dpnii_flanking_sequences_54_unique_2.fa > hg19_dpnii_flanking_sequences_54_unique_3.fa
sort hg19_dpnii_flanking_sequences_54_unique_3.fa > hg19_dpnii_flanking_sequences_54_unique_4.fa uniq -i -u hg19_dpnii_flanking_sequences_54_unique_4.fa > hg19_dpnii_flanking_sequences_54_unique_5.fa grep -xF -f - -B 1 hg19_dpnii_flanking_sequences_54_unique_5.fa | grep -v '^--' > hg19_dpnii_flanking_sequences_54_unique_6.fa grep -v '^--' hg19_dpnii_flanking_sequences_54_unique_6.fa > hg19_dpnii_flanking_sequences_54_unique.fa

When I run the bold command, it will run indefinetely without any progress. I think it has to do with the grep -xF -f - command, as it does not specify which file to take the line from. What file do I need to cross-reference with the fasta file?

Thanks!

rr1859 / R.4Cker

reduced_genome.sh weird extension addition #56

!/bin/bash

PBS -l nodes=1:ppn=1,walltime=10:00:00

PBS -N reduced_genome

PBS -l mem=20GB

module load bedtools

module load kent

cd /Reduced/

fragment length

fl = 54

primary enzyme for file name

enzyme = dpnii

genome for file name

genome = hg19

get coordinates for all RE sites in the genome

get coordinates of upstream fragments

get coordinates of downstream fragments

combine up and downstream fragments

remove any fragments with negative coordinates

get the sequence of unique flanking coordinates

get only sequences from FASTA file

remove unwanted intermediate files

rm up.txt

rm down.txt

make a BED file of unqiue sequences

get only sequences from FASTA file

get only sequences from FASTA file