rr1859 / R.4Cker

MIT License
16 stars 15 forks source link

reduced_genome.sh error on #get only sequences from FASTA file #31

Open trotos opened 7 years ago

trotos commented 7 years ago

Hi, there seem to be 2 issues regarding the following step:

#get only sequences from FASTA file
grep -v '^>' ${genome}_${enzyme}_flanking_sequences_${fl}_unique_2.fa | sort | uniq -i -u | grep -xF -f - -B 1 ${genome}_${enzyme}_flanking_sequences_${fl}_unique_2.fa | grep -v '^--' > ${genome}_${enzyme}_flanking_sequences_${fl}_unique.fa

1st: while the removing of the ">" lines, sorting and getting the unique files works and produces something like the following: (grep -v '^>' ${genome}_${enzyme}_flankingsequences${fl}_unique_2.fa | sort | uniq -i -u)

ggccaaaacaaaggggttacagggcacgtgccctgtaacagaaagcagtc
ggccaaaacaaaggggttataggacccctgcaagtctgaaatccagaagg
ggccaaaacaaagtggctacaggccccatgtgagtccaaaatccagtagg
ggccaaaacaaatgggctattggccccatgtaagtctgaaacccaacagg
ggccaaaacaaattctttacttatgctaattttgccttagacttttattt
GGCCAAAACAAGGAAACACTGAGAAGAGACGCCTGTTTGCCAGGTGCTTG
ggccaaaacaagggggctacaggcccatataagtcagcaatccagcaggg
GGCCAAAACAAGTTCTTTTTCCATTATTGAATACTTTTTGAGCTATAAGT
GGCCAAAACAATTCTTAAATAATTACTGCATTTTTTTTCAAAATGTAAGA
ggccaaaacacctaacccctctgttctctctgcctcagttgactcatgaa
GGCCAAAACAGAAGAGTCTGGCTGTCCCATCTCTGTATCTTTCAGGATCC

the following command ( grep -xF -f - -B 1 ${genome}_${enzyme}_flankingsequences${fl}_unique_2.fa ) produces a zero bytes file. Could you please see if there is an error?

2nd: why you remove the duplicates? this is a PCR based technique and it is expected for the file to be full of PCR duplicates.

Thank you for your answer in advance

rr1859 commented 7 years ago

Hi,

1)You are missing an underscore here - grep -xF -f - -B 1 ${genome}_${enzyme}flanking_sequences${fl}_unique2.fa. It should be : grep -xF -f - -B 1 ${genome}${enzyme}flankingsequences${fl}_unique_2.fa. Please try that and let me now.

2) the duplicates that are being removed here are the fragments that we are mapping to (short fragments adjacent to the RE sites. If there are duplicate sequences in the genome we cannot properly assign our interactions to these regions.

mindfaraway commented 7 years ago

I got the same problem when running reduced_genome.sh. I only changed line 13 to fl=100, 15 to enzyme=dpnii. After the program #get only sequences from FASTA file, it created a 0 byte file mm10_dpnii_flanking_sequences_100_unique.fa. Then, creating a 0 byte file mm10_dpnii_flanking_sites_100_unique.bed. I ran the script with MacOS Sierra 10.12.3 by Terminal command-line and I have no idea why the error occurs.

Any suggestion?

rr1859 commented 7 years ago

Do you have oligoMatch and bedtools installed?

mindfaraway commented 7 years ago

Hi, I am now using another MAC machine to run this program. It return an Error message: "grep: -: No such file or directory" and create a 0 byte file "mm10_dpnii_flanking_sequences_50unique.fa" The error might be caused by the dash symbol in the command "grep -xF -f - -B 1 ${genome}${enzyme}_flankingsequences${fl}_unique_2.fa" in line 32. I guess there may have some incompatible command format between linux and MAC terminal when processing the grep command. How can I solve this problem?

mindfaraway commented 7 years ago

Okey... I replaced line 32 with the code from the issues "reducedgenome.sh error - grep: memory exhausted" Here's the code I use: paste - - < ${genome}${enzyme}_flankingsequences${fl}_unique2.fa | awk '{a[toupper($2)]++; b[$2]=$1}; END {for(n in a) if (a[n] == 1) print b[n]"\n"n}' | sort -k1,1 -k2,2n -k3,3n | awk '{print $1":"$2"-"$3"\n"$4}' > ${genome}${enzyme}_flankingsequences${fl}_unique.fa

Now it generated a final file which is about 143 MB. Seems to be the right way.....

By the way, I also edit the code at line 40 and 41 due to that the "sed" code in my MAC machine requires a backup file name for creating backup file. I changed it into:

sed -i .bak 's/>//g' ${genome}_${enzyme}_flankingsites${fl}unique.bed rm ${genome}${enzyme}_flankingsites${fl}unique.bed.bak sed -i .bak 's/:|-/\t/g' ${genome}${enzyme}_flankingsites${fl}unique.bed rm ${genome}${enzyme}_flankingsites${fl}_unique.bed.bak