rr1859 / R.4Cker

MIT License
16 stars 15 forks source link

reduced genome script: sed extra character #58

Open fiwong opened 5 years ago

fiwong commented 5 years ago

Hi I am at the last step in generating the reduced genome. I am trying to get bed file from the fa file using

make a BED file of unqiue sequences

grep '^>' ${genome}_${enzyme}_flankingsequences${fl}unique.fa > ${genome}${enzyme}_flankingsites${fl}unique.bed sed -i 's/>//g' ${genome}${enzyme}_flankingsites${fl}unique.bed sed -i 's/:|-/\t/g' ${genome}${enzyme}_flankingsites${fl}_unique.bed

i got error saying sed: extra character. I tried using -i.bak but the format of the bed file is not right

chr10:100002921-100002942:-

I am using Mac, not LINUS so i think the issue is specific to Mac.

Anyone know?

rr1859 commented 5 years ago

Hi I think the -i option may not work on a Mac. Try sed -i -e

fiwong commented 5 years ago

Hi Ramya, Thank you for getting back. It is able to generate a bed file but still with this format

chr10:100002921-100002942:-

…………………..

Elissa

From: Ramya Raviram notifications@github.com Sent: Thursday, August 15, 2019 2:01 PM To: rr1859/R.4Cker R.4Cker@noreply.github.com Cc: Wong, Wai Pung E./Sloan Kettering Institute wongw1@mskcc.org; Author author@noreply.github.com Subject: [EXTERNAL] Re: [rr1859/R.4Cker] reduced genome script: sed extra character (#58)

Hi I think the -i option may not work on a Mac. Try sed -i -e

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://github.com/rr1859/R.4Cker/issues/58?email_source=notifications&email_token=AKFFRZX67XI22O6JW6BB7FLQEWKVDA5CNFSM4IL6MAGKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD4MRHZI#issuecomment-521737189, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AKFFRZTQ4HFAYSWX4JB7G4DQEWKVDANCNFSM4IL6MAGA.

=====================================================================

 Please note that this e-mail and any files transmitted from
 Memorial Sloan Kettering Cancer Center may be privileged, confidential,
 and protected from disclosure under applicable law. If the reader of
 this message is not the intended recipient, or an employee or agent
 responsible for delivering this message to the intended recipient,
 you are hereby notified that any reading, dissemination, distribution,
 copying, or other use of this communication or any of its attachments
 is strictly prohibited.  If you have received this communication in
 error, please notify the sender immediately by replying to this message
 and deleting this message, any attachments, and all copies and backups
 from your computer.
rr1859 commented 5 years ago

Can you try this instead? sed -i -e -E "s/:|-/$(printf '\t')/g" ${genome}${enzyme}flanking_sites${fl}_unique.bed

fiwong commented 5 years ago

So I did (to get the fa file, I use one of the solution in the thread because the original did not give me a file)

grep -v '^>' ${genome}_${enzyme}_flankingsequences${fl}_unique2.fa | sort | uniq -i -u | grep -xF -f - -B 1 ${genome}${enzyme}_flankingsequences${fl}_unique2.fa | grep -v '^--' > ${genome}${enzyme}_flankingsequences${fl}_unique.fa

I use

paste - - < hg19_all_ecori_flanking_sites_unique.fa | awk \ '{a[toupper($2)]++; b[$2]=$1}; END \ {for(n in a) if (a[n] == 1) print b[n]"\n"n}' \ | sort -k1,1 -k2,2n -k3,3n \ | awk '{print $1":"$2"-"$3"\n"$4}' \

hg19_all_ecori_flanking_sites_unique_final.fa

make a BED file of unqiue sequences

grep '^>' hg19_all_ecori_flanking_sites_unique_final.fa > hg19_all_ecori_flanking_sites_unique_final.bed sed -i -e -E "s/:|-/$(printf '\t')/g" hg19_all_ecori_flanking_sites_unique_final.bed

the bed now is

chr10 100002921 100002942

it looks better now, but is there a way to remove > in front of chr?

From: Ramya Raviram notifications@github.com Sent: Thursday, August 15, 2019 2:54 PM To: rr1859/R.4Cker R.4Cker@noreply.github.com Cc: Wong, Wai Pung E./Sloan Kettering Institute wongw1@mskcc.org; Author author@noreply.github.com Subject: [EXTERNAL] Re: [rr1859/R.4Cker] reduced genome script: sed extra character (#58)

Can you try this instead? sed -i -e -E "s/:|-/$(printf '\t')/g" ${genome}${enzyme}flanking_sites${fl}_unique.bed

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://github.com/rr1859/R.4Cker/issues/58?email_source=notifications&email_token=AKFFRZVO2XCNGYNWFPOFCKTQEWQ4DA5CNFSM4IL6MAGKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD4MVSZY#issuecomment-521754983, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AKFFRZUH75OMLKJ4SB23W3LQEWQ4DANCNFSM4IL6MAGA.

=====================================================================

 Please note that this e-mail and any files transmitted from
 Memorial Sloan Kettering Cancer Center may be privileged, confidential,
 and protected from disclosure under applicable law. If the reader of
 this message is not the intended recipient, or an employee or agent
 responsible for delivering this message to the intended recipient,
 you are hereby notified that any reading, dissemination, distribution,
 copying, or other use of this communication or any of its attachments
 is strictly prohibited.  If you have received this communication in
 error, please notify the sender immediately by replying to this message
 and deleting this message, any attachments, and all copies and backups
 from your computer.
rr1859 commented 5 years ago

Can you add on a sed command to remove the '>' ?

grep '^>' hg19_all_ecori_flanking_sites_unique_final.fa | sed 's/>//g' - > hg19_all_ecori_flanking_sites_unique_final.bed sed -i -e -E "s/:|-/$(printf '\t')/g" hg19_all_ecori_flanking_sites_unique_final.bed

fiwong commented 5 years ago

I think sed 's/>//g' does not work for mac?

From: Ramya Raviram notifications@github.com Sent: Thursday, August 15, 2019 3:49 PM To: rr1859/R.4Cker R.4Cker@noreply.github.com Cc: Wong, Wai Pung E./Sloan Kettering Institute wongw1@mskcc.org; Author author@noreply.github.com Subject: [EXTERNAL] Re: [rr1859/R.4Cker] reduced genome script: sed extra character (#58)

Can you add on a sed command to remove the '>' ?

grep '^>' hg19_all_ecori_flanking_sites_unique_final.fa | sed 's/>//g' - > hg19_all_ecori_flanking_sites_unique_final.bed sed -i -e -E "s/:|-/$(printf '\t')/g" hg19_all_ecori_flanking_sites_unique_final.bed

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://github.com/rr1859/R.4Cker/issues/58?email_source=notifications&email_token=AKFFRZQX4DOYEMD4HC64YBTQEWXLHA5CNFSM4IL6MAGKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD4MZ7FA#issuecomment-521772948, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AKFFRZQ4XPAINWRAZ53JWATQEWXLHANCNFSM4IL6MAGA.

=====================================================================

 Please note that this e-mail and any files transmitted from
 Memorial Sloan Kettering Cancer Center may be privileged, confidential,
 and protected from disclosure under applicable law. If the reader of
 this message is not the intended recipient, or an employee or agent
 responsible for delivering this message to the intended recipient,
 you are hereby notified that any reading, dissemination, distribution,
 copying, or other use of this communication or any of its attachments
 is strictly prohibited.  If you have received this communication in
 error, please notify the sender immediately by replying to this message
 and deleting this message, any attachments, and all copies and backups
 from your computer.
rr1859 commented 5 years ago

Hmm ok then maybe just add another sed -i command

Ramyas-iMac:~ ramyaraviram$ cat test.txt

t:t-b Ramyas-iMac:~ ramyaraviram$ sed -i -e -E "s/:|-/$(printf '\t')/g" test.txt Ramyas-iMac:~ ramyaraviram$ cat test.txt t t b Ramyas-iMac:~ ramyaraviram$ sed -i -e 's/>//g' test.txt Ramyas-iMac:~ ramyaraviram$ cat test.txt t t b

fiwong commented 5 years ago

Thx it finally works by using this

grep '^>' hg19_all_ecori_flanking_sites_unique_final.fa > hg19_all_ecori_flanking_sites_unique_final.bed sed -i -e "s/>//g" hg19_all_ecori_flanking_sites_unique_final.bed sed -i -e -E "s/:|-/$(printf '\t')/g" hg19_all_ecori_flanking_sites_unique_final.bed

From: Ramya Raviram notifications@github.com Sent: Thursday, August 15, 2019 4:14 PM To: rr1859/R.4Cker R.4Cker@noreply.github.com Cc: Wong, Wai Pung E./Sloan Kettering Institute wongw1@mskcc.org; Author author@noreply.github.com Subject: [EXTERNAL] Re: [rr1859/R.4Cker] reduced genome script: sed extra character (#58)

Hmm ok then maybe just add another sed -i command

Ramyas-iMac:~ ramyaraviram$ cat test.txt

t:t-b Ramyas-iMac:~ ramyaraviram$ sed -i -e -E "s/:|-/$(printf '\t')/g" test.txt Ramyas-iMac:~ ramyaraviram$ cat test.txt t t b Ramyas-iMac:~ ramyaraviram$ sed -i -e 's/>//g' test.txt Ramyas-iMac:~ ramyaraviram$ cat test.txt t t b

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://github.com/rr1859/R.4Cker/issues/58?email_source=notifications&email_token=AKFFRZWO2I2RAAD4WRVFEPTQEW2JDA5CNFSM4IL6MAGKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD4M35QI#issuecomment-521780929, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AKFFRZUBHEDPV6KQ2A5INO3QEW2JDANCNFSM4IL6MAGA.

=====================================================================

 Please note that this e-mail and any files transmitted from
 Memorial Sloan Kettering Cancer Center may be privileged, confidential,
 and protected from disclosure under applicable law. If the reader of
 this message is not the intended recipient, or an employee or agent
 responsible for delivering this message to the intended recipient,
 you are hereby notified that any reading, dissemination, distribution,
 copying, or other use of this communication or any of its attachments
 is strictly prohibited.  If you have received this communication in
 error, please notify the sender immediately by replying to this message
 and deleting this message, any attachments, and all copies and backups
 from your computer.
fiwong commented 5 years ago

Hi Ramya, I thought I got the fa and bed for bowtie-build but I got error about the fa file. Then I open the file and realize the fa file generate using

paste - - < hg19_all_ecori_flanking_sites_unique.fa | awk \ '{a[toupper($2)]++; b[$2]=$1}; END \ {for(n in a) if (a[n] == 1) print b[n]"\n"n}' \ | sort -k1,1 -k2,2n -k3,3n \ | awk '{print $1":"$2"-"$3"\n"$4}' \

hg19_all_ecori_flanking_sites_unique_final.fa

Is not the normal fa format

chr14:85801193-85801214:-

With all the sequence at the bottom

I tried your original script many times but it didn’t give me any file

grep -v '^>' ${genome}_${enzyme}_flankingsequences${fl}_unique2.fa | sort | uniq -i -u | grep -xF -f - -B 1 ${genome}${enzyme}_flankingsequences${fl}_unique2.fa | grep -v '^--' > ${genome}${enzyme}_flankingsequences${fl}_unique.fa

Is there any way to get around?

From: Ramya Raviram notifications@github.com Sent: Thursday, August 15, 2019 4:14 PM To: rr1859/R.4Cker R.4Cker@noreply.github.com Cc: Wong, Wai Pung E./Sloan Kettering Institute wongw1@mskcc.org; Author author@noreply.github.com Subject: [EXTERNAL] Re: [rr1859/R.4Cker] reduced genome script: sed extra character (#58)

Hmm ok then maybe just add another sed -i command

Ramyas-iMac:~ ramyaraviram$ cat test.txt

t:t-b Ramyas-iMac:~ ramyaraviram$ sed -i -e -E "s/:|-/$(printf '\t')/g" test.txt Ramyas-iMac:~ ramyaraviram$ cat test.txt t t b Ramyas-iMac:~ ramyaraviram$ sed -i -e 's/>//g' test.txt Ramyas-iMac:~ ramyaraviram$ cat test.txt t t b

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://github.com/rr1859/R.4Cker/issues/58?email_source=notifications&email_token=AKFFRZWO2I2RAAD4WRVFEPTQEW2JDA5CNFSM4IL6MAGKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD4M35QI#issuecomment-521780929, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AKFFRZUBHEDPV6KQ2A5INO3QEW2JDANCNFSM4IL6MAGA.

=====================================================================

 Please note that this e-mail and any files transmitted from
 Memorial Sloan Kettering Cancer Center may be privileged, confidential,
 and protected from disclosure under applicable law. If the reader of
 this message is not the intended recipient, or an employee or agent
 responsible for delivering this message to the intended recipient,
 you are hereby notified that any reading, dissemination, distribution,
 copying, or other use of this communication or any of its attachments
 is strictly prohibited.  If you have received this communication in
 error, please notify the sender immediately by replying to this message
 and deleting this message, any attachments, and all copies and backups
 from your computer.
rr1859 commented 5 years ago

Please give this a try awk '{if(NR%2){printf("%s\t",$1)}else{print}}' ${genome}_${enzyme}_flankingsequences${fl}_unique2.fa | sort -k2,2| awk -v IGNORECASE=1 '{if($2==SEQ){gsub(">","",$1);ID=ID"|"$1;counter++} else{if(SEQ!="" && counter == 1){printf("%s\n%s\n", ID,SEQ)}SEQ=$2;ID=$1;counter = 1} }END{printf("%s\n%s\n", ID,SEQ)}' > ${genome}${enzyme}_flankingsequences${fl}_unique.fa

fiwong commented 5 years ago

Hi Ramya,

I tried awk '{if(NR%2){printf("%s\t",$1)}else{print}}' ${genome}_${enzyme}flanking_sequences${fl}unique_2.fa | sort -k2,2| awk -v IGNORECASE=1 '{if($2==SEQ){gsub(">","",$1);ID=ID"|"$1;counter++} else{if(SEQ!="" && counter == 1){printf("%s\n%s\n", ID,SEQ)}SEQ=$2;ID=$1;counter = 1} }END{printf("%s\n%s\n", ID,SEQ)}' > ${genome}${enzyme}flanking_sequences${fl}_unique.fa

but it gives me the same file size (and row number) as unique_2.fa, which i think did not remove duplicate.

I also tried awk '$1~/^>/{seq=$1; next};{$1=toupper($1)};a[$1]>0{a[$1]-=1;next};{a[$1]=seq};END{for(x in a){if(a[x]>0){print a[x];print x}}}' hg19_all_ecori_flanking_sites_unique.fa > hg19_all_ecori_flanking_sites_unique_final.fa

same thing. did not remove duplicate.

The error for the original code is grep -v '^>' hg19_all_ecori_flanking_sites_unique.fa | sort | uniq -i -u | grep -xF -f hg19_all_ecori_flanking_sites_unique.fa | grep -v '^--' > hg19_all_ecori_flanking_sites_unique_final.fa

no -B 1 directory so i tried removing grep -B 1 flag

but the program just go on without generating a file.

rr1859 commented 5 years ago

Is it possible that you do not have duplicate sequences? What is the fragment length you are using?Can you try the code with a small test FASTA file that does have duplicates? I tested it on a Mac and the IGNORECASE option does not seem to work but it still removes exact duplicates.