reduced_genome.sh error - grep: memory exhausted

ml31k commented 7 years ago

Hi,

I get an error running the reduce_genome.sh script on the full hg19 genome (seems to have issues with the line which selects unique fasta sequences). Increasing the session memory doesn't seem to help (as much as 96g); it might be a built in limitation in grep?

Ended up replacing:

grep -v '^>' ${genome}_${enzyme}_flankingsequences${fl}_unique_2.fa | sort | uniq -i -u | grep -xF -f hg19_dpnii_flanking_sequences_100_unique2.fa.temp -B 1 ${genome}${enzyme}_flankingsequences${fl}_unique2.fa | grep -v '^--' > ${genome}${enzyme}_flankingsequences${fl}_unique.fa

with an awk liner:

awk '$1~/^>/{seq=$1; next};{$1=toupper($1)};a[$1]>0{a[$1]-=1;next};{a[$1]=seq};END{for(x in a){if(a[x]>0){print a[x];print x}}}' hg19_dpnii_flanking_sequences_100_unique_2.fa > hg19_dpnii_flanking_sequences_100_unique.fa

which seems to fix things...

rr1859 commented 7 years ago

Hi thanks a lot for re-writing this to be less memory intensive. I have never had problems running this script but I know other people have. Would you be ok with me giving this line to people if they are having the same problems?

ml31k commented 7 years ago

Certainly! I made a mistake earlier copying the liner. It's now fixed in the above comment

rr1859 commented 7 years ago

Thanks! :)

daler commented 7 years ago

For me, that awk one-liner doesn't give the same results as the original command. Full test with example data below, and an alternative approach that avoids grep and does match results from the original comand.

39MB example data here: https://helix.nih.gov/~dalerr/dm6_DpnII_25mer_flanking.fa

Setup

wget https://helix.nih.gov/~dalerr/dm6_DpnII_25mer_flanking.fa
INPUT=dm6_DpnII_25mer_flanking.fa
OUTPUT=dm6_DpnII_25mer_flanking_unique.fa

Option 1

As used in reduce_genome.sh; ~12 seconds

grep -v '^>' $INPUT \
  | sort \
  | uniq -i -u \
  | grep -xF -f - -B 1 $INPUT \
  | grep -v '^--' > ${OUTPUT}.1

Option 2

From @mimi31k in above post; ~4 seconds

awk \
'$1~/^>/{seq=$1; next};\
{$1=toupper($1)};\
a[$1]>0{a[$1]-=1;next};\
{a[$1]=seq};\
END\
{for(x in a){if(a[x]>0){print a[x];print x}}}' \
$INPUT > ${OUTPUT}.2

Option 3

My alternative attempt. Uses 2 arrays; a to keep track of the sequence count and b to keep track of the header for the last-seen sequence. Since we only care about cases where there's only one sequence, it's OK that the b array entries get overwritten for subsequent sequence occurrences. Plus it also keeps memory usage low. ~3 seconds

paste - - < $INPUT | awk \
  '{a[toupper($2)]++; b[$2]=$1}; END \
  {for(n in a) if (a[n] == 1) print b[n]"\n"n}' \
  > ${OUTPUT}.3

Comparisons

Since the awk methods do not retain the sorting of the original, first I'm sorting the files just for comparison purposes:

sort ${OUTPUT}.1 > out1.sorted
sort ${OUTPUT}.2 > out2.sorted
sort ${OUTPUT}.3 > out3.sorted

Then compare them. Option 2 is reporting about 1% extra sites compared to option 1, but option 1 and option 3 output match.

diff out1.sorted out3.sorted | wc -l  # 0

# Compare option 1 and option 2
# in both:
comm out1.sorted out2.sorted -1 -2 | wc -l  # 1324242

# only in option 2
comm out1.sorted out2.sorted -1 -3 | wc -l  # 15638

# only in option 1
comm out1.sorted out2.sorted -2 -3 | wc -l  # 0

So I think option 3 would be a better approach, as long as it's OK that the de-duped sequences have a different order from the original.

daler commented 7 years ago

Update to get sorted output. This runs slower than the original, at ~20s.

paste - - < $INPUT | awk \
  '{a[toupper($2)]++; b[$2]=$1}; END \
  {for(n in a) if (a[n] == 1) print b[n]"\n"n}' \
  | sort -k1,1 -k2,2n -k3,3n \
  | awk '{print $1":"$2"-"$3"\n"$4}' \
  > ${OUTPUT}.3

rr1859 commented 5 years ago

I had some issues with the script above so I re-wrote this: awk '{if(NR%2){printf("%s\t",$1)}else{print}}' ${genome}_${enzyme}_flankingsequences${fl}_unique2.fa | sort -k2,2| awk -v IGNORECASE=1 '{if($2==SEQ){gsub(">","",$1);ID=ID"|"$1;counter++} else{if(SEQ!="" && counter == 1){printf("%s\n%s\n", ID,SEQ)}SEQ=$2;ID=$1;counter = 1} }END{printf("%s\n%s\n", ID,SEQ)}' > ${genome}${enzyme}_flankingsequences${fl}_unique.fa

rr1859 / R.4Cker