ruiguo-bio / replong

source code of the paper "RepLong - de novo repeat discovery from long reads"
16 stars 2 forks source link

awk fails with "illegal reference to array s" #5

Open rsharris opened 6 years ago

rsharris commented 6 years ago

The awk command at line 379 in processRead.sh is failing on my data. The line begins with cat new.fa | awk '{if (ind It reports "illegal reference to array s" and then the script merrily continues and eventually produces two empty files (new.fa and result.fa).

No clue what has gone wrong but my best guess is that the awk command is not getting the type of input it expects. Perhaps new.fa is empty? I've tried to figure out what directory new.fa is in at that point in the script, or all.fa or merged.new.bed, but so far I haven't figured it out.

What I'm actually trying to do is evaluate replong to see if it is comparable to a tool that I am in the process of writing a manuscript for. To that end, I've created a fake data set containing tandem repeats and simulated pacbio reads (in github/makovalab-psu/FakeTRGenome). I had originally been trying your drosophila data but after continually getting empty output files I decided I'd be better off using a smaller dataset in which I knew the ground truth.

Since the awk command has been compacted to fit in a single line, awk is no more specific about where the error occurs than that it occurs in line 1. I've parsed through that awk line and it looks like the first use of s is where it is created by split. I don't know awk well enough to know if there are conditions where split will fail to create an array -- e.g. if the first argument is empty. I'm somewhat confused by the use of an empty string as the separator but I presume the intent is to split $0 into an array of characters. My best guess is the failure is in if ((s[i]=="N") because s is not an array, but I can't see how s can't be an array at that point. I will note that in my experience a string in awk behaves like an array without reporting to splitting it (i.e. length(s) and s[i] ought to work even if you don't do the split).

It looks like the output of that awk command gets piped into a file which is then used to overwrite the awk input file. So after the point of failure the evidence of what caused the failure is neatly destroyed.

At this point my only option seems to be to continue littering your bash scripts with echo statements and things to copy the current state/evidence to some safe place so I can figure out what causes the problem.

rsharris commented 6 years ago

I added a line to processReads to save awk's input (which had 61 sequences and 337Kbp). And then manually ran it through that awk command.

The awk command works on one of my machines, but not on the machine where I'm running replong. It looks like a problem with the awk on that machine, which is actually mawk.

I was wrong about awk treating strings as arrays (I must have been thinking of python), but I did come up with some code that avoids the array and works on both my machines (and gives the same result as yours did when it worked):

    cat new.fa \
      | awk '{
             if (index($0,">")==1)
               print $0;
             else
               {
               if (length($0)==0)
                 { printf("\n");}
               else
                 {
                 notN = 0;
                 for (i=1; i <= length($0); i++)
                   {
                   ch = substr($0,i,1);
                   if ((ch=="N")||(ch=="n"))
                     {continue;}
                   notN = 1;
                   printf("%s", ch)
                   }
                 if (notN==1)
                   { printf("\n");}
                 }
               }
             }' \
      > noN.fa   

It looks like the whole point of that awk command is to copy the file and remove any Ns. There's bound to be an easier way to do that but at the moment I'm too frazzled to figure it out. cat new.fa | tr -d "N" > noN.fa would do it but would also mangle any sequence names that contain an N, and I guess it could leave some blank lines that you may not want.

bzvew commented 6 years ago

Sorry for that misunderstanding. If you have better ideas, I'd like to add that in RepLong! I test RepLong on linux gnu platform, and it's not tested comprehensively. I just make it work, and the code is not elegant. If anyone wants to improve that, I'm very pleased for that.

rsharris commented 6 years ago

Ironically, the machine your awk statement worked on was a Mac, and the machine it failed on was linux (some version of Debian).

For some reason, the linux machine uses mawk for awk. And a very old mawk, at that. With some searching, I found that mawk didn't originally implement length(array); maybe it still doesn't, I don't know. See https://github.com/ThomasDickey/original-mawk/issues/1 According to that thread, support for length(array) in awk-like programs is not standard posix. Which kinda makes sense, because in the awk language arrays are associative, so while length(array) could tell you the number of keys, there's no guarantee that the keys are consecutive from 1 to N.

Knowing that, I was able to get your command to work on both of my machines simply by changing length(s) to length($0).

I just make it work, and the code is not elegant.

Biggest issue for me has been failure handling. I understand it's very difficult to make a bash script that is robust to errors (I certainly don't have that capability myself).