onetrueawk / awk

for(i in array) not sequential #171

Closed hallenstal closed 1 year ago

hallenstal commented 1 year ago

On macOS, awk version 20200816:

    echo "one;three;54;3;86;seven" | awk '{split($0,a,";");for(i in a){print "a[" i "]=" a[i] }}'
    a[2]=three
    a[3]=54
    a[4]=3
    a[5]=86
    a[6]=seven
    a[1]=one

aksr commented 1 year ago

Nice catch, can confirm (5e49ea4).

arnoldrobbins commented 1 year ago

awk purposely does not define the order in which a for (i in array) loop goes through the array. You cannot depend on it to be "sequential", and different implementations will go through the loop in different orders. If you require sequential traversal, do it like so:

n = length(array)
for (i = 1; i <= n; i++)
   print array[i]    # or any other use of array[i]

This should only be used when you know for sure that the indices are sequential (such as with split()) since indices can be strings, or even be missing.
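
For the split() case from the original report, a minimal sketch of the same idea could look like this. It relies on split() returning the number of pieces, so the return value is used as n instead of a separate length() call; the shell variable name `ordered` is only illustrative.

```shell
# Sketch: walk the pieces produced by split() in index order.
ordered=$(echo "one;three;54;3;86;seven" | awk '{
    n = split($0, a, ";")          # a[1] .. a[n] are populated
    for (i = 1; i <= n; i++)       # numeric loop: the order is explicit
        print "a[" i "]=" a[i]
}')
echo "$ordered"
```

Because the loop counts from 1 to n itself, the output order no longer depends on how the implementation stores the array.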

Closing this issue.

hallenstal commented 1 year ago

Well, you could of course have different opinions on this. When an array is indexed by an integer sequence, a good design would take the elements in order. Of course, there are always workarounds. BR Magnus

ryenus commented 7 months ago

Possible to revisit the decision here?

I'd argue for several points:

arnoldrobbins commented 7 months ago

Hello.

Possible to revisit the decision here?

Not really, no. The array management isn't going to change.

I'd argue for several points:

  • Given that awk arrays are actually associative, like maps, the keys could be numbers, strings, or even a series of numbers with skipped values (holes). It is therefore preferable to use for(var in array) to loop over an array.

So this is arguing against ordered traversal of the array.

  • Making things worse, the original awk doesn't even provide a builtin array length function.

If by "original" you mean this version, you are incorrect. It has supported length(array) since January of 2002, over 20 years.

To be able to iterate through a properly indexed array incrementally, one has to first loop through the array using for(var in array) to count the array length, then loop the array again with for(i=0;i<length;i++), to get the order right. This also applies to some other awk distributions.

This isn't necessary. If you know that an array is indexed from 1 to N, you can do this:

for (i = 1; i in array; i++) ...
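
As a small sketch of that idiom (the array contents and the shell variable name `contiguous` are made up for illustration): the `i in array` test checks membership without creating the element, and the loop stops at the first missing index, so it only suits arrays that really are indexed 1 to N without holes.

```shell
# Sketch: the loop ends at the first index that is not in the array.
contiguous=$(awk 'BEGIN {
    a[1] = "x"; a[2] = "y"; a[3] = "z"; a[5] = "gap"   # index 4 is missing
    for (i = 1; i in a; i++)       # membership test, does not create a[i]
        print i "=" a[i]
}')
echo "$contiguous"
```

Here the loop visits indices 1 through 3 and never reaches a[5], because 4 is not in the array.
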
  • Even worse, if the array contains string keys, then array[pos] would NOT work because the key at position pos could be a string instead of the natural number, causing pos to be an invalid index.

So this also argues against trying to provide ordered traversal of arrays.

  • With for (var in array), the array is iterated in almost sequential order, except that the first element always comes last. Doesn't that look like a suspicious off-by-one bug somewhere?
    echo "one;three;54;3;86;seven" | awk '{split($0,a,";");for(i in a){print "a[" i "]=" a[i] }}'
    a[2]=three
    a[3]=54
    a[4]=3
    a[5]=86
    a[6]=seven
    a[1]=one

Arrays are implemented using hash tables. What you're seeing is how things hash. Since the number of items in the array is small, it looks like it's sequential, but if you put in a lot of elements (say 100), you'll see that the order isn't sequential at all. In short, there's no bug here.
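
One way to check this on a given awk is sketched below. It fills an array with indices 1 to 100, walks it with for (k in a), and prints how many elements were visited followed by how many times the index went down from one step to the next. The variable names are illustrative, and the second number depends on the implementation, so no particular value should be expected.

```shell
# Sketch: count how often for-in steps to a smaller index than the previous one.
result=$(awk 'BEGIN {
    for (i = 1; i <= 100; i++) a[i] = i
    prev = 0; visited = 0; out_of_order = 0
    for (k in a) {
        visited++
        if (k + 0 < prev) out_of_order++   # k + 0 forces a numeric comparison
        prev = k + 0
    }
    print visited, out_of_order
}')
echo "$result"
```

Every element is visited exactly once whatever the order, which is the only guarantee the loop gives.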

As described, ordered traversal isn't so simple. Gawk provides ways to do it. It isn't the default in awk both because it's difficult to define what the ordering should be when numbers and strings are mixed, and also because it adds an extra expensive step to the process: sorting. The cost for setting up an ordered traversal through a hash table, particularly when there are lots of elements, can be measured and it can be expensive. Making ordered traversal the default means that users are paying for a feature they rarely need, and that's not a nice way to write software.
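
For anyone who needs ordered traversal in plain awk, a rough sketch of the sorting step being described is to collect the indices, sort them, and then loop over the sorted list. The version below assumes all indices are numeric (as they are after split()) and uses a simple insertion sort; the `keys` array and the `sorted` shell variable are names chosen for this example. In gawk, the documented way to get a similar result is to set PROCINFO["sorted_in"], for example to "@ind_num_asc".

```shell
# Sketch: portable ordered traversal by sorting the indices first.
sorted=$(echo "one;three;54;3;86;seven" | awk '{
    split($0, a, ";")
    n = 0
    for (k in a) keys[++n] = k + 0       # collect indices as numbers
    for (i = 2; i <= n; i++) {           # insertion sort, ascending
        v = keys[i]
        for (j = i - 1; j >= 1 && keys[j] > v; j--)
            keys[j + 1] = keys[j]
        keys[j + 1] = v
    }
    for (i = 1; i <= n; i++)
        print "a[" keys[i] "]=" a[keys[i]]
}')
echo "$sorted"
```

The extra pass and the sort are exactly the cost mentioned above: cheap for six elements, but paid on every traversal if it were the default.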

I hope all this helps. Thanks.