nationalarchives / hms-nhs-scripts

MIT License
0 stars 0 forks source link

`panoptes_aggregation reduce` tries to `eval` text fields #34

Open bogden1 opened 2 years ago

bogden1 commented 2 years ago

The reduction step produces warning of these forms:

<string>:1: SyntaxWarning: 'int' object is not subscriptable; perhaps you missed a comma?
<string>:1: SyntaxWarning: list indices must be integers or slices, not ellipsis; perhaps you missed a comma?
<string>:1: SyntaxWarning: 'list' object is not callable; perhaps you missed a comma?

This appears to be due to text that contains [ or { getting eval'd during the reduction process, through this call chain:

reduce_panoptes_csv.py:reduce_subject -> csv_utils.py:unflatten_data

I guess the intent is to correctly translate text-ified Python data back into data, but this seems not right for pure text fields.

bogden1 commented 2 years ago

Running an instrumented version of panoptes_aggregation, with tag v4.0.0 (matching our requirements.txt) I have learned which strings are successfully eval'd.

I'm not sure what the consequences of this are. It looks like it could be worse if someone happens to write valid Python dictionary syntax in a text field. Some things get translated into strings, which is what they were in the first place. Sometimes a string gets converted into a list, e.g. [ ] becomes an empty list, [00] becomes a list with the single element 0.

The most common case seems to be '[...]', which gets converted into a list containing an instance of class Ellipsis, which is not ideal. The same goes for curlies ({...} gives a set containing an Ellipsis instance). The Ellipsis class does at least print as the word "Ellipsis", but I have no idea what the consequences may be of having unexpected Ellipsis objects sloshing around.

On the whole, it looks like the impact is not too bad. In phase1 I have (with counts):

  1 [  ]
112 [...]
  7 [[...]]
  1 []
  3 {...}
  7 [00]

This is for the text workflows only -- in phase 1 we also have a couple of dropdowns, where I think that this eval behaviour is actually what we want.

and in phase2 (which is incomplete) I have the following strings that are successfully eval'd:

  1   [    ]
  1   [ ]
395   [...]
  1   {...}
 52   [00]
  1   "For [...] to do as the is ordered by the Doctor"

That makes totals of:

131/1,231,500 cells for phase 1 (0.01% of transcribed cells) 451/1,292,564 cells for phase 2 (0.03% of transcribed cells)

A quick look at the phase 2 cases suggests that, other than ... being replaced with Ellipsis, these eval'd strings tend to make it through to the reduction output more or less as though they had not been eval'd.

Other strings are not successfully eval'd, which sometimes produces the SyntaxWarnings that alerted me to this problem in the first place. Unsuccessful eval results in the original value being retained, which I believe is the behaviour that we want.

I have confirmed that all of the SyntaxWarnings are produced by the attempted evals.

bogden1 commented 2 years ago

Given that the impact appears to be minimal, I think that this can be closed. However, we may want to draw Zooniverse's attention to it.

Update Reported it, they are tracking it here: https://github.com/zooniverse/aggregation-for-caesar/issues/654