open-contracting / ocdskit

A suite of command-line tools for working with OCDS data
https://ocdskit.readthedocs.io
BSD 3-Clause "New" or "Revised" License

OC4IDS transforms cli #125

Closed: kindly closed this 4 years ago

kindly commented 4 years ago

This is the main branch for the OCDS to OC4IDS conversion work.

kindly commented 4 years ago

@jpmckinney It would be good for you to look over this pull request, as you are the maintainer. I just want to check that you are happy with, and understand, the direction of the work.

Current Status

We have built a mini transformation framework to do the conversion (with a class for each transform). It may be a little over-engineered, but it means each transform can be tested mostly independently and that we can work on the transforms themselves in parallel (they are mostly independent, although some transforms need to be run in a certain order). Each transform has its own unit test(s), so we are confident they work. @duncandewhurst said he would look over the tests to make sure the logic of each rule in the docs is as expected, but it may be valuable for you to look at that too.
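To give a feel for the shape, here is a rough sketch (the names and details are illustrative, not the exact code in the branch):

```python
# Illustrative sketch only; the real classes in the branch differ in detail.
class BaseTransform:
    def __init__(self, compiled_releases, output):
        self.compiled_releases = compiled_releases
        self.output = output  # the OC4IDS project being built up
        self.success = False
        self.run()

    def run(self):
        raise NotImplementedError


class Title(BaseTransform):
    def run(self):
        for compiled_release in self.compiled_releases:
            title = compiled_release.get("planning", {}).get("project", {}).get("title")
            if title:
                self.output["title"] = title
                self.success = True
                return
```

Each transform class then gets its own small unit test with a minimal input and an expected output, which is what lets us work on them in parallel.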

The "Fields from docs sheet" tab in https://docs.google.com/spreadsheets/d/1xyKXbNktcfKm6siSzM_C7aHCOsOjWUQro5aU8ZYIHyc/edit#gid=690200833 outlines each mapping expressed in the OC4IDS docs and the transforms and config variables associated with them. It also has notes on how we intend to do each transform.

There are quite a few transforms still left to do; most are straightforward, but some may be too difficult, or require too much human judgement about the original data, to make work.

The CLI bit of the tool is mostly done but there will probably be a couple of extra arguments to add when the transforms are finished.

Work still to do

We were hoping to get this finished this week, but due to Christmas and a few illnesses we are now aiming for the end of next week, and we think we have the capacity to do that.

jpmckinney commented 4 years ago

Thanks, @kindly. I had opened #123 (and commented on the first commit) earlier, as there was no issue or Trello card that I could find to discuss. I've requested access to the spreadsheet.

My main feedback is around the overall design. OCDS Kit is careful to stream as much as possible (see docs). This new command should also stream wherever possible. The main points where changes would be required are:

  1. Presently, if the command receives releases, it sorts and merges them into records on its own.
    1. I don't know why ocdsmerge.sorted_releases is called (which requires loading all input into memory, and which just sorts releases by date), given that (1) the input might have releases with many OCIDs and (2) ocdskit.combine.merge will take care of sorting releases at the right time, without ever having the full input in memory (it sorts the releases for one OCID at a time).
    2. As far as I can tell, the releases that are passed in are only ever used to construct records (which are then re-worked into compiled releases with releases and embededReleases (typo) fields). In OCDS Kit, the pattern is to pipe data from one command to another, rather than to give a single command the responsibility of doing multiple steps. This also makes streaming feasible. If this command needs records, then it should only accept records. The documentation can show users an example of what to do if they are starting with releases (i.e. stream from the compile command – with whatever relevant options – into this new command). I can quickly close #122 so that the compile command has an option to output records (easy to stream) instead of record packages (harder to stream).
  2. Presently, the command loads all compiled releases into memory, and then each transform iterates over the compiled releases.
    1. As in #123, I don't see why the framework couldn't instead pull a record from the input (after the above changes), re-work its compiled release (same as it does now), and then run just that one compiled release through each transform. That would avoid loading the entire input into memory.
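Roughly, the streaming shape I have in mind is something like this (purely illustrative, with hypothetical names):

```python
# Hypothetical sketch of the streaming pattern, not the PR's current code.
def transform_records(records, transforms):
    for record in records:  # records arrive one at a time, e.g. parsed with ijson
        compiled_release = record.get("compiledRelease", {})
        output = {}
        for transform in transforms:
            transform(compiled_release, output)
        yield output
```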

Other observations:

  1. It looks like the tool assumes that the input all belongs to the same project (though I don't see any documentation of this). In OC4IDS, there seems to be no documented mapping for project identifiers between OC4IDS and OCDS – though I think there had been discussions of how to do this. Anyhow, if the previous assumption will hold for all-time, then the above performance concerns are somewhat less serious. However, I anticipate that we will document a mapping for project identifiers, in which case it will (presumably) be simpler for users to stream all their OCDS data to this command, rather than come up with their own solution for bundling inputs per project. In that scenario, we definitely need to address the concerns above.
  2. Given that Python 3.5 will soon be EOL, and given that Python 3.6 onward has insertion-ordered dicts, you only need OrderedDict if you are using methods like move_to_end. Otherwise, it just adds processing and memory overhead. I think we can furthermore drop the dict_cls option unless there is a real use case for it.
  3. Some of the if self.last_transforms[-1].success logic doesn't make sense to me. For example, if TitleFromTender always comes after Title, then it makes sense. However, the current framework gives users freedom to put any transforms in any order, in which case the previous transform might be anything, and its success or failure might not imply anything for what TitleFromTender should do…
  4. It looks like users can control the transforms both through the config options (which sometimes turn a transform into a no-op) or through transform_list (in which case they can simply remove the transform). The logic will be simplified if run_transforms used config to remove the transforms that would become no-ops from transform_list.
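For example, run_transforms could do something like this (a hypothetical sketch, not the current code):

```python
# Hypothetical: drop the transforms that config would turn into no-ops, up front.
def run_transforms(config, compiled_releases, transform_list):
    output = {}
    for transform in [t for t in transform_list if config.get(t.name, True)]:  # assumes each transform has a 'name'
        transform(compiled_releases, output)
    return output
```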

Some questions:

  1. Are there real use cases for giving users full control over both the choice and order of transforms? I can perhaps see a use case for choice, but I don't really see a use case for order.

jpmckinney commented 4 years ago

Also, I anticipate the tests file will become very long. Given that most tests follow the same format, I suggest creating a few reusable methods, so that it's easier for a developer or analyst to skim each test and quickly see what it's doing that's unique, and spend less time looking at the same boilerplate. This should be done sooner rather than later, as it will be more work the more tests are added.

kindly commented 4 years ago

> Thanks, @kindly. I had opened #123 (and commented on the first commit) earlier, as there was no issue or Trello card that I could find to discuss. I've requested access to the spreadsheet.

Sorry, I thought that was shared.

> My main feedback is around the overall design. OCDS Kit is careful to stream as much as possible (see docs). This new command should also stream wherever possible.

The design of the transform framework is to take a list of releases (or potentially records) and transform them into a single project. I was assuming that there would never be a project so large that we could not hold all its associated releases/contractingProcesses in memory, and as we are likely to stream out a project at a time, this has to be the case. Also, as the number of contractingProcesses per project is likely to be small, I assumed the overhead of looping through them per transform would not be a big burden.

As you said, there is currently no way to differentiate a project from OCDS data, but when there is, I assumed that is where the streaming would happen. So the streaming would happen at a higher level (something that calls the framework) rather than in the framework itself. There would have to be a mechanism, very much like the combine command, which gathers all releases/records of a particular project and holds them in memory (or in a SQLite database, like combine). This was the main reason for not requiring streaming of releases/records in the framework: I thought it unnecessary for streaming to happen at multiple levels, considering there will always be a fairly limited number of contractingProcesses per project. This could be avoided if we said that all records for a project had to appear together in the input stream, but even then a higher-level function could manage this once we decide on the rules.
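So the higher-level wrapper would look roughly like this (purely illustrative; project_id_for and run_transforms_for_project are placeholders, and the grouping rule is still to be decided):

```python
# Illustrative only: gather everything for each project in memory, then hand one
# project's worth of records to the transform framework at a time.
def stream_projects(records, project_id_for, run_transforms_for_project):
    grouped = {}
    for record in records:
        grouped.setdefault(project_id_for(record), []).append(record)
    for project_id, project_records in grouped.items():
        yield run_transforms_for_project(project_id, project_records)
```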

> The main points where changes would be required are:

>   1. Presently, if the command receives releases, it sorts and merges them into records on its own.

>     1. I don't know why ocdsmerge.sorted_releases is called (which requires loading all input into memory, and which just sorts releases by date), given that (1) the input might have releases with many OCIDs and (2) ocdskit.combine.merge will take care of sorting releases at the right time, without ever having the full input in memory (it sorts the releases for one OCID at a time).
>     2. As far as I can tell, the releases that are passed in are only ever used to construct records (which are then re-worked into compiled releases with releases and embededReleases (typo) fields).

Yes, Duncan spotted that typo too!

There are transforms, for example Cost Estimate (not yet written), that do require knowledge of releases, and they were ordered for convenience. These transforms are at the contractingProcess level, so it could be possible that, as long as the embeddedReleases (not linked releases) are always in the record, the input of the releases may not be necessary.

> In OCDS Kit, the pattern is to pipe data from one command to another, rather than to give a single command the responsibility of doing multiple steps. This also makes streaming feasible. If this command needs records, then it should only accept records. The documentation can show users an example of what to do if they are starting with releases (i.e. stream from the compile command – with whatever relevant options – into this new command). I can quickly close #122 so that the compile command has an option to output records (easy to stream) instead of record packages (harder to stream).

Embedded releases need to be in the input records for this to work, and without packages it limits the possibility of linked releases. There is no way in OCDS to specify both. This is a consideration, especially as linked releases (contractingProcesses/releases) are currently the only thing in the OC4IDS standard. If we are happy to forgo linked releases in OC4IDS then this could be possible.

My preference, for ergonomic reasons, is that it should be possible to specify any of releases / release packages / records (with embedded releases) / record packages (with embedded releases). At the moment only the first two are possible. I am nonetheless happy for these options to be restricted! The higher-level streaming function could also have a much more restricted input.

>   1. Presently, the command loads all compiled releases into memory, and then each transform iterates over the compiled releases.

>     1. As in #123, I don't see why the framework couldn't instead pull a record from the input (after the above changes), re-work its compiled release (same as it does now), and then run just that one compiled release through each transform. That would avoid loading the entire input into memory.

Some transforms rely on the success or failure of a previous transform, and that is only determined after all the compiled releases have been looked at, e.g. for Project Name we only look at tender.title if there is no planning.project.title in ANY compiled release. This could be achieved by holding some state against the object and then having a wrap-up function for each class that decides the final logic, but such code is much harder to reason about and to test, so it seemed unnecessary. Also, aggregate transforms (like sums and lasts) are more difficult to write and test, as they require holding state until the next record comes through. So, as I do not see a problem with memory or speed if we are only processing one project at a time, the simplified logic seems preferable.
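The Project Name case, written as a single function for clarity (illustrative, not the branch's exact code), shows why the whole list is needed at once:

```python
# Illustrative: the fallback to tender.title can only be decided after checking
# planning.project.title across ALL compiled releases.
def project_title(compiled_releases, output):
    for compiled_release in compiled_releases:
        title = compiled_release.get("planning", {}).get("project", {}).get("title")
        if title:
            output["title"] = title
            return
    for compiled_release in compiled_releases:
        title = compiled_release.get("tender", {}).get("title")
        if title:
            output["title"] = title
            return
```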

> Other observations:

>   1. It looks like the tool assumes that the input all belongs to the same project (though I don't see any documentation of this). In OC4IDS, there seems to be no documented mapping for project identifiers between OC4IDS and OCDS – though I think there had been discussions of how to do this. Anyhow, if the previous assumption will hold for all-time, then the above performance concerns are somewhat less serious. However, I anticipate that we will document a mapping for project identifiers, in which case it will (presumably) be simpler for users to stream all their OCDS data to this command, rather than come up with their own solution for bundling inputs per project. In that scenario, we definitely need to address the concerns above.

When this is agreed, that is the level at which the streaming should happen, and it should be above the framework. The transforms themselves can be oblivious to this change if their input is a list of records and a project_id and their output is a project. This separates concerns and makes the framework easier to reason about than if we added the concept of multiple projects within it.

>   1. Given that Python 3.5 will soon be EOL, and given that Python 3.6 onward has insertion-ordered dicts, you only need OrderedDict if you are using methods like move_to_end. Otherwise, it just adds processing and memory overhead. I think we can furthermore drop the dict_cls option unless there is a real use case for it.

Happy to remove that!

>   1. Some of the if self.last_transforms[-1].success logic doesn't make sense to me. For example, if TitleFromTender always comes after Title, then it makes sense. However, the current framework gives users freedom to put any transforms in any order, in which case the previous transform might be anything, and its success or failure might not imply anything for what TitleFromTender should do…

The logic here is a bit slack; it should have something more like and self.last_transforms[-1].__class__.__name__ == 'Title'.

The idea was not to allow transforms to be put in any order; transform_list is more of a testing convenience, which we could remove from the public API.

>   1. It looks like users can control the transforms both through the config options (which sometimes turn a transform into a no-op) or through transform_list (in which case they can simply remove the transform). The logic will be simplified if run_transforms used config to remove the transforms that would become no-ops from transform_list.

This is a consideration. I initially gave each class a 'name' class variable and was going to use that as a way to specify which transforms would be run.

> Some questions:

>   1. Are there real use cases for giving users full control over both the choice and order of transforms? I can perhaps see a use case for choice, but I don't really see a use case for order.

See above: my current thinking is that transform_list should not be a public parameter. It is more of a testing convenience.

> Also, I anticipate the tests file will become very long. Given that most tests follow the same format, I suggest creating a few reusable methods, so that it's easier for a developer or analyst to skim each test and quickly see what it's doing that's unique, and spend less time looking at the same boilerplate. This should be done sooner rather than later, as it will be more work the more tests are added.

The boilerplate is better later in the tests, as those use run_transforms and transform_list (and we need to back-port that to fix some earlier ones). There may be some simplifications to make to this, and to the source releases, so that the data changes are easier to read.

jpmckinney commented 4 years ago

Thanks for clarifying the requirements! To summarize:

Other questions:

It'd be good to:

If you're concerned about the ease of testing the last two changes, you can always have another method (_next_transform(…)) that encapsulates the logic about which transform to run next. run_transforms can then do something like while _next_transform(…):.

Ideally, it'd be great to reduce the transforms down to just plain methods that accept output and compiled_releases (and I suppose releases as well, for unimplemented transforms) (i.e. "modify this output using this input"), and return a tuple of the output and a success boolean. Having classes that store state and that care about the context of previous transforms is more complicated. This change should be fairly straightforward after the listed changes, as it should just require removing self. and replacing the class line with a def line.
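In other words, each transform would end up looking something like this (illustrative only, with a made-up example transform):

```python
# Illustrative: a transform as a plain function with no class or shared state.
def public_authority_role(output, compiled_releases):
    parties = [
        party
        for compiled_release in compiled_releases
        for party in compiled_release.get("parties", [])
        if "publicAuthority" in party.get("roles", [])
    ]
    if not parties:
        return output, False
    output.setdefault("parties", []).extend(parties)
    return output, True
```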

I'll let Duncan review the logic of transforms first, but I don't think some transforms are correct, e.g. locations are determined by taking the first location(s) found in any compiled release… but a project can occur in many locations, and should be the (unique) aggregation of all locations from all compiled releases (the OC4IDS mapping is perhaps ambiguous). There are also errors like /items instead of /tender/items, and I notice that AdministrativeEntity starts by copying the administrativeEntity parties, but then later claims to 'skip' the transform if multiple are found.

kindly commented 4 years ago

I have implemented most of your suggestions.

All transforms are now functions. However, currently they all accept a single state object containing compiled_releases, releases and output. I do not want to fix the arguments yet, until I am sure there are not any awkward transforms that require different inputs that could be pre-processed.

I have removed config entirely from the transforms; it is now handled by the running function.

Part of the requirements I got from @duncandewhurst was that a core set of transforms should always be run, with config options for running the additional ones. At the moment I detect the additional ones by looking in the docstring of the transform; this may not be ideal in the long run, and we could just have a separate list of the optional ones.

Also, the concept of success is mostly gone (even though the functions still return success or failure), and I may remove it entirely soon. Currently, all dependent transforms can tell from the output data of previous ones whether the previous transform was successful. So at the moment there is nothing that needs ordering except 'contract_process_setup', which is run first.
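The current shape is roughly this (a simplified sketch; contract_process_setup is from the branch, but State, CORE_TRANSFORMS and OPTIONAL_TRANSFORMS are just placeholders here):

```python
# Simplified sketch; details differ from the branch.
class State:
    def __init__(self, compiled_releases, releases, output):
        self.compiled_releases = compiled_releases
        self.releases = releases
        self.output = output


def contract_process_setup(state):
    # Always run first: creates one contractingProcess entry per compiled release.
    state.output["contractingProcesses"] = [
        {"id": cr.get("ocid"), "summary": {"ocid": cr.get("ocid")}}
        for cr in state.compiled_releases
    ]


CORE_TRANSFORMS = [contract_process_setup]
OPTIONAL_TRANSFORMS = []  # the transforms gated behind config options


def run_transforms(config, compiled_releases, releases, project_id=None):
    state = State(compiled_releases, releases, {"id": project_id})
    for transform in CORE_TRANSFORMS + [t for t in OPTIONAL_TRANSFORMS if config.get(t.__name__)]:
        transform(state)
    return state.output
```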

jpmckinney commented 4 years ago

Great! A few small tweaks:

If success booleans are going away (great!), then I won't suggest anything relating to them.

kindly commented 4 years ago

@duncandewhurst Most transforms are done now. Would you be able to spend some of today or tomorrow looking at the tests? You can comment in the spreadsheet if that is easier.

jpmckinney commented 4 years ago

I merged master, to get pytest passing.

Update: Remaining failures are due to isort in this PR's files.

coveralls commented 4 years ago

Coverage decreased (-0.008%) to 95.896% when pulling ed636634ab91497c2fad773da68dc8739f793e4d on oc4ids-transforms-cli into 3ef9dd5833533527c64069b9316cdce0c8065949 on master.

duncandewhurst commented 4 years ago

Thanks @kindly - I've reviewed the tests, but not the logic of the actual transforms in oc4ids_transforms.py. I've made comments on individual tests, but some are generally applicable, e.g. handling clashing ids when copying from arrays, and testing for one-to-many scenarios at the project -> contracting process level and at the contracting process -> award/contract level.

kindly commented 4 years ago

I am fairly happy with this pull request now. Everything can be improved but this feels like a decent first attempt.

There are a few transforms that are missing. Here is the issue about this: #129.

There are some contentious areas about how we deal with multiple-process cases. Issue #130 has been raised to discuss them.

@duncandewhurst @jpmckinney it would be good if you could review this.

duncandewhurst commented 4 years ago

Thanks, @kindly. I'm happy for this to be merged once the following is done:

kindly commented 4 years ago

@duncandewhurst The above has been done.

@jpmckinney would you be able to do a final review?

kindly commented 4 years ago

@jpmckinney I have covered the points in your comments. It would be good for you to look over the type checking again.

jpmckinney commented 4 years ago

We should add decimal.Decimal to the if-statement in cast_string, in case the JSON was parsed into decimals (which is always the case with ijson). Similarly, in cast_number_or_zero, we could do return decimal.Decimal(item) and add decimal.Decimal to the if-statement – to avoid losing accuracy if the data was parsed to decimals.

In both those methods, should we have if item: before the log message like in check_type? We don't check for the presence of keys in the calling code, so when a key isn't set, we'll end up with a log message due to the missing key, rather than due to a bad value.
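With those changes, the helpers might end up looking roughly like this (a sketch only, not the branch's actual code):

```python
# Sketch only: Decimal added to the isinstance checks, Decimal used for string
# conversion, and `if item:` guarding the log message, as suggested above.
import decimal
import logging

logger = logging.getLogger(__name__)


def cast_string(item):
    if isinstance(item, (str, int, float, decimal.Decimal)):
        return str(item)
    if item:  # only warn when a value is present but of the wrong type
        logger.warning("Unable to cast %r to a string", item)
    return ""


def cast_number_or_zero(item):
    if isinstance(item, (int, float, decimal.Decimal)):
        return item
    try:
        return decimal.Decimal(item)
    except (decimal.InvalidOperation, TypeError):
        if item:
            logger.warning("Unable to cast %r to a number; using 0", item)
        return 0
```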

kindly commented 4 years ago

> We should add decimal.Decimal to the if-statement in cast_string, in case the JSON was parsed into decimals (which is always the case with ijson). Similarly, in cast_number_or_zero, we could do return decimal.Decimal(item) and add decimal.Decimal to the if-statement – to avoid losing accuracy if the data was parsed to decimals.

For cast_string this is no problem.

For cast_number_or_zero, we have an issue if some numbers are floats and some (that were strings originally) are decimals, as adding them up will mean a crash: float + Decimal does not work.
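For example:

```python
from decimal import Decimal

try:
    1.5 + Decimal("2.5")
except TypeError as e:
    print(e)  # unsupported operand type(s) for +: 'float' and 'decimal.Decimal'
```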

A Decimal(float) is a lot uglier than a float(Decimal), so keeping them as floats makes sense here.

The only other options are:

What do you think?

> In both those methods, should we have if item: before the log message like in check_type? We don't check for the presence of keys in the calling code, so when a key isn't set, we'll end up with a log message due to the missing key, rather than due to a bad value.

Good catch! I was meaning to do that and then forgot.

jpmckinney commented 4 years ago

> A Decimal(float) is a lot uglier than a float(Decimal), so keeping them as floats makes sense here.

I'm not sure that there's an issue. decimal.Decimal(1.1) has an ugly internal representation (1.100000000000000088817841970012523233890533447265625), but this ends up getting rounded again if it's converted to float (float(decimal.Decimal(1.1)) is 1.1), which is what happens when OCDS Kit serializes back to JSON, as we haven't implemented a careful decimal serializer (see ocdskit.util._default).

kindly commented 4 years ago

Great, that is no problem then. Just tried converting everything to Decimal and, if it fails, returning 0.

jpmckinney commented 4 years ago

Great, thanks!