Closed by @kindly 4 years ago
@jpmckinney It would be good for you to look over this pull request as you are the maintainer. I just want to check you are happy and understand the direction of the work.
We have built a mini transformation framework to do the conversion (with a class for each transform). It may be a little over-engineered, but it means each transform can be tested mostly independently, and that we can work on the transforms themselves in parallel (they are mostly independent, though some transforms need to be run in a certain order). Each transform has its own unit test(s), so we are confident they work. @duncandewhurst said he would look over the tests to make sure the logic of each rule in the docs is as expected, but it may be valuable for you to look at that too.
The "Fields from docs sheet" in https://docs.google.com/spreadsheets/d/1xyKXbNktcfKm6siSzM_C7aHCOsOjWUQro5aU8ZYIHyc/edit#gid=690200833 outlines each mapping expressed in the OC4IDS docs and the transforms and config variable that are associated with them. It also has notes as to how we intend to do each transform.
There are quite a few transforms still left to do. Most are straightforward, but some may be too difficult, or require too much human judgement about the original data, to make work.
The CLI bit of the tool is mostly done but there will probably be a couple of extra arguments to add when the transforms are finished.
We were hoping to get this finished this week but due to xmas and a few illnesses we are now aiming for the end of next week and we think we have the capacity to do that.
Thanks, @kindly. I had opened #123 (and commented on the first commit) earlier, as there was no issue or Trello card that I could find to discuss. I've requested access to the spreadsheet.

My main feedback is around the overall design. OCDS Kit is careful to stream as much as possible (see docs). This new command should also stream wherever possible. The main points where changes would be required are:

Presently, if the command receives releases, it sorts and merges them into records on its own.

- I don't know why `ocdsmerge.sorted_releases` is called (which requires loading all input into memory, and which just sorts releases by date), given that (1) the input might have releases with many OCIDs and (2) `ocdskit.combine.merge` will take care of sorting releases at the right time, without ever having the full input in memory (it sorts the releases for one OCID at a time).
- As far as I can tell, the `releases` that are passed in are only ever used to construct records (which are then re-worked into compiled releases with `releases` and `embededReleases` (typo) fields).
- In OCDS Kit, the pattern is to pipe data from one command to another, rather than to give a single command the responsibility of doing multiple steps. This also makes streaming feasible. If this command needs records, then it should only accept records. The documentation can show users an example of what to do if they are starting with releases (i.e. stream from the compile command – with whatever relevant options – into this new command). I can quickly close #122 so that the compile command has an option to output records (easy to stream) instead of record packages (harder to stream).

Presently, the command loads all compiled releases into memory, and then each transform iterates over the compiled releases.

- As in #123, I don't see why the framework couldn't instead pull a record from the input (after the above changes), re-work its compiled release (same as it does now), and then run just that one compiled release through each transform. That would avoid loading the entire input into memory.

Other observations:

- It looks like the tool assumes that the input all belongs to the same project (though I don't see any documentation of this). In OC4IDS, there seems to be no documented mapping for project identifiers between OC4IDS and OCDS – though I think there had been discussions of how to do this. Anyhow, if the previous assumption will hold for all time, then the above performance concerns are somewhat less serious. However, I anticipate that we will document a mapping for project identifiers, in which case it will (presumably) be simpler for users to stream all their OCDS data to this command, rather than come up with their own solution for bundling inputs per project. In that scenario, we definitely need to address the concerns above.
- Given that Python 3.5 will soon be EOL, and given that Python 3.6 forward have insertion-ordered dicts, you only need OrderedDict if you are using methods like `move_to_end`. Otherwise, it just adds pressure to processing and memory. I think we can furthermore drop the `dict_cls` option unless there is a real use case for it.
- Some of the `if self.last_transforms[-1].success` logic doesn't make sense to me. For example, if `TitleFromTender` always comes after `Title`, then it makes sense. However, the current framework gives users freedom to put any transforms in any order, in which case the previous transform might be anything, and its success or failure might not imply anything for what `TitleFromTender` should do…
- It looks like users can control the transforms both through the `config` options (which sometimes turn a transform into a no-op) or through `transform_list` (in which case they can simply remove the transform). The logic will be simplified if `run_transforms` used `config` to remove the transforms that would become no-ops from `transform_list`.

Some questions:

- Are there real use cases for giving users full control over both the choice and order of transforms? I can perhaps see a use case for choice, but I don't really see a use case for order.

Also, I anticipate the tests file will become very long. Given that most tests follow the same format, I suggest creating a few reusable methods, so that it's easier for a developer or analyst to skim each test and quickly see what it's doing that's unique, and spend less time looking at the same boilerplate. This should be done sooner rather than later, as it will be more work the more tests are added.
> Thanks, @kindly. I had opened #123 (and commented on the first commit) earlier, as there was no issue or Trello card that I could find to discuss. I've requested access to the spreadsheet.

Sorry, I thought that was shared.

> My main feedback is around the overall design. OCDS Kit is careful to stream as much as possible (see docs). This new command should also stream wherever possible.
The design of the transform framework is to take a list of releases (or potentially records) and transform them into a single project. I was assuming that no project would ever be so large that we could not hold all its associated releases/contractingProcesses in memory; as we are likely to stream out a project at a time, this has to be the case. Also, as the number of contractingProcesses per project is likely to be small, I assumed the overhead of looping through them per transform would not be a big burden.

As you said, there is currently no way to differentiate a project from OCDS data, but when there is, I assumed that would be where the streaming would happen. So the streaming would happen at a higher level (something that calls the framework itself). There would have to be a mechanism very much like the combine command, which gathered all releases/records of a particular project and held them in memory (or in a sqlite database, like combine). This was the main reason for not requiring streaming of releases/records in the framework: I thought it unnecessary for streaming to happen at multiple levels, considering there will always be a fairly limited number of contractingProcesses per project. This could be avoided if we said that all projects had to appear together in the record input stream, but even then a higher-level function could manage this once we decided on the rules.
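For illustration, that higher-level wrapper might look like this sketch (all names are hypothetical; it assumes, as mentioned, that all records for a project appear together in the input stream, and that the framework exposes some per-project entry point):

```python
from itertools import groupby

def stream_projects(records, transform_to_project, project_id_of):
    # Hypothetical higher-level wrapper: group the record stream by
    # project, holding only one project's records in memory at a time,
    # then hand each group to the (non-streaming) transform framework.
    # Both callables are placeholders for whatever the framework exposes.
    for project_id, group in groupby(records, key=project_id_of):
        yield transform_to_project(list(group), project_id)
```

This keeps the streaming concern entirely outside the framework, which only ever sees one project's records at a time.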
> The main points where changes would be required are:
>
> Presently, if the command receives releases, it sorts and merges them into records on its own.
>
> - I don't know why `ocdsmerge.sorted_releases` is called (which requires loading all input into memory, and which just sorts releases by date), given that (1) the input might have releases with many OCIDs and (2) `ocdskit.combine.merge` will take care of sorting releases at the right time, without ever having the full input in memory (it sorts the releases for one OCID at a time).
> - As far as I can tell, the `releases` that are passed in are only ever used to construct records (which are then re-worked into compiled releases with `releases` and `embededReleases` (typo) fields).
Yes, Duncan spotted that typo too!

There are transforms, for example Cost Estimate (not yet written), that do require knowledge of releases; they were ordered for convenience. These transforms are at the contractingProcess level, so as long as the embeddedReleases (not linked releases) are always in the record, the separate releases input may not be necessary.
> In OCDS Kit, the pattern is to pipe data from one command to another, rather than to give a single command the responsibility of doing multiple steps. This also makes streaming feasible. If this command needs records, then it should only accept records. The documentation can show users an example of what to do if they are starting with releases (i.e. stream from the compile command – with whatever relevant options – into this new command). I can quickly close #122 so that the compile command has an option to output records (easy to stream) instead of record packages (harder to stream).
Embedded releases need to be in the input records for this to work, and going without packages limits the possibility of linked releases. There is no way in OCDS to specify both. This is a consideration, especially as linked releases (contractingProcesses/releases) are the only thing currently in the OC4IDS standard. If we are happy to forgo linked releases in OC4IDS, then this could be possible.

My preference, for ergonomic reasons, is that it should be possible to specify any of releases, release packages, records (with embedded releases) or record packages (with embedded releases). At the moment only the first two are possible. I am nonetheless happy for these options to be restricted! The higher-level streaming function could also have a much more restricted input.
> Presently, the command loads all compiled releases into memory, and then each transform iterates over the compiled releases.
>
> - As in #123, I don't see why the framework couldn't instead pull a record from the input (after the above changes), re-work its compiled release (same as it does now), and then run just that one compiled release through each transform. That would avoid loading the entire input into memory.
Some transforms rely on the success or failure of a previous transform, and that is only determined after all of the compiled releases are looked at, e.g. for Project Name we only look at `tender.title` if there is no `planning.project.title` in ANY compiled release. This could be achieved by holding some state against the object and then having a wrap-up function for each class that decides the final logic, but such code is much harder to reason about and to test, so it seemed unnecessary. Also, aggregate transforms (like sums and lasts) are more difficult to write and test, as they require holding state for each one until the next record comes through. So, as I did not see a problem with memory or speed if we are only processing one project at a time, the simplified logic seems preferable.
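For example, the Project Name rule described above can be expressed over the whole list of compiled releases without per-transform success state (a sketch; the exact field handling in the PR may differ):

```python
def project_title(compiled_releases):
    # Sketch of the Project Name rule as described: use tender.title
    # only if NO compiled release has a planning.project.title.
    for release in compiled_releases:
        title = release.get("planning", {}).get("project", {}).get("title")
        if title:
            return title
    for release in compiled_releases:
        title = release.get("tender", {}).get("title")
        if title:
            return title
    return None
```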
> Other observations:
>
> - It looks like the tool assumes that the input all belongs to the same project (though I don't see any documentation of this). In OC4IDS, there seems to be no documented mapping for project identifiers between OC4IDS and OCDS – though I think there had been discussions of how to do this. Anyhow, if the previous assumption will hold for all time, then the above performance concerns are somewhat less serious. However, I anticipate that we will document a mapping for project identifiers, in which case it will (presumably) be simpler for users to stream all their OCDS data to this command, rather than come up with their own solution for bundling inputs per project. In that scenario, we definitely need to address the concerns above.
When this is agreed, that will be the level at which the streaming should happen, and it should sit above the framework. The transforms themselves can be oblivious to this change if their input is a list of records and a project_id, and their output is a project. This separates concerns and makes the framework easier to reason about than if we added the concept of multiple projects within it.
> - Given that Python 3.5 will soon be EOL, and given that Python 3.6 forward have insertion-ordered dicts, you only need OrderedDict if you are using methods like `move_to_end`. Otherwise, it just adds pressure to processing and memory. I think we can furthermore drop the `dict_cls` option unless there is a real use case for it.
Happy to remove that!
> - Some of the `if self.last_transforms[-1].success` logic doesn't make sense to me. For example, if `TitleFromTender` always comes after `Title`, then it makes sense. However, the current framework gives users freedom to put any transforms in any order, in which case the previous transform might be anything, and its success or failure might not imply anything for what `TitleFromTender` should do…
The logic here is a bit slack; it should have something like `and isinstance(self.last_transforms[-1], Title)`.

The idea was not to allow transforms to be put in any order; the transform_list is more a testing convenience, which we could remove from the public API.
> - It looks like users can control the transforms both through the `config` options (which sometimes turn a transform into a no-op) or through `transform_list` (in which case they can simply remove the transform). The logic will be simplified if `run_transforms` used `config` to remove the transforms that would become no-ops from `transform_list`.
This is a consideration. I initially had each class have a 'name' class variable and was going to use that as a way to specify transforms that would be run.
> Some questions:
>
> - Are there real use cases for giving users full control over both the choice and order of transforms? I can perhaps see a use case for choice, but I don't really see a use case for order.
See above; my current thinking is that transform_list should not be a public parameter. It is more a testing convenience.
> Also, I anticipate the tests file will become very long. Given that most tests follow the same format, I suggest creating a few reusable methods, so that it's easier for a developer or analyst to skim each test and quickly see what it's doing that's unique, and spend less time looking at the same boilerplate. This should be done sooner rather than later, as it will be more work the more tests are added.
The boilerplate is better later in the tests, as they use run_transforms and transform_list (we need to back-port that and fix some earlier ones). There may be some simplifications to this, and to the source releases, to make the data changes easier to read.
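One shape for the reusable helper suggested above (hypothetical names; the runner is passed in so the helper stays framework-agnostic and self-contained):

```python
def check_transform(run_transforms, transforms, releases, expected):
    # Run only the transforms under test against the given source
    # releases, then compare selected fields of the resulting project.
    # This keeps each test down to its unique inputs and expectations.
    project = run_transforms(releases, transform_list=transforms)
    for field, value in expected.items():
        assert project.get(field) == value, (field, project.get(field), value)
    return project
```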
Thanks for clarifying the requirements! To summarize:

- Some transforms will need `self.releases`, and will need them to be sorted; as such, it makes sense to pre-sort in the initial state.
- Since the command needs `packages` (to calculate the linked release), it makes sense for this command to handle merging on its own.

Other questions:

- I don't see `self.records` ever being used in subclasses of BaseTransform. Is it needed?

It'd be good to:

- Drop the `dict_cls` option.
- Check `config` in `run_transforms`, rather than having each transform check a config variable to decide whether to run at all. That way, each transform has one less responsibility.
- Make it the runner's (`run_transform`'s) responsibility to determine whether to try another transform if one failed. As above, this simplifies the logic; transforms just transform – they don't care about context. They can just return `True` if successful.

If you're concerned about the ease of testing the last two changes, you can always have another method (`_next_transform(…)`) that encapsulates the logic about which transform to run next. `run_transforms` can then do something like `while _next_transform(…):`.
Ideally, it'd be great to reduce the transforms down to just plain methods that accept `output` and `compiled_releases` (and, I suppose, `releases` as well, for unimplemented transforms) – i.e. "modify this output using this input" – and return a tuple of the output and a success boolean. Having classes that store state and that care about the context of previous transforms is more complicated. This change should be fairly straightforward after the listed changes, as it should just require removing `self.` and replacing the `class` line with a `def` line.
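A sketch of that plain-function shape, using a hypothetical title transform as an example (the body is illustrative, not the PR's actual logic):

```python
def title_from_tender(output, compiled_releases):
    # Illustrative transform in the proposed plain-function shape:
    # "modify this output using this input", returning (output, success).
    if output.get("title"):          # already set by an earlier transform
        return output, False
    for release in compiled_releases:
        title = release.get("tender", {}).get("title")
        if title:
            output["title"] = title
            return output, True
    return output, False
```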
I'll let Duncan review the logic of the transforms first, but I don't think some transforms are correct, e.g. locations are determined by taking the first location(s) found in any compiled release… but a project can occur in many locations, and should be the (unique) aggregation of all locations from all compiled releases (the OC4IDS mapping is perhaps ambiguous). There are also errors like `/items` instead of `/tender/items`, and I notice that `AdministrativeEntity` starts by copying the `administrativeEntity` parties, but then later claims to 'skip' the transform if multiple are found.
I have implemented most of your suggestions.

All transforms are now functions. However, currently they all accept a single state object containing compiled_releases, releases and output. I do not want fixed arguments yet, until I am sure there are not any awkward transforms that require different inputs that could be pre-processed.

I have removed config entirely from the transforms; it is now handled by the running function.

Part of the requirements I got from @duncandewhurst was that a core set of transforms should always be run, with config options for running the additional ones. At the moment I detect the additional ones by looking in the docstring of the transform; this may not be ideal in the long run, and we could just have a separate list of the optional ones.

Also, the concept of success is gone (even though the functions still return success or failure), and I may remove the concept entirely soon. Currently, all dependent transforms can tell from the output data of previous ones whether the previous transform was successful. So at the moment there is nothing that needs ordering except 'contract_process_setup', which is run first.
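The docstring-based detection of optional transforms described above could be sketched like this (hypothetical signature and marker; a separate explicit list may end up being clearer):

```python
def run_transforms(state, transforms, options=()):
    # Sketch of the running function described above: core transforms
    # always run; a transform whose docstring marks it as optional runs
    # only when it is named in options.
    for transform in transforms:
        optional = "optional" in (transform.__doc__ or "")
        if optional and transform.__name__ not in options:
            continue
        transform(state)
    return state
```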
Great! A few small tweaks:

- Make `transform_list` all-caps to indicate it's a constant, defined outside the method. When I first read the method, I thought it was a missing local variable.
- Use `if var` instead of `if var is not ""` or `if len(var)` or `if len(var) > 0`, and just use `if not var` instead of `if len(var) == 0`, etc.

If success booleans are going away (great!), then I won't suggest anything relating to them.
@duncandewhurst Most transforms are done. Would you be able to spend some of today or tomorrow looking at the tests? You can comment in the spreadsheet if that's easier.
I merged master, to get pytest passing.
Update: Remaining failures are due to `isort` in this PR's files.
Thanks @kindly - I've reviewed the tests, but not the logic of the actual transforms in `oc4ids_transforms.py`. I've made comments on individual tests, but some are generally applicable, e.g. handling clashing ids when copying from arrays, and testing for one-to-many scenarios at the project -> contracting process level and at the contracting process -> award/contract levels.
I am fairly happy with this pull request now. Everything can be improved, but this feels like a decent first attempt.

There are a few transforms that are missing; issue #129 is about this.

There are some contentious areas around how we deal with multiple-process cases; issue #130 has been raised to discuss them.

@duncandewhurst @jpmckinney it would be good to review this.
Thanks @kindly. I'm happy for this to be merged once the following is done:

- add note to documentation on handling of objects in arrays where ids differ across contracting processes but other properties are the same
- add note to documentation on handling of duplicate parties across contracting processes, including what happens when `id` is provided but not `identifier`
- add note to documentation on handling of project sector
- add note to docs to explain that if you supply unpackaged releases then they will be embedded in the output
- add note to docs to explain project scope output (adding `items` and `milestones`)
- open issue on expanding location transform tests
@duncandewhurst The above has been done.
@jpmckinney would you be able to do a final review?
@jpmckinney I have covered the points in your comments. The type checking would be good for you to look over again.
We should add `decimal.Decimal` to the if-statement in `cast_string`, in case the JSON was parsed into decimals (which is always the case with ijson). Similarly, in `cast_number_or_zero`, we could do `return decimal.Decimal(item)` and add `decimal.Decimal` to the if-statement – to avoid losing accuracy if the data was parsed to decimals.

In both those methods, should we have `if item:` before the log message, like in `check_type`? We don't check for the presence of keys in the calling code, so when a key isn't set, we'll end up with a log message due to the missing key, rather than due to a bad value.
> We should add `decimal.Decimal` to the if-statement in `cast_string`, in case the JSON was parsed into decimals (which is always the case with ijson). Similarly, in `cast_number_or_zero`, we could do `return decimal.Decimal(item)` and add `decimal.Decimal` to the if-statement – to avoid losing accuracy if the data was parsed to decimals.
For `cast_string` this is no problem.

For `cast_number_or_zero`, we have an issue if some numbers are floats and some (that were strings originally) are decimals, as adding them up will mean a crash: float + Decimal does not work.

A `Decimal(float)` is a lot uglier than a `float(Decimal)`, so keeping them as floats makes sense here.

The only other options are:

- use the decimal option so we know that there will be no floats in the JSON

What do you think?
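The crash described above is easy to reproduce: Python refuses to mix float and Decimal in arithmetic:

```python
from decimal import Decimal

total = 0.5 + 1.5                    # two floats: fine
try:
    total = 0.5 + Decimal("1.5")     # float + Decimal raises TypeError
except TypeError as exc:
    print("float + Decimal fails:", exc)
```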
> In both those methods, should we have `if item:` before the log message, like in `check_type`? We don't check for the presence of keys in the calling code, so when a key isn't set, we'll end up with a log message due to the missing key, rather than due to a bad value.
Good catch! I was meaning to do that, then forgot.
> A `Decimal(float)` is a lot uglier than a `float(Decimal)`, so keeping them as floats makes sense here.
I'm not sure that there's an issue. `decimal.Decimal(1.1)` has an ugly internal representation (`1.100000000000000088817841970012523233890533447265625`), but this ends up getting rounded again if it's converted to float (`float(decimal.Decimal(1.1))` is `1.1`), which is what happens when OCDS Kit serializes back to JSON, as we haven't implemented a careful decimal serializer (see `ocdskit.util._default`).
Great, that is no problem then. I've just made it try to convert everything to decimal and, if it fails, return 0.
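A sketch of that final behaviour for `cast_number_or_zero` (the exact implementation in the PR may differ):

```python
from decimal import Decimal, InvalidOperation

def cast_number_or_zero(item):
    # Sketch of the agreed behaviour: convert anything to Decimal,
    # returning 0 when conversion fails (including None or bad strings).
    try:
        return Decimal(str(item))
    except InvalidOperation:
        return Decimal(0)
```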
Great, thanks!
This is the main branch of the OCDS to OC4IDS conversion work.