Following discussion with @kd-ods and @ScatteredInk, updating this card not just to focus on better error messaging around maximum file size, but also to document the maximum data file size allowed and see how we can increase that, in order to allow analysis and review of larger BODS datasets, which are starting to be released.
Kadie/Jack: you referenced similar work done recently to expand the capacity of IATI's validator. Is this the version 2 work referenced at https://github.com/IATI/IATI-Validator-Web/releases and https://github.com/IATI/js-validator-api/releases? Appreciate it if you can add in any useful repo links so we can flesh out this issue
@odscjames Flagging this issue
@StephenAbbott Can you be exact about what you tried when you got the error?
@odscjames
I ran this on my laptop (if you download https://pypi.org/project/libcovebods/ there is a CLI tool included). It took 18 minutes. I have the results file as JSON.
So my first thought is that getting this down to the 20 seconds or so that a web request can really take before people start to get impatient is going to be very challenging.
A first take on this is maybe to put in a message queue, a background worker and a holding screen saying "please wait, your file is being processed - don't leave this page".
18 minutes is still a long time to stare at a web page - ideally we would then work to get it down, or at least provide some kind of progress count somehow. But at least it's better than a timeout and a crash.
In terms of documenting a limit, it's hard to be exact about what the limit would be. If we create a holding page for large files, we can put messages there, and then people will see them only when relevant.
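For concreteness, a minimal sketch of that shape, assuming something like Celery is wired into the existing Django app (all names here are illustrative, not the current cove-bods code):

```python
# Illustrative sketch only: assumes Celery has been added to the Django project.
from celery import shared_task
from celery.result import AsyncResult
from django.http import JsonResponse
from django.shortcuts import render


@shared_task
def run_checks_task(file_path):
    # Placeholder for the existing libcovebods processing, moved off the web request.
    ...


def start_checks(request, file_path):
    task = run_checks_task.delay(file_path)
    # Holding page: "please wait, your file is being processed - don't leave this page".
    return render(request, "please_wait.html", {"task_id": task.id})


def check_progress(request, task_id):
    # Polled by the holding page so it can redirect once results are ready.
    return JsonResponse({"ready": AsyncResult(task_id).ready()})
```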
I just ran `/usr/bin/time -v` and got `Maximum resident set size (kbytes): 3234076`. That's ~3GB.
Currently the entire JSON file is loaded in memory in one go. Because of the streaming design of BODS data we should be able to bring in https://pypi.org/project/ijson/ and make a big dent in that.
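A minimal sketch of that, assuming the file is a standard BODS JSON array of statements (the function and file names are illustrative):

```python
import ijson


def iter_statements(path):
    """Yield BODS statements one at a time instead of loading the whole array."""
    with open(path, "rb") as f:
        # "item" addresses each element of the top-level JSON array.
        for statement in ijson.items(f, "item"):
            yield statement


# Example: count statements by type without holding the whole file in memory.
counts = {}
for statement in iter_statements("statements.json"):
    statement_type = statement.get("statementType", "unknown")
    counts[statement_type] = counts.get(statement_type, 0) + 1
print(counts)
```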
To attempt to estimate how much, I opened a Python prompt and loaded the JSON file into RAM only. `ps aux` reports VSZ=1681184 & RSS=1652408, i.e. ~1.6GB.
That seems like a large gap, but we have to keep a lot of state as we process the file, so that's maybe fair enough (e.g. we keep a list of IDs so we can work out where there are missing statements or statements out of order). If that gets to be a problem, we could start storing state in SQLite or something.
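If that state does become the bottleneck, a rough sketch of pushing the seen-IDs set into SQLite (illustrative only, not how libcovebods stores state today):

```python
import sqlite3

# An on-disk database keeps memory use flat however many statement IDs we see.
conn = sqlite3.connect("drt_state.sqlite")
conn.execute("CREATE TABLE IF NOT EXISTS seen_statements (statement_id TEXT PRIMARY KEY)")


def record_statement(statement_id):
    conn.execute(
        "INSERT OR IGNORE INTO seen_statements (statement_id) VALUES (?)",
        (statement_id,),
    )


def already_seen(statement_id):
    row = conn.execute(
        "SELECT 1 FROM seen_statements WHERE statement_id = ?", (statement_id,)
    ).fetchone()
    return row is not None
```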
For planning purposes, it may be useful to think about potential dataset sizes here (based on these stats, which are odd-looking but OK for estimation purposes):
> So my first thought is that getting this down to the 20 seconds or so that a web request can really take before people start to get impatient is going to be very challenging.
Thanks @odscjames - yes, I think our main objective here is not pushing users with big files away from the validator. I'd be interested in thinking about a "big file" strategy that explicitly avoids validating the whole thing until users are ready. I.e. does ijson give us the capability to stream a sample of the data (say, "keep going until you hit 10 ownership-or-control statements, or until it seems pretty likely that no ownership-or-control statements are going to turn up") and then validate that sample (or warn the user that it is a bad sample)? This is basically what I used to do (minus any clever logic) when looking at bulk data from the OO register and Companies House using jq.
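ijson should make that pattern possible; a hedged sketch, with the thresholds and stopping rule picked arbitrarily for illustration:

```python
import ijson


def take_sample(path, target_ooc=10, max_statements=10_000):
    """Collect statements until we have target_ooc ownership-or-control statements,
    or give up after max_statements and flag the sample as poor."""
    sample = []
    ooc_seen = 0
    with open(path, "rb") as f:
        for i, statement in enumerate(ijson.items(f, "item")):
            sample.append(statement)
            if statement.get("statementType") == "ownershipOrControlStatement":
                ooc_seen += 1
                if ooc_seen >= target_ooc:
                    return sample, True
            if i + 1 >= max_statements:
                break
    # Not enough ownership-or-control statements turned up: warn the user.
    return sample, False
```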
@ScatteredInk there are a lot of interesting things here, and this is an issue that's come up in several standards that we work with.
The first speed challenge is around getting the data to the server: uploads from home connections are usually quite slow (it would take me ~3 minutes to upload a 300MB file from home), and even across a connection between servers at 100Mbps, it would take 24 seconds. Moving compressed files is much quicker, of course, but not all compression algorithms allow for streaming. It's also not entirely straightforward to do operations on files that are still being uploaded - it's possible (although I'm not sure about the DRT specifically!), it just has some extra complexity in principle.
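For reference, the rough arithmetic behind those transfer times (ideal throughput, ignoring protocol overhead; the home upload speed is just an assumed figure):

```python
file_size_mb = 300
megabits = file_size_mb * 8          # 2400 Mbit

home_upload_mbps = 13                # assumed typical home upload speed
server_link_mbps = 100

print(megabits / home_upload_mbps)   # ~185 s, i.e. roughly 3 minutes
print(megabits / server_link_mbps)   # 24.0 s
```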
If the data is updated relatively rarely (e.g. daily, or weekly), one option could be to routinely download the data, such as we do for the 360Giving data store, and run a service whereby stats are available on that data, rather than "live". For 360, we get the latest data every night, and we run the Data Quality Tool (ie, their DRT), and store the results for internal use. We have a forthcoming tool to allow for publishers to see their own data quality stats, as well. Obviously, with an overnight run, we can (and do) run computationally expensive processes without worrying about time - which avoids the problem described here.
Self-service is a bit trickier, but with some appropriate controls and/or manual intervention, this could be offered as a service to implementers: register an endpoint with the helpdesk to get regular monitoring? Or, trusted users can register for an account to log in and set their own endpoints?
Depending on the users and their relative technical expertise, providing advice or tooling on "preparing a sample" might be an appropriate thing to do. I'm not a huge fan of moving the burden to the user like that, but it might be a better experience overall. I think that with BODS you should just be able to do a `head -n1000` or something and still get valid data, but that may well be a step too far.
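If we did go down that road, a small helper along these lines (illustrative, not an existing tool) could produce a still-valid JSON sample without users needing `head` or `jq`:

```python
import json

import ijson


def write_sample(src_path, dest_path, n=1000):
    """Write the first n statements of a BODS JSON array out as a smaller, still-valid file."""
    sample = []
    with open(src_path, "rb") as src:
        for i, statement in enumerate(ijson.items(src, "item")):
            sample.append(statement)
            if i + 1 >= n:
                break
    with open(dest_path, "w", encoding="utf-8") as dest:
        # ijson parses numbers as Decimal; default=str is a crude way to keep
        # json.dump happy in this sketch.
        json.dump(sample, dest, indent=2, default=str)
```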
Leaving all that aside - there are always possibilities around doing a "first pass" of the data. I think that in the current DRT implementation that would probably slow down the overall response time, although we have done some work around that in our Standards Lab project that we might be able to draw from. Such a "first pass" could do things like:
...and then "fail fast", if those conditions aren't met. But, that requires a very different approach than is currently taken in the DRT. Could well be worth it, mind!
So in this morning's call I said I'd post some things here:
Rob raises a lot of interesting ideas, but I think some of them may cross over into ground the data developer library may cover, and I would suggest that some thinking about where functionality belongs could be good.
We should take a piecemeal approach to this - there are many options here and we should break them down, look at them a piece at a time, and see what we could do. With that in mind, I said I'd try to estimate two things:
1. Please wait holding screen & work done in background - the current model, where the work is done in the same request, really limits the size of file to something that can be processed in the 20 seconds or so before people get impatient. Adding a screen and background worker has two advantages:
2. Sample mode - as a first step, add a sample mode with a flag on the home page of the tool, so when you submit data you say whether it is full data or a sample. In sample mode, certain tests are not run - we wouldn't run any tests that need a full file to check properly and thus can't be checked against a sample file (e.g. a person or entity statement not used by any ownership-or-control statement). We can then provide simple instructions to people to make a sample file locally and upload it.
Jack talks about the idea of a sample mode that keeps working until it sees a certain amount of different types of statements - once we have ijson sorted out we could add this to the sample mode as a next step, but until we do I don't think this would gain much. Large files would still have high memory use.
I'll work out estimates for those two.
Later, I would guess that ijson would definitely help memory use and maybe help speed performance. I would guess that SQLite would definitely help memory use but might actually hurt speed performance. We can come back to those later.
Given the low level of usage of the BODS Data Review Tool at the moment and the fact that we can learn a lot by working more closely with implementers, I'm leaning more towards @robredpath's suggestions above about us providing some clear advice/actions for implementers to follow who try to use the DRT to analyse a large file.
Great for us to estimate the time required to make the following changes, @odscjames:
1. Update documentation on datareview.openownership.org to recommend current use of the DRT only for .json files smaller than 100MB. Prompt users to email support@openownership.org if they want help to analyse larger files. Exact language TBC. Would we need to provide different guidance for maximum recommended file sizes across the four accepted formats?
2. If a user proceeds to upload a .json file larger than 100MB, we update the Loading screen to do one of the following (decision on which to use will depend on estimates for technical work required for each):
Then, once we have put these changes in place, we can have more of a discussion about long-term aims for BODS DRT development.
Ha! You beat me to posting your comment here by two minutes, @odscjames 😀
@odscjames I just had an idea. Words to strike fear into the heart of any developer.
What about if we made a desktop app (Electron?) which just helps users prepare samples for upload to the DRT? And, maybe, does quick checks to make sure that it's worth their time (the file is actually JSON, contains at least one of each of the statements that it needs to have, etc)?
I don't know anything about BODS users but my observation when delivering training for other data standards has been that the move to the command line is a Big Deal for the majority of users. Something that they can run on the desktop with a graphical interface - even if it doesn't have any of the power of the full app - could be a useful thing to have as a first step.
@StephenAbbott - ok if @odscjames goes ahead and works on #80? While also coming up with an estimate of how long #81 would take?
@kd-ods Sure. Will check in with @odscjames over on https://github.com/openownership/cove-bods/issues/80 with next steps
> Words to strike fear into the heart of any developer. ... What about if we made a desktop app
It's alright, I can also have ideas :-) If we think that people have large local files, that uploading them is an issue (bandwidth and/or privacy worries), and that use of the CLI is an issue, we could just take libcovebods and make a GUI for it so that they can run all the checks locally. There are various Python GUI libs, and also libs that let you package your Python app into a nice installer: https://pypi.org/project/pyinstaller/
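Very roughly, and assuming we shell out to a libcovebods command line entry point rather than calling its Python API directly (the exact CLI invocation would need checking), a desktop wrapper might only need a file picker:

```python
import subprocess
import tkinter as tk
from tkinter import filedialog, scrolledtext


def run_checks():
    path = filedialog.askopenfilename(filetypes=[("JSON files", "*.json")])
    if not path:
        return
    # Assumes a libcovebods console script is on PATH; adjust to the real CLI.
    result = subprocess.run(["libcovebods", path], capture_output=True, text=True)
    output.delete("1.0", tk.END)
    output.insert(tk.END, result.stdout or result.stderr)


root = tk.Tk()
root.title("BODS local checks (sketch)")
tk.Button(root, text="Choose a BODS file...", command=run_checks).pack(padx=10, pady=10)
output = scrolledtext.ScrolledText(root, width=100, height=30)
output.pack(padx=10, pady=10)
root.mainloop()
```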
@odscjames - can I just check: the tool doesn't handle JSON Lines, does it? We say in the docs that 'JSON Lines MAY also be used when creating large files'.
> doesn't handle JSON Lines, does it
No, not at the moment. Adding that in itself should not be too much work, but it implies people can upload really big files, and that will take more work - both to process them without using too many server resources and to make sure the UI is still readable/understandable when it's showing 17,000 errors!
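For context, the parsing side of JSON Lines is simple in itself, since each line is a standalone statement; the extra work is in resource use and the reporting UI. A minimal reader sketch:

```python
import json


def iter_jsonl_statements(path):
    """Yield one BODS statement per line from a JSON Lines file."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```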
Thanks @odscjames. Will be good to check in on this with you and @kd-ods next week.
All the files published via https://bods-data.openownership.org for UK/DK/SK and the Register use JSON Lines, and I would like to understand how much more work it would be to make the changes and keep the UI understandable.
@odscjames and I have just talked through the final work on #81 and #79. @StephenAbbott - the suggestion is that we finalise and test that work then release it in the next day or two.
Beyond that, the next two tasks to take on (timing of work to be discussed) are:
We haven't yet added tickets for those two tasks.
Thanks @kd-ods - and welcome back @odscjames! Happy for you to release this work if you're happy with it, and it will be good to discuss when we can book in the JSON Lines / large files work once tickets are raised for those.
> Improve test processing and reporting for very large files
To be clear, that's for both JSON Lines and normal JSON; the work I have in mind will cover both. I can comment more on the ticket.
See links above ^ @odscjames. New tickets added.
Closing this issue following completion of work listed in https://github.com/openownership/cove-bods/pull/97
During the process of checking Latvia's initial release of beneficial ownership data from January 2021, we attempted to use the DRT to analyse how closely their data complies with BODS. But the file could not be analysed as it is too big.
No file size limit is explicitly given in the notes on this page, so the documentation should be updated to include an explanation of the limit.