transportenergy / database

Tools for accessing and maintaining the iTEM model & historical databases
https://transportenergy.rtfd.io
GNU General Public License v3.0

T001: magnitude error in raw data for CHN #32

Closed noussan closed 3 years ago

noussan commented 4 years ago

There seems to be an error in the T001 data for China: the values up to 2001 and the values from 2002 onward differ by two orders of magnitude.

khaeru commented 3 years ago

@noussan thanks for looking closely and reporting this. @hlinero and I had a look and confirmed the issue.

Because this is a first instance of handling such reports from the community, I want to give an example of a complete, detailed description of both the issue and the fix. FYI @soniayeh.

A full description:

I have transferred this issue from the transportenergy/metadata repository, which contains the cached file, to the transportenergy/database repository, which contains the processing code. The initial fix will be to modify this code, specifically the file T001.py, to:

  1. In the check() function, perform an arithmetic check to confirm that the error exists. For instance, interpolate between the 1985 and 2002 data points, then divide the interpolated values by the reported data points for 1990–2001; each ratio should be roughly 100 (say, greater than 90).
  2. In the process() function, adjust the data by multiplying by 100.
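The arithmetic check in step 1 can be sketched as follows. This is only an illustration under stated assumptions: it uses the few CHN values quoted in this thread, plain dictionaries rather than whatever data structure T001.py actually receives, and the helper names (`interpolate`, `suspect_ratios`, `has_magnitude_error`) are hypothetical, not the real check() code.

```python
# Values (million t km) copied from the CHN / T-SEA-CAB rows quoted in this thread.
# Years that the thread elides are omitted here too.
CHN = {1985: 532900, 1990: 8141, 2001: 20873, 2002: 2173300}


def interpolate(y0, v0, y1, v1, year):
    """Linear interpolation between the points (y0, v0) and (y1, v1)."""
    return v0 + (v1 - v0) * (year - y0) / (y1 - y0)


def suspect_ratios(data, lo=1985, hi=2002):
    """Interpolated / reported value for each year strictly between lo and hi."""
    return {
        year: interpolate(lo, data[lo], hi, data[hi], year) / value
        for year, value in sorted(data.items())
        if lo < year < hi
    }


def has_magnitude_error(data, threshold=90):
    """True if every suspect year is too low by roughly two orders of magnitude."""
    ratios = suspect_ratios(data)
    return bool(ratios) and all(r > threshold for r in ratios.values())


ratios = suspect_ratios(CHN)
for year, ratio in ratios.items():
    print(year, round(ratio, 1))
print("error present:", has_magnitude_error(CHN))
```

With the quoted rows, both ratios exceed the threshold of 90 (roughly 125 for 1990 and just under 100 for 2001), so the check reports the error as present.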

The longer-term fix is to notify the data provider (ITF-OECD) of the error in the data they are providing. They can then implement a fix, as & when they see fit.

The check (1) is critical because it will allow us to automatically detect when the error is, or is not, present in the input data. When the input data is corrected, then we can also remove the adjustment in the processing step (2). Without this check, the adjustment might be wrongly applied twice: first by the provider, and then again by the iTEM code, which would produce values that were erroneously too high by two orders of magnitude.
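To illustrate why the check matters, here is a minimal sketch of an adjustment that is guarded by a check, so that applying it to already-corrected data is a no-op. It makes several assumptions: the check is reduced to a crude comparison of the 2002 and 2001 values (a stand-in for the interpolation described above), the adjustment covers 1990–2001 and treats 1985 as correct, and the function name `adjust_if_needed` is hypothetical rather than taken from T001.py.

```python
# Values (million t km) copied from the CHN rows quoted in this thread.
RAW = {1985: 532900, 1990: 8141, 2001: 20873, 2002: 2173300}


def adjust_if_needed(data, factor=100, threshold=90):
    """Multiply the 1990-2001 values by `factor`, but only if the jump is present.

    The 2002/2001 ratio is a simplified stand-in for the real check (assumption).
    """
    if data[2002] / data[2001] <= threshold:
        # No jump of about two orders of magnitude: the provider has presumably
        # corrected the data already, so leave it untouched.
        return dict(data)
    return {
        year: value * factor if 1990 <= year <= 2001 else value
        for year, value in data.items()
    }


once = adjust_if_needed(RAW)
twice = adjust_if_needed(once)  # e.g. the same code run on corrected input

print(once[2001])     # adjusted once
print(twice == once)  # the guard prevents a second adjustment
```

Without the guard, a second pass would multiply the 2001 value by 100 again, which is exactly the double adjustment described above.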

khaeru commented 3 years ago

There was a comment on the PR https://github.com/transportenergy/database/pull/40#issuecomment-772065295:

The fix was to multiply 100 to data 1985-2001. But I think the fix should be dividing 100 to data 2002 and after. Can you double check?

Here are a few rows from the input data:

"CHN","China","T-SEA-CAB","Coastal shipping (national transport)","1985","1985","TONNEKM","Tonnes-kilometres","6","Millions",,,532900,,
"CHN","China","T-SEA-CAB","Coastal shipping (national transport)","1990","1990","TONNEKM","Tonnes-kilometres","6","Millions",,,8141,,
[…]
"CHN","China","T-SEA-CAB","Coastal shipping (national transport)","2001","2001","TONNEKM","Tonnes-kilometres","6","Millions",,,20873,,
"CHN","China","T-SEA-CAB","Coastal shipping (national transport)","2002","2002","TONNEKM","Tonnes-kilometres","6","Millions",,,2173300,,
[…]
"CHN","China","T-SEA-CAB","Coastal shipping (national transport)","2017","2017","TONNEKM","Tonnes-kilometres","6","Millions",,,5508400,,

@soniayeh, to be clear, the interpretation (call it “A”) of the problem was that the values up to 2001 are too low by a factor of 100 and should be multiplied by 100.

This was based on what was in the notebooks that you and @hlinero developed together.

What you're suggesting is the opposite, interpretation (B): the values from 2002 onward are too high by a factor of 100 and should be divided by 100.

Here's another row:

"USA","United States","T-SEA-CAB","Coastal shipping (national transport)","2002","2002","TONNEKM","Tonnes-kilometres","6","Millions",,,384977,,

If this is correct, U.S. coastal shipping in 2002 was 3.85 × 10¹¹ t km. Then either:
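The two readings imply very different sizes for Chinese coastal shipping relative to the U.S. figure. A minimal sketch of that arithmetic, using only the two 2002 values quoted above; the variable names are illustrative and the assignment of cases to (A) and (B) assumes that (A) treats the 2002 value as correct while (B) treats it as 100 times too high:

```python
MILLION = 1e6  # the input data are in millions of tonne-kilometres

usa_2002 = 384977 * MILLION            # USA, T-SEA-CAB, 2002
chn_2002_reported = 2173300 * MILLION  # CHN, T-SEA-CAB, 2002, as reported

# (A): the 2002 value is correct as reported (the pre-2002 values are too low).
chn_a = chn_2002_reported
# (B): the 2002 value is too high by a factor of 100.
chn_b = chn_2002_reported / 100

ratio_a = chn_a / usa_2002
ratio_b = chn_b / usa_2002

print(f"(A) China = {chn_a:.3g} t km, China/USA = {ratio_a:.2f}")
print(f"(B) China = {chn_b:.3g} t km, China/USA = {ratio_b:.3f}")
```

Under (A), China's 2002 coastal shipping would be between five and six times the U.S. value; under (B), it would be about 6% of it.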

khaeru commented 3 years ago

In Slack, @soniayeh replied:

Yes, I was thinking (B) should be more correct, but I am not sure. I wonder what the ITF colleagues think?

Over at #57, @RachelePoggi wrote:

If I understood correctly, T001 refers to coastal shipping. I think this issue has been fixed in our database. Maybe you can give a look and see if i looked at the right variable

and then @soniayeh wrote:

Yes. You are correct. The source (ITF) has corrected the data. Therefore [w]e should remove the temporary code fix in #40 to the script T001, without applying the multiplier of 100 for data up to 2001.

I can't understand from this exchange what "this issue" actually turned out to be: was it (A) or (B)? What specifically was changed to fix it?