statisticssweden / PxWeb

Web application for disseminating statistical tables
https://www.scb.se/pxweb
Apache License 2.0
33 stars 23 forks source link

Parsing px files #482

Closed sabinem closed 1 year ago

sabinem commented 1 year ago

Question Hello I have build a parser for px file format. It works overall well, but I struggle with an important question, that I feel unable to solve. Therefore I decided to reach out to you in regards of this question on the px file format.

The keyword DATA in a px-file contains all the data in a sequencing that is determined by the variables and their values in HEADING and STUB.

Now I have two px-files:

  1. "https://www.pxweb.bfs.admin.ch/DownloadFile.aspx?file=px-x-0702000000_101"
  2. "https://www.pxweb.bfs.admin.ch/DownloadFile.aspx?file=px-x-0602000000_107"

They both have only a single variable for HEADING and multiple variables for STUB:

  1. STUB="Beobachtungseinheit","Kanton","Landwirtschaftliche Produktionszone","Groessenklasse","Betriebssystem","Betriebsform";
  2. STUB="Wirtschaftsabteilung","Ausbildungsniveau","Schwierigkeiten","Gewichtung";

Now in the file 1px-x-0602000000_107 the sequencing is done by moving forward through the variables of the STUB, which is what I expected in every such case. But it turned out, that the sequencing for px-x-0702000000_101 needs to be done by moving backward through the variables of the STUB: I found that out when comparing against the web interface of this dataset: https://www.pxweb.bfs.admin.ch/pxweb/en/px-x-0702000000_101/px-x-0702000000_101/px-x-0702000000_101.px/

So my question is: how can I see from the file what the correct sequencing of the variable is. I could not find this information in the downloaded files. And without that information, the px files can not be parsed correctly. In order for the downloads to be useful, they should be self contained and the order should be specified somewhere in the files themselves.

I would appreciate it a lot, if you could help me with this. Thanks in advance for an answer.

likp commented 1 year ago

Hi @sabinem,

The data is sequenced in the order of the variables and the values in the stub and heading. It is easiest to describe as a outer and an a inner loop. The outer is looping though the variables in the stub, while the inner is looping over the variables in the heading. You should never have to loop backwards. Look at the screenshots below. The first line of data corresponds to the first line of data in the PX file the second is corresponding to the second line of data in the PX file etc for both the 101 and 107 file. Although I might have misunderstood you question :-).

image image

sabinem commented 1 year ago

@likp Thanks so much for your answer, that is the answer I was hoping for. To me this means that there must be a bug in your serialiser:

Let's look together at this file here: "https://www.pxweb.bfs.admin.ch/DownloadFile.aspx?file=px-x-0702000000_101" Since my downloaded file looks different from you screenshots, I will take you through my file.

I have copied the lines that matter from the downloaded px file:

STUB[en]="Observation unit","Canton","Area of agricultural production","Size class UAA","Farmholding system","Farmholding form";
HEADING[en]="Year";
"Cheptel - Volailles","Cheptel - Autres animaux";
VALUES[en]("Observation unit")="Farmholdings","Employees - Total",
"Full-time employees (75% or more)","Part-time employees (50-75%)",
"Part-time employees 2 (less than 50%)","Employees - Men","Employees - Women",
"Employees - Women, Manager label","Employees - Swiss","Employees - Foreign nationals",
"Family employees","Beef cattle and cows farm","Horse and other equine farm","Sheep farm",
"Goat farm","Pig farms","Poultry farm","Farms with other animals",
"Utilised agricultural area (UAA) (in hectares)","UAA - Arable land (in hectares)",
"UAA - Grassland (in hectares)","UAA - Permanent crops (in hectares)",
"UAA - Other utilised agricultural area (in hectares)","Livestock - Beef cattle and cows",
"Livestock - Horses and other equines","Livestock - Sheep","Livestock - Goats",
"Livestock - Pigs livestock","Livestock - Poultry","Livestock - Other animals";
VALUES[en]("Canton")="Switzerland","Z¸rich","Bern / Berne","Luzern","Uri","Schwyz","Obwalden",
"Nidwalden","Glarus","Zug","Fribourg / Freiburg","Solothurn","Basel-Stadt","Basel-Landschaft",
"Schaffhausen","Appenzell Ausserrhoden","Appenzell Innerrhoden","St. Gallen",
"Graub¸nden / Grigioni / Grischun","Aargau","Thurgau","Ticino","Vaud","Valais / Wallis","Neuch‚tel",
"Geneva","Jura";
VALUES[en]("Area of agricultural production")="Area - total","Plain area","Hill area",
"Mountain area 1","Mountain area 2","Mountain area 3","Mountain area 4","Not defined";
VALUES[en]("Size class UAA")="Size class - total","less then 1 hectare","from 1 up to 3 hectares",
"from 3 up to 5 hectares","from 5 up to 10 hectares","from 10 up to 20 hectares",
"from 20 up to 30 hectares","from 30 up to 50 hectares","50 hectares and more";
VALUES[en]("Farmholding system")="Farmholding system - total","Organic farming",
"Conventional farming","Not defined";
VALUES[en]("Farmholding form")="Farmholding form - total","Full-time farm","Part time farm",
"Undecided";
VALUES[en]("Year")="2022","2021","2020","2019","2018","2017","2016","2015","2014","2013","2012",
"2011","2010","2009","2008","2007","2006","2005","2004","2003","2002","2001","2000","1999","1998",
"1997","1996","1990","1985","1980","1975";

Now if you look at the first two lines of the DATA:

DATA=
48344 48864 49363 50038 50852 51620 52263 53253 54057 55207 56575 57617 59065 60034 60894 61764 62830 63627 64466 65866 67421 68784 70537 73591 76412 77730 79479 92815 98759 104453 111302 
34368 34775 35085 35611 36335 36716 37325 38044 38841 39344 40239 40613 41434 42214 43993 45034 45431 45911 46406 47126 48146 48817 49239 "..." "..." "..." 55951 64242 69954 74107 "..." 

What you expect is that the entries in each row represent the years, that is actually true. And the first line belongs to all first values in the STUB variables: "Observation unit","Canton","Area of agricultural production","Size class UAA","Farmholding. That is actually true and correct as you can see here by comparing with the webinterface: https://www.pxweb.bfs.admin.ch/pxweb/en/px-x-0702000000_101/px-x-0702000000_101/px-x-0702000000_101.px/table/tableViewLayout2/

Screenshot 2023-08-22 at 12 50 35

The second line should still go through the years, but should now relate to the second value in the first stub: "Employees - Total":

Screenshot 2023-08-22 at 12 53 10

But it does not: it represents instead the second value in the last STUB variable: "Full-time farm" as you can see here:

Screenshot 2023-08-22 at 12 55 31

So that is actually the mistake. And you are saying it shouldn't be like this. So then I think it must be that the bug happens when the data is downloaded.

Is my explanation clear enough. Would you agree, that it is a bug in the px software serialising the px data and metadata for download? When can we expect to get this addressed and solved?

likp commented 1 year ago

I see what you mean... The way we loop through the variables in the stub that we take the first value for the first variable the first variable of the second variable the first variable on the Nth variable in the Stub. This is the "first-row". "The second-row" would be the first value for each variable in the stub except for the Nth variable which would be the second value. You could say that we go depth-first when we loop trough the variables (both for stub and heading). I am guessing that is what you meant by going backwards?

In your case "Employees - Total" is the second value for the first variable in the stub so that would correspond the the data-row 31105 given by the formula below

(count(Canton) * count(Area of agricultural production) * count(Size class UAA) * count(Farmholding system) * count(Farmholding form) + 1 = 27 * 8 * 9 * 4 * 4 + 1 = 31105)
sabinem commented 1 year ago

Yes @likp exactly:

You could say that we go depth-first when we loop trough the variables (both for stub and heading). I am guessing that is what you meant by going backwards?

yes for this px file you could say that, but then for another dataset such as this one here: https://www.pxweb.bfs.admin.ch/DownloadFile.aspx?file=px-x-0602000000_107 the sequencing is the other way around.

There are two things to this:

Thanks a lot for looking into this further.

likp commented 1 year ago

@sabinem I do not see that the is a difference in sequence between the two file.

The second data-row in px-x-0602000000_107 looks like

2.2888 2.2439 2.5333 2.082 2.2635 2.4554 2.8023 2.2033 2.1438 2.2471 2.23 2.323 2.3587 2.9314 1.6479 1.7309 1.851 1.3777 1.3832 2.6422 2.5221 2.6557 2.6651 2.5034 2.4052 2.5022 2.5648 2.5081 2.5018 2.6107 2.8766 2.6604 2.6642 2.7725 2.9098 2.6207 2.7838 2.8819 2.6302 2.5639 2.7918 2.7287 2.7589 2.8191 2.9201 2.9279 3.1763 2.9465 2.8385 3.3334 3.0256 3.1187 2.643 3.22 2.9707 2.8139 3.1458 3.0643 3.3021 3.1648 3.5448 3.2656 3.0674 3.0846 2.6731 3.7119 3.8922 3.4605 3.6662 3.1666 2.9895 2.9317 3.0249 2.7296 2.9988 2.9538 2.7161

This corresponds to

Economic division = 5-96 Total
Educational level = University degree
Difficulties = Staff found without difficulties
Weight = Weighted according to the number of businesses  <= Note! This is the second value for the last variable

This is the same for px-x-0702000000_101 where the second data-row looks like

34368 34775 35085 35611 36335 36716 37325 38044 38841 39344 40239 40613 41434 42214 43993 45034 45431 45911 46406 47126 48146 48817 49239 "..." "..." "..." 55951 64242 69954 74107 "..." 

That corresponds to

Observation unit = Farmholdings
Canton = Switzerland
Area of agricultural Production = Area - total
Size class UAA = Size class - total
Farmholding system = Farmholding system - total
Farholding form = Full-time farm <= Note! This is the second value for the last variable

I can not see the difference that you mention. The sequencing between the two files appears to be identical.

sabinem commented 1 year ago

@likp You are right. It seems that I got this wrong.

In the first file px-x-0602000000_107 for example: there is:

STUB[en]="Economic division","Education level","Difficulties","Weight"; and HEADING="Quartal";

Therefore I exploded the dimensions as "Quartal","Economic division","Education level","Difficulties","Weight" But I should have exploded them as Economic division","Education level","Difficulties",,"Weight","Quartal" and then iterated through the inner most dimensions before rotating through the outermost dimensions.

This is probably general statistical knowledge: I was looking for an explanation on this, and if you visualize the tables it gets clear:

Screenshot 2023-08-23 at 17 17 36

Therefore the bug was rather in my parser and not on your side. Which is good. Very happy about this. You can close this issue now. And thanks a lot for helping me understand. :)

likp commented 1 year ago

I am glad I could help.