taxipp / ipp-macro-series-parser

Parser for various series of macroeconomics variables and aggregates
GNU General Public License v3.0

Header index for non TEE files #26

Open benjello opened 9 years ago

benjello commented 9 years ago

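@antoinearnoud shouldn't it be header = 2 instead of header = 1 on this line (https://github.com/taxipp/ipp-macro-series-parser/blob/master/ipp_macro_series_parser/comptes_nationaux/cn_parser_non_tee.py#L29)?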

antoinearnoud commented 9 years ago

Because we want line 2 of the Excel file to be used for the column labels, I think we need to write header = 1, since Python starts indexing from 0. Am I wrong?
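To make the 0-based indexing concrete, here is a minimal sketch. It assumes an in-memory CSV as a stand-in for the Excel sheet (the rows below are made up, not taken from the real files) and uses pandas.read_csv, whose header argument is also a 0-based row number, in place of pandas.read_excel.

```python
import io

import pandas as pd

# Hypothetical stand-in for the first rows of a sheet (not the real file).
text = (
    "title,,\n"              # physical line 1 -> row index 0
    "col_a,col_b,col_c\n"    # physical line 2 -> row index 1
    "1949,1950,1951\n"       # physical line 3 -> row index 2
    "1,2,3\n"
)

# header=1 uses the second physical line as column labels.
df_header_1 = pd.read_csv(io.StringIO(text), header=1)

# header=2 uses the third physical line (here, the years) as column labels.
df_header_2 = pd.read_csv(io.StringIO(text), header=2)

print(list(df_header_1.columns))
print(list(df_header_2.columns))
```

With header = 1 the row of years stays in the data; with header = 2 it becomes the column labels (read as strings).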


benjello commented 9 years ago

For me the header should be the 3rd line (the one with the years), but I might be wrong. I am dealing with the file t_7501.xls. There might also be another issue with the index: pandas.read_excel returns a MultiIndex composed of the first two columns of the Excel file. Maybe we should be more explicit about the index columns, etc.
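As a rough illustration of being explicit about index columns, the sketch below again uses an in-memory CSV with made-up rows and pandas.read_csv as a stand-in; index_col is also a parameter of pandas.read_excel, but its behaviour on the real .xls files is not verified here.

```python
import io

import pandas as pd

# Hypothetical data: two descriptive columns followed by one column per year.
text = (
    "code,label,1949,1950\n"
    "B1,Production,1.0,2.0\n"
    "B2,Consumption,3.0,4.0\n"
)

# Explicitly requesting the first two columns as index gives a MultiIndex.
multi = pd.read_csv(io.StringIO(text), index_col=[0, 1])

# index_col=None keeps every column as data with a default integer index.
flat = pd.read_csv(io.StringIO(text), index_col=None)

print(type(multi.index).__name__)
print(list(flat.columns))
```

Stating index_col in the call makes the resulting index a deliberate choice rather than something inferred from the file layout.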

antoinearnoud commented 9 years ago

OK, sorry, I thought we were dealing with TEE files. I think the code was written along the following lines: it parses the table without caring about the column labels (header), which is why header = 1 (but you could write header = 2). Then it drops the lines where there is no character in column A of the Excel file (which corresponds to the column 'code' of the df); hence, the line with the dates is deleted. The dates are added afterwards in the _dftidy method: since each file has the data from 1949 to the year of the folder (folder_year), the dates are added artificially (not read from the Excel file). Maybe this is not the cleanest way to do it. Does it make sense?
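The steps described above could be sketched roughly as follows. This is not the actual _dftidy code: the function name tidy, the 'label' column, and the sample rows are assumptions made for illustration, and an in-memory CSV stands in for the Excel file.

```python
import io

import pandas as pd


def tidy(df, folder_year, first_year=1949):
    """Drop rows with an empty first column, then label the year columns."""
    # Keep only rows that have something in the first column ('code').
    out = df[df.iloc[:, 0].notna()].copy()
    # Years are generated, not read from the file.
    years = list(range(first_year, folder_year + 1))
    # Assumes two leading descriptive columns, then one column per year.
    out.columns = ["code", "label"] + years
    return out.reset_index(drop=True)


# Hypothetical sheet: the row holding the years has an empty 'code' cell.
text = (
    ",,1949,1950,1951\n"
    "B1,Production,1,2,3\n"
    "B2,Consumption,4,5,6\n"
)
raw = pd.read_csv(io.StringIO(text), header=None)
clean = tidy(raw, folder_year=1951)
print(clean)
```

The weak point of this approach is the same one raised in the comment: the year labels are only right if every file really starts in 1949 and ends at folder_year.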


benjello commented 9 years ago

OK, I got it. Could you update and run the tests on your computer and tell me whether everything is OK?


antoinearnoud commented 9 years ago

Tests fail. Picture attached.


benjello commented 9 years ago

@antoinearnoud please send me the picture by email to my private inbox, because I can't see it here.

antoinearnoud commented 9 years ago

I just rebased and ran the tests. I got new errors (2 failures). The message is very long, so I attached only the end of it.
