Open benjello opened 9 years ago
Because we want line 2 of the Excel file to be used for the column labels, I think we need to write header = 1, because Python starts indexing from 0. Am I wrong?
On Sun, Aug 23, 2015 at 6:00 PM, Mahdi Ben Jelloul <notifications@github.com> wrote:
@antoinearnoud https://github.com/antoinearnoud shouldn't it be header = 2 instead of header = 1 on this line https://github.com/taxipp/ipp-macro-series-parser/blob/master/ipp_macro_series_parser/comptes_nationaux/cn_parser_non_tee.py#L29 ?
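To make the indexing question concrete, here is a minimal sketch of how pandas counts the header row. It uses pandas.read_csv on an in-memory string so that it runs without an Excel file; pandas.read_excel documents its header argument the same way (a 0-indexed row number). The three-line sample is made up for illustration and is not the layout of the real file.

```python
import io

import pandas as pd

# Hypothetical sheet: a title line, the labels line, then one data line.
raw = "title,,\ncode,label,1949\nA1,Output,10\n"

# header is 0-indexed, so header=1 takes the second line as column labels
# and ignores everything above it.
df = pd.read_csv(io.StringIO(raw), header=1)

print(list(df.columns))
print(len(df))
```

With header=2 the third line would be used as labels instead, so the right value depends on which line of the sheet actually holds the labels.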
For me the header should be the 3rd line (the one with the years), but I might be wrong. I am dealing with the file t_7501.xls. There might also be another issue with the index: pandas.read_excel returns a MultiIndex composed of the first two columns of the Excel file. Maybe we should be more explicit about the index columns, etc.
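As a rough sketch of what being explicit about the index could look like: both pandas.read_csv and pandas.read_excel accept an index_col argument. The sample data below is invented, and read_csv is used only so the snippet runs without an Excel file; it does not try to reproduce why t_7501.xls ends up with a MultiIndex.

```python
import io

import pandas as pd

# Hypothetical table: two identifying columns followed by yearly values.
raw = "code,label,1949,1950\nA1,Output,10,11\nB2,Imports,3,4\n"

# index_col=None keeps 'code' and 'label' as ordinary columns.
flat = pd.read_csv(io.StringIO(raw), index_col=None)

# index_col=[0, 1] asks explicitly for a MultiIndex built from the
# first two columns.
multi = pd.read_csv(io.StringIO(raw), index_col=[0, 1])

print(list(flat.columns))
print(list(multi.index.names))
```

Stating index_col either way makes the resulting shape a deliberate choice rather than something inferred from the file.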
OK, sorry, I thought we were dealing with TEE files. I think the code was written along the following lines: it parses the table without caring about the column labels (header), which is why header = 1 (but you could write header = 2). Then it drops the lines where column A of the Excel file (which corresponds to the column 'code' of the df) is empty; hence the line with the dates is deleted. The dates are added afterwards in the _dftidy method: since each file has the data from 1949 to the year of the folder (folder_year), the dates are added artificially (not read from the Excel file). Maybe this is not the cleanest way to do it. Does that make sense?
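The steps described above can be sketched roughly as follows. This is not the actual _dftidy code, which is not shown in this thread: the tidy function, its first_year parameter, and the sample rows are all hypothetical, and the sketch assumes the first two columns are 'code' and 'label' and that every remaining column is one year.

```python
import pandas as pd


def tidy(raw, folder_year, first_year=1949):
    """Hypothetical helper: label columns by year, drop rows with no code."""
    years = list(range(first_year, folder_year + 1))
    # Assumption: two identifying columns, then exactly one column per year.
    if raw.shape[1] != 2 + len(years):
        raise ValueError("unexpected number of value columns")
    df = raw.copy()
    # The years are generated, not read from the sheet.
    df.columns = ["code", "label"] + years
    # Rows with an empty 'code' (such as the line holding the dates) go away.
    return df.dropna(subset=["code"]).reset_index(drop=True)


# Made-up rows parsed without labels: a dates line, then two data lines.
raw = pd.DataFrame([
    [None, None, 1949, 1950],
    ["A1", "Output", 10, 11],
    ["B2", "Imports", 3, 4],
])

tidy_df = tidy(raw, folder_year=1950)
print(tidy_df)
```

The column-count check is there because generating the years only works if the file really covers first_year through folder_year; reading the years from the sheet would avoid that assumption.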
OK, I got it. Could you update and run the tests on your computer and tell me if everything is OK?
Tests fail. Picture attached.
@antoinearnoud please send me the picture by email to my private inbox, because I can't see it here.
I just rebased and ran the tests. I got new errors (2 failures). The message is very long, so I attached the end of it.
@antoinearnoud shouldn't it be header = 2 instead of header = 1 on this line?