Missing or misaligned table data

amckiern commented 7 years ago

Hi,

I am doing a query on a sas table:

tbl = 'SCRAP'
libref = 'engnr_cl'
where = 'scrap_time >= '"01may2017 00:00"dt'

tempDf = s.sd2df(table=tbl, libref=libref, dsopts={'where': where})

and every row of the _SCRAPTIME column should contain a datetime, however for some reason saspy or pandas is not treating this column as a datetime column - it is treating it as an "object". It's doing the same thing for some other datetime columns also.

When I look at the same time period in Enterprise Guide I can see datetimes in every row.

Here's an excerpt from the log (I cut out some columns as there are a lot of them):

data _null_; 
set engnr_cl.scrap(where=(scrap_time >= "01may2017 00:00"dt) );
file _tomods1; 
put 'AREA'n '09'x 
'LOTID'n '09'x 
'STEP_START_TIME'n E8601DT26.6 '09'x 
'SCRAP_TIME'n E8601DT26.6 '09'x
'STEP_END_TIME'n E8601DT26.6 '09'x 
'Comments'n '09'x
'LOT_TYPE'n '09'x 
'WEEK_NO'n '09'x 
'root_cause_location'n '09'x 
'Elec_Cause'n ;
run;

The log says 59 records were read.

I have tried filling the NaN values in the _SCRAPTIME column with NaT, but it still thinks there are blank strings in some of the cells.

The Comments column contains text and each cell can contain multiple lines of text, probably separated by newline characters (but I'm not sure about that).

I tried the query below with and without the Comments column and it seemed to be okay without it, but the rows looked misaligned with the Comments column included.

s.sd2df('SCRAP', 'ENGNR_CL', {'keep':'LOTID AREA SCRAP_TIME Comments', 'where':'SCRAP_TIME >= "01may2017 00:00"dt'})

In the screenshot below, the AREA column should only contain a single word per cell but instead seems to contain text that should have been in the Comments column.

Any idea how I can get around this?

Thanks.

tomweber-sas commented 7 years ago

yeah, that looks like the parsing is getting messed up. If there are embedded newlines or tabs, that would throw this out of alignment parsing it. Let me reproduce this and see if adding the ability to specify the record and column separator characters would solve this, simply. I'm using newline and tab by default, but these could be specified for cases like this where the data contains these characters. newline and tab are fine as defaults and work for most data, but, as you see here, if they are in the data, that will throw everything off.

FriedEgg commented 7 years ago

Another option would be to avoid the delimited file and use JSON instead. PROC JSON will automatically convert embedded newlines and similar to JSON compliant substitutes, i.e. \n, which would then decode correctly with the json library in Python...

tomweber-sas commented 7 years ago

I've pushed a fix for this to master. I've added the ability to specify column and row separator characters although you should no longer actually need to. I've changed the defaults to make this more robust and verified it on all available platforms; Linux via STDIO, Windos/Linux via IOM, And MVS via IOM.

Can you pull the latest and see if you get clean results with your data? I'll add doc for the new options, but again, I don't think you will actually need them as the default should now work for you.

Thanks, Tom

tomweber-sas commented 7 years ago

@amckiern did this solve your data conversion problem? Are you having any other issues moving data between SAS and python?

Thanks, Tom

amckiern commented 7 years ago

Hi Tom, sorry, I haven't had a chance to try this out until now. It still does not appear to be working with v2.1.5. Do you have a basic example of how I can specify column and row separator characters and I will try that?

Thanks, Adrian.

tomweber-sas commented 7 years ago

Did you pull the code from the repo? I haven't built a new pip install that contains this yet. Both pip and the repo are at 2.1.5, so you can look at the source to see it you really have it. You can look at the signature of sasdata2dataframe in either IO sources (sasioiom.py or sasiostdio.py); the new one will be this:

   def sasdata2dataframe(self, table: str, libref: str ='', dsopts: dict ={}, rowsep: str = '\x01', colsep: str = '\x02', **kwargs) -> '<Pandas Data Frame object>':

which also shows you haw to specify the delimiter characters. I'm using hex 01 and 02. These are non-printable characters that wouldn't be in character data normally. They also are the same in both ASCII and EBCDIC. I tried this with newlines and tabs in a character column and it had no problem parsing the data. You can use the '\xNN' or, for characters you can type you can just code them like 'z'. I also tried this w/ 'x' and 'y' and that worked fine too (I didn't have x or y in the data but had newlines and tabs still).

Can you double check that you have this source? Thanks, Tom

amckiern commented 7 years ago

I pulled it using pip (pip install saspy==2.1.5) and it didn't seem to have the updates you mentioned so I copied it from the repo and now it's working! Thanks.

tomweber-sas commented 7 years ago

Great! Thanks

sassoftware / saspy

Missing or misaligned table data #37