mlbelobraydi / TXRRC_data_harvest

Script for accessing and organizing oil and gas well data from the Texas Railroad Commission
The Unlicense
32 stars 18 forks

Added pic_signed field type #9

Closed skylerbast closed 4 years ago

skylerbast commented 4 years ago

I've added a function to convert signed fields to Python floats.

Signed fields (PIC SX(VX)) are how the Lat / Long coordinates are stored in fields like WB-WGS84-LATITUDE, for example. The current conversion in TXRRC DataFormats.ipynb loses sig figs because it converts everything to ASCII first. This mangles the last digit, because the last byte is stored not as the EBCDIC codepage representation of the digit -- as the rest of the digits are -- but as the binary representation of the digit's value, which is then OR'ed with 0xD0 if the number is negative, 0xC0 if the number is positive, or 0xF0 if the number is unsigned (see here for more detail). The implementation in this PR will convert the EBCDIC values to floats correctly.
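To illustrate the mangling described above, here is a small sketch. It assumes Python's built-in `cp037` codec is a reasonable stand-in for the source file's EBCDIC code page (the PR does not say which code page applies), and the sample bytes are just an example field.

```python
# Example zoned-decimal field: nine EBCDIC digits plus a signed last byte.
raw = b"\xF0\xF1\xF3\xF6\xF6\xF8\xF1\xF9\xF0\xC0"

# Decoding the whole field as text turns the signed last byte into a
# punctuation character instead of a digit (cp037 is an assumed code page).
as_text = raw.decode("cp037")
print(as_text)  # 013668190{

# Reading the last byte as two nibbles recovers both sign and digit.
last = raw[-1]
sign_nibble = last >> 4   # 0xC = positive, 0xD = negative, 0xF = unsigned
digit = last & 0x0F       # the actual final digit
print(hex(sign_nibble), digit)
```

The first nine bytes decode to ordinary digit characters, which is why only the final digit appears damaged after a text conversion.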

The caveat is that the values must be passed in exactly as they are encoded in the source file. It looks like everything is currently re-encoded to ASCII before being parsed. I believe it is still possible to recover the full signed value from the ASCII text, but it is definitely more complicated than just converting from EBCDIC. In my opinion, it's generally better to parse everything as EBCDIC and only change the encoding at the end of the process. This is especially true when dealing with packed fields; the Wellbore data set does not use them as far as I can tell, but a number of other TXRRC datasets do, so it may be worth considering for extensibility's sake. That said, I haven't made any changes to that effect in this PR -- all I've done is add the function to convert a signed field to a float.
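For context on why packed fields make early re-encoding riskier, here is a rough sketch of unpacking a packed decimal (COMP-3) value from raw bytes. The function name `unpack_comp3` is hypothetical and is not part of this PR; it assumes the usual layout of two digits per byte with the sign in the final low nibble.

```python
def unpack_comp3(raw: bytes, decimals: int = 0) -> float:
    """Decode a packed decimal field (hypothetical helper, not in this PR)."""
    if not raw:
        raise ValueError("empty field")
    # Split every byte into its high and low nibble.
    nibbles = []
    for byte in raw:
        nibbles.append(byte >> 4)
        nibbles.append(byte & 0x0F)
    # The very last nibble carries the sign, not a digit.
    sign_nibble = nibbles.pop()
    if sign_nibble < 0xA or any(n > 9 for n in nibbles):
        raise ValueError("not a valid packed decimal field")
    magnitude = int("".join(str(n) for n in nibbles))
    # 0xB and 0xD mark negative values; the other sign nibbles are positive.
    sign = -1 if sign_nibble in (0xB, 0xD) else 1
    return sign * magnitude / 10 ** decimals


print(unpack_comp3(b"\x12\x34\x5D", 2))  # -123.45
```

Because each byte holds two digits, none of these bytes correspond to printable characters, so a text conversion before parsing would not be reversible in any simple way.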

It should work as follows -- using the same first few fields as the data format (just re-entered as byte strings for convenience):

    > convert_signed(b'\xF0\xF1\xF3\xF6\xF6\xF8\xF1\xF9\xF0\xC0', 7)
    > 13.66819
    > convert_signed(b'\xF1\xF0\xF2\xF3\xF6\xF8\xF5\xF6\xF0\xC1', 7)
    > 102.3685601
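For readers who want to see the decoding steps spelled out, the following is a minimal sketch of one way such a conversion could be written. It is not the code in this PR: the name `zoned_to_float` and the validation choices are assumptions made for illustration, and the second argument is taken to be the number of implied decimal places, as the examples above suggest.

```python
from decimal import Decimal

# Zone (high) nibble of the last byte -> sign multiplier.
SIGN_BY_ZONE = {0xC: 1, 0xD: -1, 0xF: 1}


def zoned_to_float(raw: bytes, decimals: int) -> float:
    """Decode an EBCDIC zoned-decimal field (hypothetical helper)."""
    if not raw:
        raise ValueError("empty field")
    digits = []
    # Every byte except the last is a plain EBCDIC digit, 0xF0 to 0xF9.
    for byte in raw[:-1]:
        if byte >> 4 != 0xF or (byte & 0x0F) > 9:
            raise ValueError(f"unexpected byte 0x{byte:02X}")
        digits.append(byte & 0x0F)
    # The last byte carries the sign in its high nibble, the digit in its low.
    last = raw[-1]
    zone = last >> 4
    if zone not in SIGN_BY_ZONE or (last & 0x0F) > 9:
        raise ValueError(f"unexpected sign byte 0x{last:02X}")
    digits.append(last & 0x0F)
    magnitude = int("".join(str(d) for d in digits))
    # Shift the implied decimal point exactly before converting to float.
    value = Decimal(magnitude).scaleb(-decimals) * SIGN_BY_ZONE[zone]
    return float(value)


print(zoned_to_float(b"\xF0\xF1\xF3\xF6\xF6\xF8\xF1\xF9\xF0\xC0", 7))  # 13.66819
print(zoned_to_float(b"\xF1\xF0\xF2\xF3\xF6\xF8\xF5\xF6\xF0\xC1", 7))  # 102.3685601
```

Using `Decimal` for the scaling step is a design choice in this sketch; it keeps the digits exact until the final conversion to a Python float.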