kinverarity1 / lasio

Python library for reading and writing well data using Log ASCII Standard (LAS) files
https://lasio.readthedocs.io/en/latest/
MIT License

Parse dates to datetime objects #1

Closed kinverarity1 closed 2 years ago

kinverarity1 commented 10 years ago

The 3.0 specification describes a datetime format code (p. 24). I need to implement it.

VelizarVESSELINOV commented 9 years ago

I also have a LAS version 2.0 file that contains datetime channels, so I hope you will provide something like Python's from __future__ import functionality, making the datetime support defined for LAS 3.0 available in LAS 2.0 files as well.

~VERSION INFORMATION 
VERS  .    2.0                                     :CWLS Log ASCII Standard - VERSION 2.0
...
~CURVE INFORMATION
#MNEM           .UNIT                  API CODE            :DESCRIPTION
#----            ------          --------------            -----------------------------
TIME_1900       .d                                         :                                                        Time Index(OLE Automation date)
TIME            .s                                         :                                (1s)                    Time(hh mm ss/dd-MMM-yyyy)
...
41725.9438268634 22:39:06/27-Mar-2014 ...
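
For reference, the TIME_1900 channel above is an OLE Automation date: days, with a fractional part, counted from 1899-12-30. A minimal conversion sketch (not part of lasio):

```python
from datetime import datetime, timedelta

def ole_to_datetime(ole_days: float) -> datetime:
    # OLE Automation dates count days (including fractions) since 1899-12-30
    return datetime(1899, 12, 30) + timedelta(days=ole_days)

# The sample row above: 41725.9438268634 -> 22:39:06 on 27-Mar-2014
converted = ole_to_datetime(41725.9438268634)
print(converted.strftime("%H:%M:%S/%d-%b-%Y"))
```

This reproduces the TIME column in the sample data row, which confirms the two channels encode the same instant.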
kinverarity1 commented 5 years ago

It's about time to support datetimes and/or timestamps in the data section for LAS <= 2 files. At the moment the data array is a numpy.ndarray with a single common dtype, so mixed numeric and datetime columns won't work. Here are the options that I see:

  1. Use a structured ndarray with dtypes specified. Or a record array, with curve mnemonics as keys.
  2. Require pandas and use a DataFrame

Option 1

We would need to read the datetimes and timestamps as floats:

https://github.com/kinverarity1/lasio/blob/692bc590476f93710f62f7ce9bbe776b65e63c88/lasio/reader.py#L353-L366

Then after the array is reshaped back in LASFile.read(), create the structured ndarray:

https://github.com/kinverarity1/lasio/blob/692bc590476f93710f62f7ce9bbe776b65e63c88/lasio/las.py#L225-L239

We would have to keep track of which columns contain datetimes/timestamps and which do not.
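
A structured array along the lines of option 1 might look like this sketch (the mnemonics, dtypes, and values are made up for illustration):

```python
import numpy as np

# One dtype per curve, keyed by mnemonic, so datetime and float
# columns can coexist in the same array
dt = np.dtype([("DEPT", "f8"), ("TIME", "datetime64[s]"), ("GR", "f8")])
data = np.zeros(3, dtype=dt)

data["DEPT"] = [100.0, 100.5, 101.0]
data["TIME"] = np.array(
    ["2014-03-27T22:39:06", "2014-03-27T22:39:07", "2014-03-27T22:39:08"],
    dtype="datetime64[s]",
)
print(data["TIME"][0])  # columns are accessed by mnemonic, like a dict
```

The reader would build the dtype from the parsed ~Curves section before touching the data rows.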

Update: more realistically, the first function above should be reworked so that it knows which dtype to expect in each column. We also have to support wrapped files somehow. To do that, the read_file_contents function would have to fully parse the Curves section(s) before tackling the data section(s):

https://github.com/kinverarity1/lasio/blob/f369cc050bacf133354d0d1ab06f2f93f47d699c/lasio/reader.py#L224

Option 2

Obviously my preference is for option 2 😄

Update: I suspect pandas would struggle with wrapped files, which I'd prefer to support with the same code path as unwrapped files.
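
For an unwrapped data section, option 2 might look roughly like this sketch (the column names and timestamp format are assumed from the LAS 2.0 example earlier in this thread, and this is not lasio's actual API):

```python
import io
import pandas as pd

# Hypothetical unwrapped ~ASCII data section with a text timestamp column
data_section = """\
41725.9438268634 22:39:06/27-Mar-2014 101.5
41725.9438384260 22:39:07/27-Mar-2014 102.0
"""

df = pd.read_csv(io.StringIO(data_section), sep=r"\s+", header=None,
                 names=["TIME_1900", "TIME", "GR"])
# Convert the text column to proper timestamps after the fact
df["TIME"] = pd.to_datetime(df["TIME"], format="%H:%M:%S/%d-%b-%Y")
print(df.dtypes["TIME"])  # datetime64[ns]
```

pandas handles the mixed float/string columns without any substitution machinery, which is the main attraction.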

dagrha commented 5 years ago

I like the idea of leveraging pandas for the LASv2+ support. But it appears that you are right about pandas struggling with wrapped files. There doesn't seem to be any pandas built-in solution for reading rows that span multiple newlines (see this S.O. post for example).

Just spitballing here, but would it make any sense to have some simple heuristic based on the first few lines of the ~A section to determine if it's wrapped or not? Then the logic might be: if it's not wrapped it could go directly to pandas via read_csv. If it's wrapped, the data section could be "unwrapped" then sent to pandas.

Maybe that's more of a rewrite than you'd want to do, and it's unclear whether the benefits (e.g. pandas handling datatypes) would outweigh the issues that arise (for example, I'm not sure how pandas could replicate what you do with the READ_SUBS to handle malformed data sections).
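
The unwrap-then-pandas idea could be sketched like this (a toy unwrapper, not lasio's actual logic): join the wrapped lines, split into rows of one value per curve, and hand the result to read_csv.

```python
import io
import pandas as pd

def unwrap(lines, n_curves):
    """Join wrapped data lines into rows of n_curves values each (sketch)."""
    values = " ".join(lines).split()
    rows = [values[i:i + n_curves] for i in range(0, len(values), n_curves)]
    return "\n".join(" ".join(row) for row in rows)

# A wrapped section: each depth step spills onto a second line (4 curves/row)
wrapped = ["100.0 1.1 2.2", "3.3", "100.5 1.2 2.3", "3.4"]
flat = unwrap(wrapped, 4)
df = pd.read_csv(io.StringIO(flat), sep=" ", header=None,
                 names=["DEPT", "A", "B", "C"])
print(df.shape)  # (2, 4)
```

The curve count comes from the parsed ~Curves section, which is why the header must be fully parsed before the data.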

kinverarity1 commented 5 years ago

Yeah...I think either way it's a biggish job. I'm warming to using a record array. It would also be a good chance to do some of the LAS 3 work, like reading multiple data sections and handling comma-delimited data sections.

Plus it might allow solving #227

The tricky part is keeping memory usage and speed where they are now.

dagrha commented 5 years ago

This may not be the best spot to put this comment, but just to follow up on the "to pandas or not to pandas" question, this weekend I played around a bit with adding a pandas engine for parsing the data section.

Here are some benchmarks on a 28MB (unwrapped) las file, comparing the default parser and this pandas one I kluged in:

default lasio parsing

%timeit las = lasio.read('example_28MB_file.las')
4.93 s ± 111 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%memit las = lasio.read('example_28MB_file.las')
peak memory: 182.38 MiB, increment: 98.73 MiB

pandas parsing

%timeit las = lasio.read('example_28MB_file.las', engine='pandas')
347 ms ± 4.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%memit las = lasio.read('example_28MB_file.las', engine='pandas')
peak memory: 112.80 MiB, increment: 28.46 MiB

Admittedly this code is not production ready, wouldn't pass all tests, and doesn't deal with wrapped files!

But this basic test on my unoptimized code to read the data section with pandas read_table and convert it to a 1-D array (as the default parser does) shows some promising gains in speed (>10x) and memory usage.
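
The approach being benchmarked might look roughly like this (a guess at the shape of the code, not dagrha's actual engine): read the data section with pandas, then flatten to the 1-D float array the default parser produces.

```python
import io
import pandas as pd

# Hypothetical unwrapped ~ASCII data section, three curves per row
section = io.StringIO("100.0 55.1 2.31\n100.5 56.0 2.29\n")

df = pd.read_table(section, sep=r"\s+", header=None)
arr = df.to_numpy().reshape(-1)  # 1-D, row-major, like the default parser
print(arr)
```

pandas' C parser does the tokenising and float conversion in one pass, which is where the >10x speedup over line-by-line Python parsing comes from.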

kinverarity1 commented 5 years ago

Thanks. That is attractive. lasio is already much too slow. I am not sure that the benefits of all the substitution code outweigh the performance gains on offer, given that so many files are unwrapped.

@ahjulstad submitted a great PR (#149) ages ago which went down this route, but I did not merge it because it predated a major refactor of how the reader etc. was set up. And I was being precious about not requiring pandas. Perhaps we should get that PR up to date - or use your engine - and then implement the wrapped reader last.

One thing that needs doing before we start: rework the overall reading function so that all the header sections are fully parsed before touching any of the data sections. That way, when parsing a data section, we already know whether it is wrapped, whether any columns are expected to be non-numeric, and so on. (TL;DR: fix my horrendous LASFile.read() method.)

And, if we have separate code for parsing wrapped and unwrapped data: all tests featuring the data section need to be duplicated for both wrapped and unwrapped.

VelizarVESSELINOV commented 4 years ago

+1 for pandas.read_csv as the default engine for non-wrapped files.

Reasons:

  1. I like the speed performance
  2. Wrapped LAS files are rare and usually small, so there is no big performance issue
  3. I like pandas' date/time and string handling
  4. Plus, no more bugs like NULL not working after a datetime column (#261)

If needed for harmonization, I think it is possible to write an "unwrapper" for wrapped LAS files and then use the same read_csv function.

kinverarity1 commented 2 years ago

This has basically been implemented now in v0.30:

https://lasio.readthedocs.io/en/latest/data-section.html#handling-text-dates-timestamps-or-any-non-numeric-characters