I also have a LAS version 2.0 file that contains datetime channels, so I hope you will add something similar to Python's from __future__ import mechanism, making functionality defined for LAS 3.0 available in LAS 2.0 files.
~VERSION INFORMATION
VERS . 2.0 :CWLS Log ASCII Standard - VERSION 2.0
...
~CURVE INFORMATION
#MNEM .UNIT API CODE :DESCRIPTION
#---- ------ -------------- -----------------------------
TIME_1900 .d : Time Index(OLE Automation date)
TIME .s : (1s) Time(hh mm ss/dd-MMM-yyyy)
...
41725.9438268634 22:39:06/27-Mar-2014 ...
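For reference, the TIME_1900 value above is an OLE Automation date: days since 1899-12-30, with the fractional part carrying the time of day. A minimal conversion sketch (the helper name is my own invention, not anything in lasio):

```python
import datetime

# OLE Automation dates count days since 1899-12-30; the fractional part
# encodes the time of day. Hypothetical helper, not part of lasio.
OLE_EPOCH = datetime.datetime(1899, 12, 30)

def ole_to_datetime(days):
    return OLE_EPOCH + datetime.timedelta(days=days)

print(ole_to_datetime(41725.9438268634))  # 2014-03-27 22:39:06.64..., matching TIME
```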
It's about time to support datetimes and/or timestamps in the data section for LAS <= 2 files. At the moment the data array is a numpy.ndarray with a single common data type, so datetime columns won't work. Here are the options that I see:
- Read the datetimes and timestamps as floats.
- Then, after the array is reshaped back in LASFile.read(), create a structured ndarray (see the sketch below).
- Keep track of which columns are datetime/timestamp columns and which are not.
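A minimal sketch of what the structured-array step could look like, using the column names and timestamp format from the file above (the parse helper is illustrative only, not lasio code):

```python
import datetime
import numpy as np

# Build a structured ndarray with per-column dtypes; illustrative only.
def parse_time(s):
    # "hh:mm:ss/dd-MMM-yyyy", e.g. "22:39:06/27-Mar-2014"
    return datetime.datetime.strptime(s, "%H:%M:%S/%d-%b-%Y")

rows = [("41725.9438268634", "22:39:06/27-Mar-2014")]
dtype = np.dtype([("TIME_1900", "f8"), ("TIME", "datetime64[s]")])
data = np.array(
    [(float(t1900), parse_time(t)) for t1900, t in rows],
    dtype=dtype,
)
print(data["TIME"])  # ['2014-03-27T22:39:06']
```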
Update: more realistic would be re-doing the first function above so that it knows what dtype to expect in each column. Somehow we have to support wrapped files; to do that, the read_file_contents function would have to fully parse the ~Curve section(s) before tackling the data section(s):
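Something along these lines, perhaps (a loose sketch, not read_file_contents itself; the dtype rule is a made-up placeholder):

```python
import numpy as np

# Decide each column's dtype from the already-parsed ~Curve section, so the
# data reader knows what to expect per column. The rule below is a stand-in.
curves = [("TIME_1900", "d"), ("TIME", "s"), ("DEPT", "m")]  # (mnemonic, unit)

def column_dtype(mnemonic, unit):
    # pretend TIME in seconds carries formatted timestamps; the rest are floats
    if mnemonic == "TIME" and unit == "s":
        return "datetime64[s]"
    return "f8"

dtype = np.dtype([(m, column_dtype(m, u)) for m, u in curves])
```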
Obviously my preference is for option 2 😄
Update: pandas would struggle, I suspect, with wrapped files, which I'd prefer to support with the same code as unwrapped.
I like the idea of leveraging pandas for the LASv2+ support. But it appears that you are right about pandas struggling with wrapped files. There doesn't seem to be a built-in pandas solution for reading rows that span multiple lines (see this S.O. post for example).
Just spitballing here, but would it make any sense to have some simple heuristic based on the first few lines of the ~A section to determine if it's wrapped or not? Then the logic might be: if it's not wrapped it could go directly to pandas via read_csv. If it's wrapped, the data section could be "unwrapped" then sent to pandas.
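For what it's worth, a toy version of that unwrap-then-pandas flow might look like this (the unwrap helper is hypothetical and assumes whitespace-delimited values with a known curve count):

```python
import io
import pandas as pd

# Join wrapped data lines back into one physical line per record, then hand
# the whole block to pandas in a single read_csv call.
def unwrap(data_lines, n_curves):
    tokens = " ".join(data_lines).split()
    return "\n".join(
        " ".join(tokens[i:i + n_curves])
        for i in range(0, len(tokens), n_curves)
    )

wrapped = ["100.5 1.1", "2.2", "101.0 1.3", "2.4"]  # 3 curves per record
df = pd.read_csv(io.StringIO(unwrap(wrapped, n_curves=3)), sep=r"\s+", header=None)
```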
Maybe that's more of a rewrite than you'd want to do, and it's unclear if the benefits (e.g. pandas handles datatypes) would outweigh the issues that arise (for example I'm not sure how pandas could do what you do with the READ_SUBS to handle malformed data sections).
Yeah... I think either way it's a biggish job. I'm warming to using a record array. It's a good chance to do some of the LAS 3 work, like reading multiple data sections and dealing with comma-delimited data sections too.
Plus it might allow solving #227
The tricky part is keeping the memory/speed usage as it is now.
This may not be the best spot to put this comment, but just to follow up on the "to pandas or not to pandas" question: this weekend I played around a bit with adding a pandas engine for parsing the data section.
Here are some benchmarks on a 28 MB (unwrapped) LAS file, comparing the default parser and the pandas one I kludged in:
default lasio parsing
%timeit las = lasio.read('example_28MB_file.las')
4.93 s ± 111 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%memit las = lasio.read('example_28MB_file.las')
peak memory: 182.38 MiB, increment: 98.73 MiB
pandas parsing
%timeit las = lasio.read('example_28MB_file.las', engine='pandas')
347 ms ± 4.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%memit las = lasio.read('example_28MB_file.las', engine='pandas')
peak memory: 112.80 MiB, increment: 28.46 MiB
Admittedly this code is not production-ready, wouldn't pass all the tests, and doesn't deal with wrapped files! But this basic test of my unoptimized code, which reads the data section with pandas read_table and converts it to a 1-D array (as the default parser does), shows some promising gains in speed (>10x) and memory usage.
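For context, the gist of what I benchmarked is roughly this (simplified, not the actual patch):

```python
import io
import pandas as pd

# Parse the ~ASCII block with pandas, then flatten to the 1-D float array
# that the default reader builds before reshaping. Simplified stand-in.
ascii_block = "100.5 1.1 2.2\n101.0 1.3 2.4\n"
df = pd.read_table(io.StringIO(ascii_block), sep=r"\s+", header=None)
arr = df.to_numpy().reshape(-1)  # later reshaped to (n_rows, n_curves)
```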
Thanks. That is attractive. lasio is already much too slow, and I am not sure that the benefits of all the substitution code outweigh its performance cost, given that so many files are unwrapped.
@ahjulstad submitted a great PR (#149) ages ago which went down this route, but I did not merge it because it came before a major refactor of how the reader &c was set up. And I was being precious about not requiring pandas. Perhaps we should get that PR up to date - or use your engine - and then implement the unwrapped reader last.
An issue that needs doing before we start: rework the overall reading function so that all the header sections are fully parsed before even touching any of the data sections. That way we know whether they are wrapped or not, whether any columns can be expected to be non-numeric, and so on, when parsing the data section. (TLDR: fix my horrendous LASFile.read() method.)
And, if we have separate code for parsing wrapped and unwrapped data: all tests featuring the data section need to be duplicated for both wrapped and unwrapped.
+1 for pandas.read_csv as the default engine for non-wrapped files.
Reasons:
If needed for harmonization, I think it is possible to write an "unwrapper" for wrapped LAS files and then use the same read_csv function.
This has basically been implemented now in v0.30:
The 3.0 specification describes a datetime format code (p. 24). I need to implement it.