With a timestamps keyword argument, the NVSPL Accessor should take an iterable of pandas Timestamps for the seconds (as in rows) to pull out of an NVSPL Endpoint.
This will allow for the long-wanted ability to easily cross-reference SRCID back to NVSPL, among other things.
Questions
How should the times be specified?
i) Exact timestamps wanted, down to the second. So to get data from 14:20 to 14:40 on 2015-06-01, you'd give a list of the 1201 Timestamps for each second from 2015-06-01 14:20:00 through 2015-06-01 14:40:00, inclusive (not too hard with pd.date_range("2015-06-01 14:20:00", "2015-06-01 14:40:00", freq="S")). For a lot of timestamps this is both annoying and possibly not performant, but it's the most specific and maybe the easiest to implement. All the other options may simplify to this one in implementation anyway?
ii) List of desired ranges, as a list of tuples of (start, stop), where all data between the two Timestamps, inclusive, is returned. Probably easier to use, unless you want a whole lot of single non-contiguous seconds from the NVSPL.
iii) List of desired ranges, as a list of tuples of (start, duration), where start is a Timestamp and duration is a Timedelta. Supporting both ii and iii should be trivial. This version is really just a gimme for the SRCID-NVSPL task, since you could pass in the Series srcid.len with zero conversion.
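A rough sketch of the three calling conventions, and why they may all collapse to option i internally. The `to_seconds` helper here is hypothetical (not part of any existing Accessor API); it just shows that (start, duration) pairs expand into the same flat DatetimeIndex of per-second Timestamps that option i takes directly:

```python
import pandas as pd

# Option i: explicit per-second timestamps (1201 of them, endpoints inclusive)
secs = pd.date_range("2015-06-01 14:20:00", "2015-06-01 14:40:00", freq="s")

# Option ii: (start, stop) Timestamp pairs
ranges = [(pd.Timestamp("2015-06-01 14:20:00"),
           pd.Timestamp("2015-06-01 14:40:00"))]

# Option iii: (start, duration) pairs -- e.g. straight from srcid data
durations = [(pd.Timestamp("2015-06-01 14:20:00"),
              pd.Timedelta(minutes=20))]

def to_seconds(start_duration_pairs):
    """Hypothetical normalization: expand (start, duration) pairs into a
    single sorted, de-duplicated DatetimeIndex of per-second Timestamps."""
    parts = [pd.date_range(start, start + dur, freq="s")
             for start, dur in start_duration_pairs]
    idx = parts[0]
    for part in parts[1:]:
        idx = idx.union(part)  # union sorts and drops duplicates
    return idx
```

Option ii normalizes the same way, with `stop` in place of `start + dur`, which is why supporting both should be trivial.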
Is it possible to find a specific time in an NVSPL file without reading every row?
For full-hour files (3600 data rows, one per second), yes. But can you tell that a file is a full hour from the filesize alone, or would you have to (at a minimum) read the first and last rows? And even that only gives a fastpath for 3600-row files (using pd.read_csv(skiprows=...)); how would you optimize non-contiguous files, given that read_csv has no filter parameter?
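The full-hour fastpath could look something like the sketch below. Everything here is hypothetical (the helper name, the assumption of a single header row, and a `ts` column are illustrative, not the real NVSPL layout): since a full-hour file has one row per second in order, a wanted Timestamp maps directly to a row offset, and the callable form of skiprows can drop every other line without parsing it into a DataFrame first:

```python
import pandas as pd

def read_full_hour_seconds(path, hour_start, wanted):
    """Hypothetical fastpath for a known-complete hour file: one data row
    per second, in order, starting at hour_start. `wanted` is an iterable
    of Timestamps within that hour."""
    # Map each wanted Timestamp to its row offset within the hour.
    offsets = {int((ts - hour_start).total_seconds()) for ts in wanted}
    # Line 0 is the header; the data row for second s is file line s + 1.
    return pd.read_csv(
        path,
        skiprows=lambda i: i != 0 and (i - 1) not in offsets,
    )
```

For non-contiguous files this breaks down, because the skiprows callable only sees line numbers, not timestamps. The usual workaround is chunked reading (read_csv with chunksize), filtering each chunk on its timestamp column and concatenating the survivors, which at least bounds memory even though every row still gets parsed.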