redapple opened this issue 7 years ago (Open)
@vshlapakov, @chekunkov, any thoughts on this?
As far as I understand you want this to happen transparently - if we know that an endpoint has a timestamp field somewhere in the response, we should convert it to a datetime object. There are several problems with that, discussed below.
What I can propose is to provide and document some utility functions that convert timestamps in milliseconds to/from datetime objects. That way we keep the client implementation simple and decoupled from the backend response format, and still give users a handy way to convert values to datetime objects.
Regarding accepting datetime objects as input parameters for client methods - that's feasible, I don't see any problems with it.
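Something along these lines (the names below are just an illustration, not a committed API):

```python
from datetime import datetime, timezone

def millis_to_datetime(ms):
    """Convert a UNIX timestamp in milliseconds to a UTC datetime."""
    return datetime.fromtimestamp(ms / 1000.0, tz=timezone.utc)

def datetime_to_millis(dt):
    """Convert a datetime (treated as UTC if naive) to a UNIX timestamp in milliseconds."""
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return int(dt.timestamp() * 1000)

# millis_to_datetime(1489416035850)
# -> datetime(2017, 3, 13, 14, 40, 35, 850000, tzinfo=timezone.utc)
```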
"dt suffix, like tsdt, that would give you the timestamp field converted. I'm not particularly fond of adding an extra layer to process timestamps."
- I don't understand. Can you provide examples?
We would need to add some complex response traversal rules to determine whether the value for a given field should be converted. If we do that simply by matching field names, we risk converting values defined by the user, e.g. in spider args, in collections or in items. This check will negatively affect performance. Also I'm not sure it's even worth it - I cannot recall any API client that does such a conversion.
→ curl -u $SH_APIKEY: https://storage.scrapinghub.com/jobs/108106/13/2879/pending_time
1489416035850
→ curl -u $SH_APIKEY: https://storage.scrapinghub.com/jobs/108106/13/2879
{"_shub_worker": "k01-kumo1449", "priority": 2, "units": 1, "close_reason": "finished", "completed_by": "jobrunner", "scheduled_by": "alex", "finished_time": 1489416111404, "running_time": 1489416077918, "state": "finished", "spider_args": {"time": 1489416035.8026829}, "project": 108106, "started_by": "jobrunner", "scrapystats": {"item_scraped_count": 10000, "log_count/ERROR": 200, "response_received_count": 10000, "scheduler/dequeued": 10000, "scheduler/enqueued": 10000}, "pending_time": 1489416035850, "spider": "pipe-spider"}
→ curl -u $SH_APIKEY: https://storage.scrapinghub.com/jobq/108106/list
{"running_time": 1489416077918, "spider": "pipe-spider", "finished_time": 1489416111404, "close_reason": "finished", "state": "finished", "pending_time": 1489416035850, "ts": 1489416110749, "elapsed": 3713752302, "key": "108106/13/2879", "items": 10000, "logs": 10002, "errors": 200, "pages": 10000}
{"running_time": 1489416078377, "spider": "pipe-spider", "finished_time": 1489416111292, "close_reason": "finished", "state": "finished", "pending_time": 1489416036066, "ts": 1489416109730, "elapsed": 3713753321, "key": "108106/13/2880", "items": 10000, "logs": 10002, "errors": 200, "pages": 10000}
...
→ curl -u $SH_APIKEY: https://storage.scrapinghub.com/jobq/108106/summary
{"name": "pending", "count": 0, "summary": []}
{"name": "running", "count": 0, "summary": []}
{"name": "finished", "count": 2729, "summary": [{"close_reason": "finished", "finished_time": 1489416111404, "running_time": 1489416077918, "pending_time": 1489416035850, "spider": "pipe-spider", "state": "finished", "ts": 1489416110749, "elapsed": 3713909479, "key": "108106/13/2879", "items": 10000, "errors": 200, "logs": 10002, "pages": 10000}, {"close_reason": "finished", "finished_time": 1489416111292, "running_time": 1489416078377, "pending_time": 1489416036066, "spider": "pipe-spider", "state": "finished", "ts": 1489416109730, "elapsed": 3713910498, "key": "108106/13/2880", "items": 10000, "errors": 200, "logs": 10002, "pages": 10000}, {"close_reason": "finished", "finished_time": 1489416110060, "running_time": 1489416078590, "pending_time": 1489416036224, "spider": "pipe-spider", "state": "finished", "ts": 1489416109636, "elapsed": 3713910592, "key": "108106/13/2881", "items": 10000, "errors": 200, "logs": 10002, "pages": 10000}, {"close_reason": "finished", "finished_time": 1489416107841, "running_time": 1489416077648, "pending_time": 1489416035675, "spider": "pipe-spider", "state": "finished", "ts": 1489416106347, "elapsed": 3713913881, "key": "108106/13/2878", "items": 10000, "errors": 200, "logs": 10002, "pages": 10000}, {"close_reason": "finished", "finished_time": 1489416107473, "running_time": 1489416070932, "pending_time": 1489416035542, "spider": "pipe-spider", "state": "finished", "ts": 1489416106980, "elapsed": 3713913248, "key": "108106/13/2877", "items": 10000, "errors": 200, "logs": 10002, "pages": 10000}]}
→ curl -u $SH_APIKEY: https://storage.scrapinghub.com/spiders/108106/lastjobsummary
{"finished_time": 1486731896687, "running_time": 1486731883336, "state": "finished", "spider": "thriftpy-spider", "close_reason": "finished", "pending_time": 1486731880832, "ts": 1486731896102, "elapsed": 6398980729, "key": "108106/1/3815", "errors": 200, "logs": 10002, "items": 10000, "pages": 10000}
{"finished_time": 1479236157521, "running_time": 1479236152992, "state": "deleted", "spider": "localinfo", "close_reason": "finished", "pending_time": 1479236148322, "version": "5708615-update-reqs", "ts": 1479236156947, "elapsed": 13894719884, "key": "108106/6/3", "logs": 19, "items": 1, "pages": 1}
{"finished_time": 1489416111404, "running_time": 1489416077918, "state": "finished", "spider": "pipe-spider", "close_reason": "finished", "pending_time": 1489416035850, "ts": 1489416110749, "elapsed": 3714766082, "key": "108106/13/2879", "errors": 200, "logs": 10002, "items": 10000, "pages": 10000}
{"finished_time": 1480428359334, "running_time": 1480428342586, "state": "deleted", "spider": "amazon-spider", "close_reason": "finished", "pending_time": 1480428342340, "version": "4e615e9-satellite-add-example-from-blog-post", "ts": 1480428358901, "elapsed": 12702517930, "key": "108106/14/5", "logs": 2, "items": 2}
{"finished_time": 1483732348802, "running_time": 1483732309612, "state": "finished", "spider": "python-client-spider", "close_reason": "finished", "pending_time": 1483732151279, "ts": 1483732345152, "elapsed": 9398531679, "key": "108106/15/196", "errors": 200, "logs": 10002, "items": 10000, "pages": 10000}
- client and backend will need to be in sync regardless of this feature. If fields are updated and/or added, the API would use a higher version imo.
If we add new fields to the backend response we will not bump the API version - at least not with our current API versioning approach. And we don't want the python client to be tightly coupled to the backend response format; that would significantly increase the cost of backend changes. Right now we simply decode the json, json lines or msgpack response, and I think that's what we should stick to.
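Roughly speaking, the client today does little more than this (an illustrative sketch, not the actual implementation):

```python
import json

def decode_jsonlines(body):
    """Decode a json-lines response body into plain dicts, leaving all values untouched."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]
```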
I do see "finished_time", "running_time", "pending_time" and "ts" in many of your examples.
And we don't want the python client to be tightly coupled to the backend response format; that would significantly increase the cost of backend changes.
The python client then becomes a very thin layer on top of the API's JSON responses. There are tools to generate API clients from a schema with types. This could be used to keep the python client in sync with the API responses.
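As a rough illustration of that idea (purely hypothetical, nothing like this exists in the client today), a schema-derived model could own the conversion of the known timestamp fields:

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

def _from_millis(value):
    """Convert a millisecond timestamp to a UTC datetime, passing None through."""
    return datetime.fromtimestamp(value / 1000.0, tz=timezone.utc) if value is not None else None

@dataclass
class JobSummary:
    key: str
    state: str
    pending_time: Optional[datetime] = None
    running_time: Optional[datetime] = None
    finished_time: Optional[datetime] = None

    @classmethod
    def from_api(cls, raw):
        # Only fields the schema knows about are converted; user data is never touched.
        return cls(key=raw["key"], state=raw["state"],
                   pending_time=_from_millis(raw.get("pending_time")),
                   running_time=_from_millis(raw.get("running_time")),
                   finished_time=_from_millis(raw.get("finished_time")))
```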
This check will negatively affect performance.
Do we know that in advance?
Also I'm not sure it's even worth it - I cannot recall any API client that does such a conversion.
Other clients not doing it is not an argument for me. I don't understand why the python client should not try to make the lives of users a bit easier.
IMO it's a minor improvement that requires major changes - the potential advantages of transparent conversion to datetime objects don't justify the additional complexity it would bring to the client code and its maintenance; we can spend developer time on something more important.
Alright, I'll keep this open as a nice-to-have then.
I do see "finished_time", "running_time", "pending_time" and "ts" in many of your examples.
I forgot to add a comment on this. Yes, technically we can recursively check the response for the given fields and also check the request URL for endpoints like /jobs/108106/13/2879/pending_time, but as I mentioned before my main concerns are: performance (maybe not significant - we will measure it for sure if we select this for development); lack of support in old clients for new timestamp fields if we add some later; and potential false-positive conversions, where the client converts user-defined fields with the same names - such field names can appear in spider arguments, items and collections, so to avoid that we would need to define a set of endpoints or fields inside the responses that shouldn't be converted. That brings the complexity I was talking about, and that's why I don't like this idea.
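To illustrate the false-positive concern, here is what a purely name-based converter would look like (hypothetical code, just to show the problem):

```python
from datetime import datetime, timezone

TIMESTAMP_FIELDS = {"pending_time", "running_time", "finished_time", "ts"}

def convert_in_place(obj):
    """Recursively convert every field whose name matches, from milliseconds to datetime."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            if key in TIMESTAMP_FIELDS and isinstance(value, (int, float)):
                obj[key] = datetime.fromtimestamp(value / 1000.0, tz=timezone.utc)
            else:
                convert_in_place(value)
    elif isinstance(obj, list):
        for item in obj:
            convert_in_place(item)

job = {
    "pending_time": 1489416035850,          # backend timestamp: conversion intended
    "spider_args": {"ts": 1479236156947},   # user-supplied value that merely shares a name
}
convert_in_place(job)
# Both values are datetimes now, although only the first one should have been touched.
```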
Several Scrapinghub API endpoints accept or return timestamps, currently as UNIX timestamps in milliseconds.
It would be great to have those values as datetime.datetime objects in the results, so that consumers of python-scrapinghub calls do not have to convert them for interpretation. Passing datetime.datetime objects in methods allowing filtering on timestamps, e.g. where startts and endts arguments are supported, would be very handy too.
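For the input side, something like this would be enough (a sketch only; startts and endts are the millisecond parameters mentioned above, and the helper name is made up):

```python
from datetime import datetime, timezone

def ts_params(start, end):
    """Build startts/endts request parameters (in milliseconds) from datetime objects."""
    def to_millis(dt):
        if dt.tzinfo is None:
            dt = dt.replace(tzinfo=timezone.utc)
        return int(dt.timestamp() * 1000)
    return {"startts": to_millis(start), "endts": to_millis(end)}

# ts_params(datetime(2017, 3, 13), datetime(2017, 3, 14))
# -> {'startts': 1489363200000, 'endts': 1489449600000}
```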