scrapinghub / python-scrapinghub

A client interface for Scrapinghub's API
https://python-scrapinghub.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Return timestamp values as datetime objects #67

Open redapple opened 7 years ago

redapple commented 7 years ago

Several Scrapinghub API endpoints accept or return timestamps, currently as UNIX timestamps in milliseconds.

It would be great to have those values as datetime.datetime objects in the results so that consumers of python-scrapinghub calls do not have to convert them for interpretation.

Passing datetime.datetime objects in methods allowing filtering on timestamps, e.g. where startts and endts arguments are supported, would be very handy too.
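
For illustration, here is the conversion consumers currently have to do by hand; the sample dict below mirrors the millisecond fields in the jobq output shown later in this thread.

```python
from datetime import datetime, timezone

# A jobq-style entry as returned today: all *_time and ts fields are
# UNIX timestamps in milliseconds.
job = {"key": "108106/13/2879", "pending_time": 1489416035850,
       "running_time": 1489416077918, "finished_time": 1489416111404,
       "ts": 1489416110749}

# Milliseconds -> datetime, done by hand by the consumer:
finished = datetime.fromtimestamp(job["finished_time"] / 1000, tz=timezone.utc)
print(finished)  # 2017-03-13 14:41:51.404000+00:00

# datetime -> milliseconds, e.g. to build values for startts/endts filters:
startts = int(datetime(2017, 3, 13, tzinfo=timezone.utc).timestamp() * 1000)
```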

redapple commented 7 years ago

@vshlapakov, @chekunkov, any thoughts on this?

chekunkov commented 7 years ago

As far as I understand, you want this to happen transparently - if we know that an endpoint has a timestamp field somewhere in the response, we should convert it to a datetime object. Several problems with that:

  1. This breaks backwards compatibility.
  2. It requires us to keep the client and backend in sync - if we add/remove/move fields we will need to update the client, and old clients will not be compatible with the new response.
  3. A timestamp field can be the entire value returned by the call - the backend supports requests that fetch the value of a single job meta key, so the whole response is the value we need to convert; or it can be part of an object, part of an object in a list, or arbitrarily nested for some specific endpoints.
  4. If users for some reason follow the SH convention and store some custom values as timestamps in milliseconds - the client won't convert those automatically.

What I can propose is to provide and document some utility functions that convert timestamps in milliseconds to/from datetime objects. This way we keep the client implementation simple and decoupled from the backend response format, and still give users a handy way to convert values to datetime objects.

Regarding accepting datetime objects as input parameters for client methods - that's feasible, I don't see any problems with it.
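
A minimal sketch of what such documented helpers could look like; the names `to_datetime`, `to_timestamp_ms` and `normalize_ts` are hypothetical, not existing library functions.

```python
from datetime import datetime, timezone

def to_datetime(ms):
    """Convert a UNIX timestamp in milliseconds to an aware UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)

def to_timestamp_ms(dt):
    """Convert a datetime to a UNIX timestamp in milliseconds."""
    return int(dt.timestamp() * 1000)

# Accepting datetime input parameters could then be a thin shim inside
# methods that take startts/endts: convert before calling the API.
def normalize_ts(value):
    return to_timestamp_ms(value) if isinstance(value, datetime) else value
```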

redapple commented 7 years ago
  1. To keep backward compatibility, one could imagine new fields with a dt suffix, like tsdt, that would give you the timestamp field converted (see the sketch after this list).
  2. client and backend will need to be in sync regardless of this feature. If fields are updated and/or added, the API would use a higher version imo.
  3. I don't understand. Can you provide examples?
  4. I don't think this will happen often, nor that users would expect the python library to adapt to their custom fields.
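
A rough illustration of point 1, assuming a fixed list of known timestamp fields; `add_dt_fields` and `TS_FIELDS` are made up for the example.

```python
from datetime import datetime, timezone

# Known millisecond fields in job metadata (illustrative list only).
TS_FIELDS = ("ts", "pending_time", "running_time", "finished_time")

def add_dt_fields(entry):
    """Add parallel '<field>dt' keys; existing millisecond fields stay untouched."""
    for field in TS_FIELDS:
        if isinstance(entry.get(field), int):
            entry[field + "dt"] = datetime.fromtimestamp(
                entry[field] / 1000, tz=timezone.utc)
    return entry
```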

I'm not particularly fond of adding an extra layer to process timestamps.

chekunkov commented 7 years ago
  3. I don't understand. Can you provide examples?

We would need to add some complex response traversal rules to determine whether the value for a given field should be changed. If we do that simply by matching field names - we risk converting values defined by the user, e.g. in spider args or in collections or in items. This check will negatively affect performance. Also I'm not sure it's even worth it - I cannot recall any API client that does such a conversion.

→ curl -u $SH_APIKEY: https://storage.scrapinghub.com/jobs/108106/13/2879/pending_time
1489416035850
→ curl -u $SH_APIKEY: https://storage.scrapinghub.com/jobs/108106/13/2879
{"_shub_worker": "k01-kumo1449", "priority": 2, "units": 1, "close_reason": "finished", "completed_by": "jobrunner", "scheduled_by": "alex", "finished_time": 1489416111404, "running_time": 1489416077918, "state": "finished", "spider_args": {"time": 1489416035.8026829}, "project": 108106, "started_by": "jobrunner", "scrapystats": {"item_scraped_count": 10000, "log_count/ERROR": 200, "response_received_count": 10000, "scheduler/dequeued": 10000, "scheduler/enqueued": 10000}, "pending_time": 1489416035850, "spider": "pipe-spider"}
→ curl -u $SH_APIKEY: https://storage.scrapinghub.com/jobq/108106/list
{"running_time": 1489416077918, "spider": "pipe-spider", "finished_time": 1489416111404, "close_reason": "finished", "state": "finished", "pending_time": 1489416035850, "ts": 1489416110749, "elapsed": 3713752302, "key": "108106/13/2879", "items": 10000, "logs": 10002, "errors": 200, "pages": 10000}
{"running_time": 1489416078377, "spider": "pipe-spider", "finished_time": 1489416111292, "close_reason": "finished", "state": "finished", "pending_time": 1489416036066, "ts": 1489416109730, "elapsed": 3713753321, "key": "108106/13/2880", "items": 10000, "logs": 10002, "errors": 200, "pages": 10000}
...
→ curl -u $SH_APIKEY: https://storage.scrapinghub.com/jobq/108106/summary
{"name": "pending", "count": 0, "summary": []}
{"name": "running", "count": 0, "summary": []}
{"name": "finished", "count": 2729, "summary": [{"close_reason": "finished", "finished_time": 1489416111404, "running_time": 1489416077918, "pending_time": 1489416035850, "spider": "pipe-spider", "state": "finished", "ts": 1489416110749, "elapsed": 3713909479, "key": "108106/13/2879", "items": 10000, "errors": 200, "logs": 10002, "pages": 10000}, {"close_reason": "finished", "finished_time": 1489416111292, "running_time": 1489416078377, "pending_time": 1489416036066, "spider": "pipe-spider", "state": "finished", "ts": 1489416109730, "elapsed": 3713910498, "key": "108106/13/2880", "items": 10000, "errors": 200, "logs": 10002, "pages": 10000}, {"close_reason": "finished", "finished_time": 1489416110060, "running_time": 1489416078590, "pending_time": 1489416036224, "spider": "pipe-spider", "state": "finished", "ts": 1489416109636, "elapsed": 3713910592, "key": "108106/13/2881", "items": 10000, "errors": 200, "logs": 10002, "pages": 10000}, {"close_reason": "finished", "finished_time": 1489416107841, "running_time": 1489416077648, "pending_time": 1489416035675, "spider": "pipe-spider", "state": "finished", "ts": 1489416106347, "elapsed": 3713913881, "key": "108106/13/2878", "items": 10000, "errors": 200, "logs": 10002, "pages": 10000}, {"close_reason": "finished", "finished_time": 1489416107473, "running_time": 1489416070932, "pending_time": 1489416035542, "spider": "pipe-spider", "state": "finished", "ts": 1489416106980, "elapsed": 3713913248, "key": "108106/13/2877", "items": 10000, "errors": 200, "logs": 10002, "pages": 10000}]}
→ curl -u $SH_APIKEY: https://storage.scrapinghub.com/spiders/108106/lastjobsummary
{"finished_time": 1486731896687, "running_time": 1486731883336, "state": "finished", "spider": "thriftpy-spider", "close_reason": "finished", "pending_time": 1486731880832, "ts": 1486731896102, "elapsed": 6398980729, "key": "108106/1/3815", "errors": 200, "logs": 10002, "items": 10000, "pages": 10000}
{"finished_time": 1479236157521, "running_time": 1479236152992, "state": "deleted", "spider": "localinfo", "close_reason": "finished", "pending_time": 1479236148322, "version": "5708615-update-reqs", "ts": 1479236156947, "elapsed": 13894719884, "key": "108106/6/3", "logs": 19, "items": 1, "pages": 1}
{"finished_time": 1489416111404, "running_time": 1489416077918, "state": "finished", "spider": "pipe-spider", "close_reason": "finished", "pending_time": 1489416035850, "ts": 1489416110749, "elapsed": 3714766082, "key": "108106/13/2879", "errors": 200, "logs": 10002, "items": 10000, "pages": 10000}
{"finished_time": 1480428359334, "running_time": 1480428342586, "state": "deleted", "spider": "amazon-spider", "close_reason": "finished", "pending_time": 1480428342340, "version": "4e615e9-satellite-add-example-from-blog-post", "ts": 1480428358901, "elapsed": 12702517930, "key": "108106/14/5", "logs": 2, "items": 2}
{"finished_time": 1483732348802, "running_time": 1483732309612, "state": "finished", "spider": "python-client-spider", "close_reason": "finished", "pending_time": 1483732151279, "ts": 1483732345152, "elapsed": 9398531679, "key": "108106/15/196", "errors": 200, "logs": 10002, "items": 10000, "pages": 10000}
  2. client and backend will need to be in sync regardless of this feature. If fields are updated and/or added, the API would use a higher version imo.

If we add new fields to the backend response we will not increase the API version - at least with our current API versioning approach. And we don't want the Python client to be tightly coupled with the backend response format, as this would significantly increase the cost of backend changes. Right now we simply decode the JSON, JSON lines or msgpack response, and I think that's what we should probably stick to.

redapple commented 7 years ago

I do see "finished_time", "running_time", "pending_time" and "ts" in many of your examples.

And we don't want the Python client to be tightly coupled with the backend response format, as this would significantly increase the cost of backend changes.

The Python client then becomes a very thin layer on top of the API's JSON responses. There are tools to generate API clients from a schema with types. These could be used to keep the Python client in sync with the API responses.

This check will negatively affect performance.

Do we know that in advance?

Also I'm not sure it's even worth it - I cannot recall any API client that does such a conversion.

Other clients not doing it is not an argument for me. I don't understand why the python client should not try to make the lives of users a bit easier.

chekunkov commented 7 years ago

IMO it's a minor improvement that requires major changes - the potential advantages of transparent conversion to datetime objects don't justify the additional complexity it would bring to the client code and its maintenance; we can spend developers' time on something more important.

redapple commented 7 years ago

Alright, I'll keep this open as a nice-to-have then.

chekunkov commented 7 years ago

I do see "finished_time", "running_time", "pending_time" and "ts" in many of your examples.

I forgot to add a comment on this. Yes, technically we can recursively check the response for the given fields and check the request URL for endpoints like /jobs/108106/13/2879/pending_time. But as I mentioned before, my main concerns are: performance (maybe not significant; we will measure it for sure if we select this for development), lack of support for new timestamp fields in old clients if we add some, and potential false positive conversions when the client converts user-defined fields with the same names - it's possible to have such field names in spider arguments, items or collections. To avoid that we would need to define a set of endpoints or fields inside the responses that shouldn't be converted. That brings the complexity I was talking about, and that's why I don't like this idea.
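
For concreteness, a sketch of what such a recursive conversion could look like, including the false positive concern; the `KNOWN_TS_FIELDS` whitelist and `convert_known_ts` helper are illustrative only, not part of the client.

```python
from datetime import datetime, timezone

# Fields SH itself stores as millisecond timestamps (illustrative whitelist).
KNOWN_TS_FIELDS = {"ts", "pending_time", "running_time", "finished_time"}

def convert_known_ts(obj):
    """Recursively convert known timestamp fields to datetime objects."""
    if isinstance(obj, dict):
        return {
            key: datetime.fromtimestamp(value / 1000, tz=timezone.utc)
            if key in KNOWN_TS_FIELDS and isinstance(value, (int, float))
            else convert_known_ts(value)
            for key, value in obj.items()
        }
    if isinstance(obj, list):
        return [convert_known_ts(item) for item in obj]
    return obj

# The caveat: a user-defined "ts" inside spider_args, items or collections
# would be converted too - the false positive described above.
job = {"key": "108106/13/2879", "pending_time": 1489416035850,
       "spider_args": {"ts": 1489416035850}}
print(convert_known_ts(job)["spider_args"]["ts"])  # converted, unintentionally
```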