stac-utils / pystac-client

Python client for searching STAC APIs
https://pystac-client.readthedocs.io

Simple Catalog search.items() goes on infinitely #617

Closed iliion closed 10 months ago

iliion commented 10 months ago

pystac_client version: 0.7.5

I am performing the following simple request to get some items from a catalog and this ends up in an infinite loop (?).

from pystac_client import Client
import datetime

def main():
    catalog = Client.open(url='https://earth-search.aws.element84.com/v1/')
    my_search = catalog.search(collections='cop-dem-glo-30', limit = 5)
    print(my_search.url_with_parameters())
    # prints out -> `https://earth-search.aws.element84.com/v1/search?limit=5&collections=cop-dem-glo-30`
    for item in my_search.items():
        print(item)

if __name__ == '__main__':
    main()

In the above example I would just expect the API to return 5 items per page. What I get instead are repeated requests to https://earth-search.aws.element84.com/v1/search?limit=5&collections=cop-dem-glo-30. In addition, if there are fewer results than the limit, the API keeps returning the same items over and over (and not necessarily in the same order).

TomAugspurger commented 10 months ago

I think you want max_items=5. limit comes from the STAC API spec and controls the number of items per page.
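
To make the difference concrete, here is a minimal sketch that runs without network access. `fake_pages` and `iter_items` are hypothetical helpers that stand in for a paginated STAC API and for the client's item iterator; they are not part of pystac_client. With the real client, the equivalent call would be something like catalog.search(collections='cop-dem-glo-30', limit=5, max_items=5).

```python
from itertools import islice
from typing import Iterator, List, Optional


def fake_pages(total: int, limit: int) -> Iterator[List[int]]:
    """Yield pages of at most `limit` items until `total` items are served."""
    for start in range(0, total, limit):
        yield list(range(start, min(start + limit, total)))


def iter_items(total: int, limit: int, max_items: Optional[int] = None) -> Iterator[int]:
    """Flatten pages into items; `max_items` caps the overall count."""
    items = (item for page in fake_pages(total, limit) for item in page)
    return islice(items, max_items)  # islice(it, None) means "no cap"


# limit only sets the page size: all 12 items still come back, in 3 pages.
all_items = list(iter_items(total=12, limit=5))
page_sizes = [len(p) for p in fake_pages(total=12, limit=5)]

# max_items stops the iteration after 5 items in total.
capped = list(iter_items(total=12, limit=5, max_items=5))

print(len(all_items), page_sizes, len(capped))
```

In this toy setup, limit changes how many requests are needed, while max_items changes how many items the loop yields before it stops.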

gadomski commented 10 months ago

Tom is correct: if you only want five items returned, use max_items. A couple of other things:

In the above example I would just expect the API to return 5 items per page.

It should, but to check this you need to:

for page in my_search.pages_as_dicts():
    # each page is a GeoJSON FeatureCollection dict, so count its features
    print(len(page['features']))

In this line:

print(my_search.url_with_parameters())

During paging, the search object is not updated with the paging parameters, so url_with_parameters will not change while paging. See https://github.com/stac-utils/pystac-client/blob/4ea6dac3a4cc817854e8fbcb1a9f041f079655b1/pystac_client/stac_api_io.py#L282-L312 for the relevant code.
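
As a rough illustration of that behaviour, the sketch below follows rel=next links against an in-memory stand-in for a server. It is a simplified model, not pystac-client's actual code: `fake_search`, `get_pages`, and the `ITEMS` list are hypothetical, and the token-based paging scheme is an assumption chosen for the example. The paging state travels in the link the server returns, so the original search parameters are never modified.

```python
from typing import Any, Dict, Iterator, Optional

# Hypothetical in-memory "server": three items, served one per page.
ITEMS = ["item-a", "item-b", "item-c"]


def fake_search(body: Dict[str, Any]) -> Dict[str, Any]:
    """Return one page; add a rel=next link only while more items remain."""
    start = int(body.get("token", 0))
    limit = int(body["limit"])
    page: Dict[str, Any] = {"features": ITEMS[start:start + limit], "links": []}
    if start + limit < len(ITEMS):
        next_body = {**body, "token": start + limit}
        page["links"].append({"rel": "next", "method": "POST", "body": next_body})
    return page


def get_pages(search_body: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Follow rel=next links until the server stops sending one."""
    body: Optional[Dict[str, Any]] = dict(search_body)
    while body is not None:
        page = fake_search(body)
        if not page["features"]:
            return
        yield page
        next_link = next((l for l in page["links"] if l["rel"] == "next"), None)
        body = next_link["body"] if next_link else None


search_body = {"limit": 1, "collections": ["test-collection"]}
pages = list(get_pages(search_body))
print([p["features"] for p in pages])
print(search_body)  # unchanged: the paging token only ever lived in the links
```

The loop ends only because the last page carries no next link (or no features), which is why the server's links decide when paging stops.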

iliion commented 10 months ago

OK, I understand that the search request will return all pages, that limit is the size of each page, and that I can get the number of items in each page from print(len(page['features'])).

My problem is that the requests go on infinitely when I run the above example against my own catalog. I understand that this is a bug on my side, but I can't figure out the reason. Maybe you have a clue why the requests from the client won't stop. Am I missing something in the API specification?

FYI: the API response follows the spec here (https://api.stacspec.org/v1.0.0/item-search/#tag/Item-Search)

iliion commented 10 months ago

I think I know what is wrong. pystac_client does not support paging implemented with a page=x parameter.

For the request http://localhost:20008/search?limit=2&collections=test-collection, the rel=next link will have this href: http://localhost:20008/search?limit=2&collections=test-collection&page=1

Unfortunately, the above URL is parsed and the output is the following:

{
   "rel":"next",
   "type":"application/json",
   "method":"POST",
   "href":"http://localhost:20008/search",
   "body":{
      "limit":2,
      "collections":[
         "test-collection"
      ],
      "token":1
   }
}

gadomski commented 10 months ago

Unfortunately the above url is parsed and the output is the following

I don't quite know what you mean by this. The read_text method doesn't make any assumptions about pagination -- it simply uses what the server returns: https://github.com/stac-utils/pystac-client/blob/4ea6dac3a4cc817854e8fbcb1a9f041f079655b1/pystac_client/stac_api_io.py#L128-L172

To continue debugging, can you provide the following:

iliion commented 10 months ago

My guess was read_json()

I will try to be more clear.

http://localhost:20008/search?limit=2&collections=test-collection

will output a response where the next link is like this:

{
  "rel":"next",
  "type":"application/json",
  "method":"GET",
  "href":"http://localhost:20008/search?limit=1&collections=test-collection&page=1"
}

If I run the following and print the response, I get something different:

from pystac_client import Client

catalog = Client.open(url='http://localhost:20008')
my_search = catalog.search(collections='test-collection', limit=1)

for page in my_search.pages_as_dicts():
    print(my_search.url_with_parameters())
    # -> http://localhost:20008/search?limit=1&collections=test-collection
    print(page['links'])

page['links'] then contains a next link like this:

{
   "rel":"next",
   "type":"application/json",
   "method":"POST",
   "href":"http://localhost:20008/search",
   "body":{
      "limit":2,
      "collections":[
         "test-collection"
      ],
      "token":1
   }
}

The point is that the loop will not stop.
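
To compare the two next links side by side, here is a small sketch that extracts the paging state from each. `describe_next` is a hypothetical helper written for this comparison, not a pystac_client function. It only shows where each link keeps its state: in the query string for GET, in the body for POST.

```python
from typing import Any, Dict, Optional, Tuple
from urllib.parse import parse_qs, urlsplit


def describe_next(link: Dict[str, Any]) -> Tuple[str, str, Optional[Dict[str, Any]]]:
    """Return (method, url, paging state) for a rel=next link."""
    method = link.get("method", "GET").upper()
    if method == "POST":
        # POST: the paging state is whatever the server put in "body".
        return method, link["href"], link.get("body")
    # GET: the paging state is encoded in the href's query string.
    query = parse_qs(urlsplit(link["href"]).query)
    return method, link["href"], {k: v[0] for k, v in query.items()}


get_link = {
    "rel": "next",
    "type": "application/json",
    "method": "GET",
    "href": "http://localhost:20008/search?limit=1&collections=test-collection&page=1",
}
post_link = {
    "rel": "next",
    "type": "application/json",
    "method": "POST",
    "href": "http://localhost:20008/search",
    "body": {"limit": 2, "collections": ["test-collection"], "token": 1},
}

get_state = describe_next(get_link)[2]
post_state = describe_next(post_link)[2]
print(get_state)
print(post_state)
```

The GET link pages with page=1 while the POST link pages with token=1, so it may be worth checking whether the server's POST handler actually reads the token it hands out.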


DEBUG

. . .
REQUEST 0

DEBUG:pystac_client.stac_api_io:POST http://localhost:20008/search Headers: {'User-Agent': 'python-requests/2.31.0', 'Accept-Encoding': 'gzip, deflate, br', 'Accept': '*/*', 'Connection': 'keep-alive', 'Content-Length': '60', 'Content-Type': 'application/json'} Payload: {"limit": 1, "collections": ["test-collection"], "token": 1}
send: b'POST /search HTTP/1.1\r\nHost: localhost:20008\r\nUser-Agent: python-requests/2.31.0\r\nAccept-Encoding: gzip, deflate, br\r\nAccept: */*\r\nConnection: keep-alive\r\nContent-Length: 60\r\nContent-Type: application/json\r\n\r\n'
send: b'{"limit": 1, "collections": ["test-collection"], "token": 1}'
reply: 'HTTP/1.1 200 OK\r\n'
header: date: Wed, 22 Nov 2023 16:17:30 GMT
header: server: uvicorn
header: content-length: 1509
header: content-type: application/geo+json
header: content-encoding: br
header: vary: Accept-Encoding
DEBUG:urllib3.connectionpool:http://localhost:20008 "POST /search HTTP/1.1" 200 1509

REQUEST 1

DEBUG:pystac_client.stac_api_io:POST http://localhost:20008/search Headers: {'User-Agent': 'python-requests/2.31.0', 'Accept-Encoding': 'gzip, deflate, br', 'Accept': '*/*', 'Connection': 'keep-alive', 'Content-Length': '48', 'Content-Type': 'application/json'} Payload: {"limit": 1, "collections": ["test-collection"]}
send: b'POST /search HTTP/1.1\r\nHost: localhost:20008\r\nUser-Agent: python-requests/2.31.0\r\nAccept-Encoding: gzip, deflate, br\r\nAccept: */*\r\nConnection: keep-alive\r\nContent-Length: 48\r\nContent-Type: application/json\r\n\r\n'
send: b'{"limit": 1, "collections": ["test-collection"]}'
reply: 'HTTP/1.1 200 OK\r\n'
header: date: Wed, 22 Nov 2023 16:17:33 GMT
header: server: uvicorn
header: content-length: 1509
header: content-type: application/geo+json
header: content-encoding: br
header: vary: Accept-Encoding
DEBUG:urllib3.connectionpool:http://localhost:20008 "POST /search HTTP/1.1" 200 1509
<Item id=test-item-1>
DEBUG:pystac_client.stac_api_io:POST http://localhost:20008/search Headers: {'User-Agent': 'python-requests/2.31.0', 'Accept-Encoding': 'gzip, deflate, br', 'Accept': '*/*', 'Connection': 'keep-alive', 'Content-Length': '60', 'Content-Type': 'application/json'} Payload: {"limit": 1, "collections": ["test-collection"], "token": 1}
send: b'POST /search HTTP/1.1\r\nHost: localhost:20008\r\nUser-Agent: python-requests/2.31.0\r\nAccept-Encoding: gzip, deflate, br\r\nAccept: */*\r\nConnection: keep-alive\r\nContent-Length: 60\r\nContent-Type: application/json\r\n\r\n'
send: b'{"limit": 1, "collections": ["test-collection"], "token": 1}'
reply: 'HTTP/1.1 200 OK\r\n'
header: date: Wed, 22 Nov 2023 16:17:33 GMT
header: server: uvicorn
header: content-length: 1509
header: content-type: application/geo+json
header: content-encoding: br
header: vary: Accept-Encoding
DEBUG:urllib3.connectionpool:http://localhost:20008 "POST /search HTTP/1.1" 200 1509
<Item id=test-item-1>
..  ..  .. (infinite loop).. .. .. 

gadomski commented 10 months ago

This is a problem with your server. pages_as_dicts does not modify the links attribute in any way: https://github.com/stac-utils/pystac-client/blob/4ea6dac3a4cc817854e8fbcb1a9f041f079655b1/pystac_client/item_search.py#L725-L749

Closing as not-an-issue-with-pystac-client, please re-open if you find otherwise.
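
For anyone debugging a similar server, the sketch below reproduces the symptom from the debug log under one assumed cause: a server that ignores the paging token and always advertises the same next link. `buggy_search` and `get_pages_guarded` are hypothetical names, and the repeated-request guard is only a debugging aid for this example, not something pystac-client is claimed to do.

```python
import json
from typing import Any, Dict, Iterator, Optional


def buggy_search(body: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical server that ignores "token" and always offers the same next page."""
    return {
        "features": [{"id": "test-item-1"}],
        "links": [{"rel": "next", "method": "POST", "body": {**body, "token": 1}}],
    }


def get_pages_guarded(search_body: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Follow rel=next links, but stop if the same request body comes back again."""
    seen = set()
    body: Optional[Dict[str, Any]] = dict(search_body)
    while body is not None:
        key = json.dumps(body, sort_keys=True)
        if key in seen:
            return  # the server sent us back to a request we already made
        seen.add(key)
        page = buggy_search(body)
        yield page
        next_link = next((l for l in page["links"] if l["rel"] == "next"), None)
        body = next_link["body"] if next_link else None


pages = list(get_pages_guarded({"limit": 1, "collections": ["test-collection"]}))
print(len(pages))  # prints 2: the initial request, then the token=1 request once
```

Without the guard, the loop would repeat the token=1 request forever, matching the log above. The real fix belongs on the server: the last page should omit the next link, and each next link should point at a page that differs from the current one.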