jjjake / internetarchive

A Python and Command-Line Interface to Archive.org
GNU Affero General Public License v3.0

--fts ignores --parameters, --field, --sort #593

Open gingerbeardman opened 1 year ago

gingerbeardman commented 1 year ago

Hi,

I am doing ia search --parameters="..."

...but I do not know what parameters it accepts.

Is there a list or documentation anywhere?

My goal is to return a small number of results sorted by most recently "added" first.

But those do not seem to work with ia search, or maybe I am doing it wrong?

I have also tried

Any help appreciated.

Thanks!

gingerbeardman commented 1 year ago

OK, I figured it out and support seems to be missing, so I will rename the issue.

ia search 'hanafuda' --parameters rows:10 --field addeddate --sort "addeddate desc"

But...

ia search 'hanafuda' --fts --parameters rows:10 --field addeddate --sort "addeddate desc"

I am using:

jjjake commented 1 year ago

The confusion here is that ia search uses different endpoints depending on the options given. It uses the Scrape API by default, the Advanced Search API when either the rows or page parameter is specified, and our beta FTS API when either --fts or --dsl-fts is specified.

The reasoning is that the Advanced Search API is not designed for scraping or retrieving full result sets (it is capable of doing so, but it is not designed for it), whereas the Scrape API is designed for dumping full result sets. I assume that most people want full result sets when using ia search, which is why the Scrape API is the default. When a user specifies that they only want a subset of the results (i.e. via the page or rows params), Advanced Search is used.
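To make the selection rules above easier to follow, here is a minimal Python sketch of the decision. It is not the library's actual code: `choose_endpoint` and the returned labels are hypothetical names made up for illustration, and the precedence of the FTS flags over rows/page is an assumption inferred from the behaviour reported in this thread.

```python
from typing import Mapping, Optional


def choose_endpoint(params: Optional[Mapping[str, object]] = None,
                    fts: bool = False,
                    dsl_fts: bool = False) -> str:
    """Return a label for the endpoint the rules described above would pick.

    Hypothetical helper for illustration only; it does not call archive.org.
    """
    params = params or {}
    # Assumption: the FTS flags win over rows/page, which matches the
    # report that rows was ignored when --fts was given.
    if fts or dsl_fts:
        return "fts"
    # A request for a subset of results (rows or page) selects Advanced Search.
    if "rows" in params or "page" in params:
        return "advancedsearch"
    # Otherwise the full result set is dumped via the Scrape API.
    return "scrape"
```

For example, `choose_endpoint({"rows": 10})` returns `"advancedsearch"`, while adding `fts=True` to the same call returns `"fts"`.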

Then there's the FTS API. This is in beta, is not currently documented publicly, and is subject to change. The specific parameter you're after, though, is size rather than rows:

» ia search 'hanafuda' --fts --parameters size:10 | wc -l
      10

--field is not currently supported with --fts; all indexed fields are returned by default. addeddate is not returned, but publicdate is (under .fields.meta_publicdate). Sorting is not supported in the beta FTS API at this time.
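Since the FTS API does not sort, one possible workaround is to sort the returned results on the client. The sketch below is only an illustration under several assumptions: that each line of output is one JSON object, that the date sits at `fields.meta_publicdate` as described above, that it is either a string or a list of strings, and that the dates are ISO-8601 strings that sort correctly as text. The helper names `publicdate` and `newest_first` are hypothetical.

```python
import json
from typing import Iterable, List


def publicdate(result: dict) -> str:
    """Pull fields.meta_publicdate out of one result, or "" if absent."""
    value = result.get("fields", {}).get("meta_publicdate", "")
    # Assumption: the value may arrive as a list; use its first entry.
    if isinstance(value, list):
        value = value[0] if value else ""
    return str(value)


def newest_first(json_lines: Iterable[str], limit: int = 10) -> List[dict]:
    """Parse JSON lines and return up to `limit` results, newest first."""
    results = [json.loads(line) for line in json_lines if line.strip()]
    # Assumption: ISO-8601 date strings, so text order equals date order.
    results.sort(key=publicdate, reverse=True)
    return results[:limit]
```

Note that this only reorders whatever the API returned. With size:10, the ten results are not necessarily the ten newest matches overall.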

Sorry for the confusion. We hope to consolidate these endpoints in the future!

gingerbeardman commented 1 year ago

Thanks @jjjake, very informative. I'll keep an eye on progress.

It seems very wasteful to query the whole set when I only want the X most recent items (for example, any new items since the last time I ran the query). But maybe I'm overthinking it? I prefer to keep things lean and save time and electricity on this earth.

chgans commented 1 year ago

The "beta FTS API" doesn't seem to point to the right endpoint. Results from "ia search" are not the same as those from https://archive.org/search?query=... The JS on that page uses https://archive.org/services/search/beta/page_production/, which returns cleaner results.

Is there any plan to switch to that endpoint?

jjjake commented 1 year ago

@chgans be-api.us.archive.org/ia-pub-fts-api is the current recommendation from the developers of our FTS beta API. We do hope to consolidate our search endpoints in the future though. Thanks for checking!